Essay - Published: 2023.09.13 | csharp | selenium | webscraping |
DISCLOSURE: If you buy through affiliate links, I may earn a small commission. (disclosures)
Last week I released a short tutorial on Containerized Webscraping with F# and Selenium. This week I thought it'd be fun to do the same with C# - showcasing just how similar these two dotnet dialects can be.
In this post we'll be focused on answering the question:
Q: How to create a C# webscraper using Docker and Selenium?
In this post, I'll be sharing how I created a simple webscraper using C# and Selenium, runnable as a Docker container.
We'll go over the full setup, from the Docker infrastructure to the scraper code itself.
All source code is available in this post and HAMINIONs subscribers get access to the full project files.
At a high level, we have 3 components: a Dockerfile, a docker-compose file, and the C# app code itself.
If you read Containerized Webscraping with F# and Selenium, you'll note that most of this structure / code is very similar. That's because both run on dotnet so they get to use the same libraries under the hood.
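Concretely, the project directory ends up looking something like this (Program.cs is just the conventional name for the top-level C# file; the other names come from the Dockerfile and docker-compose.yml below):

fetch-nyt-console-cs/
├── fetch-nyt-console-cs.csproj
├── Program.cs
├── Dockerfile
├── docker-compose.yml
└── ScreenshotsOut/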
Dockerfile
The Dockerfile builds our project using the dotnet SDK image, then runs it in a layer based on Selenium's standalone Chrome image (selenium/standalone-chrome).
Dockerfile
# **Build Project**
# https://hub.docker.com/_/microsoft-dotnet
FROM mcr.microsoft.com/dotnet/sdk:7.0 AS build
EXPOSE 80
WORKDIR /source
# Copy csproj and restore all dependencies
COPY ./*.csproj ./
RUN dotnet restore
# Copy source code and build / publish app and libraries
COPY . .
RUN dotnet publish -c release -o /app --self-contained -r linux-x64
# **Run project**
# Create new layer with Selenium.Chrome
FROM selenium/standalone-chrome
WORKDIR /app
# Copy and run code
COPY --from=build /app .
ENTRYPOINT ["sudo", "./fetch-nyt-console-cs"]
Docker-compose
The main thing we're using docker-compose for is to configure our volumes. A volume attaches a folder on our local machine to a folder inside the container, which is useful when you want to share files across the container boundary. For our use case, that means screenshots taken inside the container get saved to our local filesystem, where we can access them later.
While we're at it, we name our container so we don't need to pass tag flags to the Docker CLI.
docker-compose.yml
version: "3"
services:
fetch-nyt-console-cs:
build:
context: ./
dockerfile: ./Dockerfile
container_name: fetch-nyt-console-cs
volumes:
- ./ScreenshotsOut:/usr/src/app/ScreenshotsOut
With both our Docker and Docker Compose files we can run our whole app (from downloading and installing dependencies to building and running our app code) with a simple command:
docker-compose down --remove-orphans && docker-compose build && docker-compose up
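If your Docker install ships the newer Compose plugin instead of the standalone docker-compose binary, the same pipeline is spelled with a space:
docker compose down --remove-orphans && docker compose build && docker compose up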
Now that all of our infrastructure is configured in code, we can focus on the actual app logic - scraping the New York Times website.
Prerequisites: the Selenium.WebDriver and Selenium.Support NuGet packages, which provide the OpenQA.Selenium namespaces used below.
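If you're starting the project from scratch, the standard dotnet CLI commands will scaffold it and pull those packages in (the project name here just matches the container name used above):

dotnet new console -n fetch-nyt-console-cs
cd fetch-nyt-console-cs
dotnet add package Selenium.WebDriver
dotnet add package Selenium.Support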
Our code navigates to the New York Times homepage, dismisses the compliance overlay if it's present, takes a screenshot, then grabs all article titles from h3 tags (most of their titles seem to be in h3) and prints them out.
using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support;
Console.WriteLine("Running C# Webscraper");
// Create driver
var options = new ChromeOptions();
options.AddArguments(
new List<string> {
"--verbose",
"--headless",
"--disable-dev-shm-usage"
}
);
var driver = new ChromeDriver(options);
// Navigate to webpage
driver
.Navigate()
.GoToUrl("https://www.nytimes.com/");
Console.WriteLine($"Title: {driver.Title}");
// Deal with compliance overlay
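// FindElements (plural) returns an empty collection instead of throwing when nothing matches, so we can probe for the overlay safely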
var complianceOverlayElements = driver
.FindElements(
By.Id("complianceOverlay")
);
var isComplianceOverlayPresent = complianceOverlayElements.Count > 0;
if(isComplianceOverlayPresent) {
complianceOverlayElements[0]
.FindElement(
By.TagName("button"))
.Click();
}
// Take Screenshot
var screenshot = driver
.GetScreenshot();
screenshot
.SaveAsFile(
$"/usr/src/app/ScreenshotsOut/{Guid.NewGuid().ToString()}.png",
ScreenshotImageFormat.Png
);
// Get all article titles
var allArticleTitles = driver
.FindElements(
By.TagName("h3")
).Select(e => e.Text)
.Where(t => t.Length > 0)
.ToList();
allArticleTitles.ForEach(t => Console.WriteLine(t));
There you have it - simple containerized webscraping with C# and Selenium.
If you liked this, you might be interested in Containerized Webscraping with F# and Selenium.
The best way to support my work is to like / comment / share for the algorithm and subscribe for future updates.