Containerized Webscraping with C# and Selenium
Date: 2023-09-13 | csharp | selenium | webscraping
Last week I released a short tutorial on Containerized Webscraping with F# and Selenium. This week I thought it'd be fun to do the same with C# - showcasing just how similar these two dotnet languages can be.
In this post we'll be focused on answering the question:
Q: How do I create a C# webscraper using Docker and Selenium?
Answer
In this post, I'll be sharing how I created a simple webscraper using C# and Selenium, runnable as a Docker container.
This project:
- Builds a container with all necessary Selenium dependencies
- Builds and runs the C# code in the container
- Scrapes the New York Times website for article titles and takes a screenshot
We'll go over:
- How this works from a high level
- Setting up Selenium's dependencies (and containerizing them)
- Controlling Selenium from C#
All source code is available in this post and HAMINIONs subscribers get access to the full project files.
How this Works
At a high level, we have 3 components.
- Docker / Docker-compose: Used to create the environment we need to run our code successfully (read: Infrastructure as Code)
- Selenium: The package we're using to create and control a web browser instance
- C# Project: Where we'll write our application code
If you read Containerized Webscraping with F# and Selenium, you'll note that most of this structure / code is very similar. That's because both run on dotnet so they get to use the same libraries under the hood.
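For reference, here's the rough project layout I'm assuming throughout this post (the exact file names don't matter - they just need to line up with the executable name referenced in the Dockerfile below):
fetch-nyt-console-cs/
├── Dockerfile
├── docker-compose.yml
├── fetch-nyt-console-cs.csproj
├── Program.cs
└── ScreenshotsOut/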
Setting up Selenium
- Docker: Pulls in the official Selenium image (which includes browser and WebDriver configuration), then builds and runs our application code to control Selenium
- Docker-compose: Used to configure our Docker container so we don't have to deal with CLI args (read: Infrastructure as Code)
Dockerfile
- Builds our C# project using the dotnet SDK
- Creates a standalone executable targeting linux-x64 (most Docker base images run Linux on x64)
- Pulls the official Selenium Chrome image (selenium/standalone-chrome)
- Note: If you want to use a different browser like Firefox or Edge, this is where you change that
- Copies the C# executable into this container layer and runs it
Dockerfile
# **Build Project**
# https://hub.docker.com/_/microsoft-dotnet
FROM mcr.microsoft.com/dotnet/sdk:7.0 AS build
EXPOSE 80
WORKDIR /source
# Copy csproj and restore all dependencies
COPY ./*.csproj ./
RUN dotnet restore
# Copy source code and build / publish app and libraries
COPY . .
RUN dotnet publish -c release -o /app --self-contained -r linux-x64
# **Run project**
# Create new layer with Selenium.Chrome
FROM selenium/standalone-chrome
WORKDIR /app
# Copy and run code
COPY --from=build /app .
ENTRYPOINT ["sudo", "./fetch-nyt-console-cs"]
Docker-compose
The main thing we're using docker-compose for is configuring our volumes - this attaches a folder from our local computer to a folder inside the container, which is useful if you want to share files across the container boundary. For our use case, we want this so screenshots taken inside the container get saved to our local filesystem where we can access them later.
While we're at it, we name our container so we don't need to pass naming flags to the Docker CLI.
docker-compose.yml
version: "3"
services:
fetch-nyt-console-cs:
build:
context: ./
dockerfile: ./Dockerfile
container_name: fetch-nyt-console-cs
volumes:
- ./ScreenshotsOut:/usr/src/app/ScreenshotsOut
With both our Docker and Docker Compose files we can run our whole app (from downloading and installing dependencies to building and running our app code) with a simple command:
docker-compose down --remove-orphans && docker-compose build && docker-compose up
- Stops any existing containers and removes them
- Rebuilds the entire image (takes longer, but removes any potential stale cache issues)
- Spins up new containers
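For comparison, here's roughly the same workflow using the plain Docker CLI - note that the volume mount and container name now have to be passed as flags on every run (a sketch, assuming the project layout above):
docker build -t fetch-nyt-console-cs .
docker run --rm --name fetch-nyt-console-cs \
  -v "$(pwd)/ScreenshotsOut:/usr/src/app/ScreenshotsOut" \
  fetch-nyt-console-cs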
Webscraping with C#
Now that all of our infrastructure is configured in code, we can focus on the actual app logic - scraping the New York Times website.
Prerequisites:
- nuget: Selenium.WebDriver
- nuget: Selenium.WebDriver.ChromeDriver
- Note: If you're using a browser other than Chrome (e.g. Firefox or Edge), make sure to get the corresponding WebDriver package
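If you're recreating the project from scratch, the csproj only needs to reference those two packages. Here's a minimal sketch (the assembly name and package versions are placeholders - pin whatever versions are current for you):
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <!-- Matches the executable name the Dockerfile's ENTRYPOINT expects -->
    <AssemblyName>fetch-nyt-console-cs</AssemblyName>
  </PropertyGroup>
  <ItemGroup>
    <!-- Core Selenium .NET bindings -->
    <PackageReference Include="Selenium.WebDriver" Version="4.*" />
    <!-- Bundles a chromedriver binary with the build output -->
    <PackageReference Include="Selenium.WebDriver.ChromeDriver" Version="116.*" />
  </ItemGroup>
</Project>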
Our code:
- Opens the Selenium packages we'll be using
- Creates a Chrome webdriver with a few options (these options seemed to work the best from my research / experimentation)
- Navigates to the New York Times website
- Deals with the pop up if it exists (otherwise we can't see the full webpage)
- Takes a screenshot
- Searches the page for all h3 tags (most of their titles seem to be in h3) and prints them out
using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
Console.WriteLine("Running C# Webscraper");
// Create driver
var options = new ChromeOptions();
options.AddArguments(
new List<string> {
"--verbose",
"--headless",
"--disable-dev-shm-usage"
}
);
var driver = new ChromeDriver(options);
// Navigate to webpage
driver
.Navigate()
.GoToUrl("https://www.nytimes.com/");
Console.WriteLine($"Title: {driver.Title}");
// Deal with compliance overlay
var complianceOverlayElements = driver
.FindElements(
By.Id("complianceOverlay")
);
var isComplianceOverlayPresent = complianceOverlayElements.Count > 0;
if(isComplianceOverlayPresent) {
complianceOverlayElements[0]
.FindElement(
By.TagName("button"))
.Click();
}
// Take Screenshot
var screenshot = driver
.GetScreenshot();
screenshot
.SaveAsFile(
$"/usr/src/app/ScreenshotsOut/{Guid.NewGuid().ToString()}.png",
ScreenshotImageFormat.Png
);
// Get all article titles
var allArticleTitles = driver
.FindElements(
By.TagName("h3")
).Select(e => e.Text)
.Where(t => t.Length > 0)
.ToList();
allArticleTitles.ForEach(t => Console.WriteLine(t));
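One caveat: FindElements returns whatever happens to be in the DOM at that instant, so on a slow page load you can end up with zero titles. If you run into that, an explicit wait before scraping is the usual fix. Here's a sketch using WebDriverWait - note it lives in the OpenQA.Selenium.Support.UI namespace, which comes from the separate Selenium.Support nuget package, and the 10-second timeout is an arbitrary choice:
using OpenQA.Selenium.Support.UI;

// Block until at least one h3 is present (or 10 seconds pass, which throws)
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => d.FindElements(By.TagName("h3")).Count > 0);

// Now the same title-scraping code as above will see the loaded content
var allArticleTitles = driver
    .FindElements(By.TagName("h3"))
    .Select(e => e.Text)
    .Where(t => t.Length > 0)
    .ToList();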
Next Steps
There you have it - simple containerized webscraping with C# and Selenium.
- Full C# / Selenium Webscraping project files
- Available to all HAMINIONs subscribers
Want more like this?
The best / easiest way to support my work is by subscribing for future updates and sharing with your network.