Webscraping with F# and Selenium

Date: 2023-09-06 | create | tech | fsharp | selenium | dotnet |

I've been working on a project that scrapes and compiles data from various webpages on an interval. I wanted to do this with F#, but there wasn't much information available on how, and the various simple approaches I tried failed.

In this post we're going to try to answer the question:

Q: How do I get basic webscraping working with F#?

Answer

For this solution, I built an F# script that uses Selenium to navigate to the New York Times homepage (a digital newspaper), take a screenshot of the page, and parse / output all article titles.

We'll dive into how this works, including covering webscraping essentials like:

  • Spinning up a browser-like environment
  • Driving the browser with F#
  • Taking screenshots of webpages
  • Parsing webpage elements and extracting information

Solution Overview

The full solution runs a simple F# script in a Docker container orchestrated with Docker Compose.

  • Docker Compose - Used to orchestrate container creation. Useful here because we make our local filesystem available to the container to store its screenshots, which requires extra configuration - I prefer static configuration over CLI args for determinism.
  • Docker - The container that installs all necessary webscraping dependencies, builds our code, and runs the solution. Useful as it's theoretically deterministic across machines with Docker installed.
  • F# - Drives the Selenium webscraper.

Selenium for webscraping - There are many libraries available for webscraping in the dotnet ecosystem. I chose Selenium because it is feature-rich, well-supported, and popular across languages.

The containerization may seem like overkill - and it is - but I've found that the extra effort pays dividends in more deterministic script runs across machines. You can do all of this without a container, but in my experience setting up the Selenium drivers and browsers manually is a pain.
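
If you do want to go the no-container route, recent Selenium releases (4.6+) ship with Selenium Manager, which can usually resolve a matching chromedriver on its own as long as Chrome is installed locally. Here's a minimal sketch of that route as a standalone F# script - the scrape.fsx name is just a placeholder, and this isn't part of the containerized project:

// scrape.fsx - local, non-container sketch; assumes Chrome is installed on this machine.
// Selenium 4.6+ bundles Selenium Manager, which should fetch a matching chromedriver.
#r "nuget: Selenium.WebDriver, 4.11.0"

open OpenQA.Selenium.Chrome

let options = ChromeOptions()
options.AddArguments([ "--headless"; "--disable-dev-shm-usage" ])

let driver = new ChromeDriver(options)

driver.Navigate().GoToUrl("https://www.nytimes.com/")
printfn "Title: %s" driver.Title

// Shut the browser down so no Chrome processes linger
driver.Quit()

Run it with dotnet fsi scrape.fsx. The rest of this post sticks with the container so the only thing the host needs installed is Docker.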

Docker Container

The Docker container handles installing dependencies and running our project.

In particular, it handles:

  • Setting up Selenium and its required browser / driver dependencies by pulling the official Selenium Chrome image (if you want to use a different browser such as Firefox or Edge, this is where you'd change it; a sketch of the matching F#-side change appears after this list)
  • Building the F# project
  • Running the F# project
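
As an aside, if you swapped the run layer to something like the selenium/standalone-firefox image, the driver setup in the F# script (shown later in Program.fs) would need to change to match. A rough sketch of what the Firefox variant might look like - not code from this project:

// Hypothetical Firefox equivalent of the Chrome driver setup used later in Program.fs.
// The selenium/standalone-firefox image ships Firefox plus geckodriver on PATH,
// so FirefoxDriver can locate the driver without a separate driver package.
open OpenQA.Selenium.Firefox

let options = FirefoxOptions()
options.AddArguments([ "--headless" ])

let driver = new FirefoxDriver(options)

The Selenium.WebDriver.ChromeDriver package reference in the fsproj would also no longer be needed.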

Dockerfile

# **Build Project**
# https://hub.docker.com/_/microsoft-dotnet
FROM mcr.microsoft.com/dotnet/sdk:7.0 AS build
EXPOSE 80

WORKDIR /source

# Copy fsproj and restore all dependencies
COPY ./*.fsproj ./
RUN dotnet restore

# Copy source code and build / publish app and libraries
COPY . .
RUN dotnet publish -c release -o /app --self-contained -r linux-x64

# **Run project**
# Create new layer with Selenium.Chrome
FROM selenium/standalone-chrome
WORKDIR /app

# Copy and run code
COPY --from=build /app .
ENTRYPOINT ["sudo", "./fetch-webpage-console"]

Highlights:

  • We first build our F# project and export it as a standalone executable so that we can run it in the selenium/standalone-chrome layer, which has our webscraping dependencies

For more information, read Run F# / .NET in Docker (Console App)

We use docker-compose to encode more of our infrastructure as code. It's useful here because the project assumes it has access to a filesystem for storing its images. We could pass the volume in as part of the docker CLI command (a -v flag on docker run), but that's easy to forget, so formalizing it in docker-compose makes runs more deterministic and less likely to miss a flag.

docker-compose.yml

version: "3"
services:
  fetch-nyt-console:
    build: 
      context: ./
      dockerfile: ./Dockerfile
    container_name: fetch-nyt-console
    volumes:
      - ./ScreenshotsOut:/usr/src/app/ScreenshotsOut

Highlights:

  • We mostly need this for the volumes section, which makes sure the screenshots get saved somewhere we can find them later (our local filesystem). With this in place, running docker compose up --build from the project root builds the image and runs the scraper.

F# Project

Our F# project does a few things:

  • Create Driver - Creates a ChromeDriver which allows it to spin up and control a Google Chrome browser
  • Go to webpage - Drives the browser to the New York Times website and prints out the title to sanity check it was successful
  • Deal with popup - Checks whether a compliance overlay is present and clicks through it if so. This popup started appearing on the site, and it's a handy example of handling unexpected page elements
  • Take Screenshot - Takes a screenshot of the webpage and saves it to the directory in the container that our local filesystem volume is mounted to
  • Parse all titles - Finds all article titles on the page via their h3 tags and prints them out

Program.fs

open System
open OpenQA.Selenium
open OpenQA.Selenium.Chrome
open OpenQA.Selenium.Support

printfn "Running webscraper"

// Create driver

let options = ChromeOptions()

options.AddArguments([
    "--verbose";
    "--headless";
    "--disable-dev-shm-usage"
])

let driver = new ChromeDriver(options)

// Navigate to webpage

driver.Navigate().GoToUrl("https://www.nytimes.com/")
printfn "Title: %A" driver.Title

// Deal with compliance overlay

let complianceOverlayElements = driver.FindElements(By.Id("complianceOverlay"))

let isComplianceOverlayPresent = 
    complianceOverlayElements.Count > 0

if isComplianceOverlayPresent then
    complianceOverlayElements[0]
        .FindElement(By.TagName("button"))
        .Click()

// Take Screenshot

let screenshot = driver.GetScreenshot()

screenshot.SaveAsFile(
    $"/usr/src/app/ScreenshotsOut/{Guid.NewGuid().ToString()}.png",
    ScreenshotImageFormat.Png)

// Get all article titles

let allArticleTitles =
    driver.FindElements(By.TagName("h3"))
    |> Seq.map (fun e -> e.Text)
    |> Seq.filter (fun t -> t.Length > 0)
    |> Seq.toList

printfn "AllArticleTitles: %A" allArticleTitles
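
Two rough edges worth flagging in the script above: the overlay check runs immediately after navigation, so a slow-loading popup could be missed, and the browser is never explicitly shut down. A minimal hardening sketch, assuming the same driver as above and using WebDriverWait from OpenQA.Selenium.Support.UI (depending on your Selenium version this may require the separate Selenium.Support package):

// Hypothetical tweaks - not part of the original script.
open OpenQA.Selenium.Support.UI

// Wait up to 5 seconds for the compliance overlay rather than checking
// immediately after navigation; treat a timeout as "no overlay present".
let overlayAppeared =
    try
        let wait = WebDriverWait(driver, TimeSpan.FromSeconds(5.0))
        wait.Until(fun (d: IWebDriver) -> d.FindElements(By.Id("complianceOverlay")).Count > 0)
    with
    | :? WebDriverTimeoutException -> false

if overlayAppeared then
    driver
        .FindElement(By.Id("complianceOverlay"))
        .FindElement(By.TagName("button"))
        .Click()

// Shut the browser down once scraping is finished so the Chrome process doesn't linger
driver.Quit()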

fetch-webpage-console.fsproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <RootNamespace>fetch_webpage_console</RootNamespace>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="Program.fs" />
  </ItemGroup>

  <ItemGroup>
    <PackageReference Include="Selenium.WebDriver" Version="4.11.0" />
    <PackageReference Include="Selenium.WebDriver.ChromeDriver" Version="115.0.5790.17000" />
  </ItemGroup>

</Project>

Next Steps

That's everything I did to make a simple webscraper using F# and Selenium!

HAMINIONs members can browse the full webscraping project code.

If you want to learn more about building with F#, check out Build a simple F# web API with Giraffe.

Want more like this?

The best / easiest way to support my work is by subscribing for future updates and sharing with your network.