Webscraping with F# and Selenium
Date: 2023-09-06 | create | tech | fsharp | selenium | dotnet |
I've been working on a project that scrapes and compiles data from various webpages on an interval. I wanted to do this with F#, but I couldn't find much information on how, and the simple approaches I tried failed.
In this post we're going to try to answer the question:
Q: How to get basic webscraping working with F#?
Answer
For this solution, I built an F# script that uses Selenium to navigate to the New York Times homepage (a digital newspaper), take a screenshot of the page, and parse / output all article titles.
We'll dive into how this works, including covering webscraping essentials like:
- Spinning up a browser-like environment
- Driving the browser with F#
- Taking screenshots of webpages
- Parsing webpage elements and extracting information
Solution Overview
The full solution runs a simple F# script in a Docker container orchestrated with Docker Compose.
- Docker Compose - Used to orchestrate container creation. Useful in this case because we'll be making our local filesystem available to the container to store its screenshots, which requires extra configuration. I prefer static configuration over CLI args for determinism.
- Docker - The container that installs all necessary webscraping dependencies, builds our code, and runs the solution. Useful as it's theoretically deterministic across machines with Docker installed.
- F# - Drives the Selenium webscraper.
Selenium for webscraping - There are many libraries available for webscraping in the dotnet ecosystem. I chose Selenium because it is feature-rich, well-supported, and is popular across languages.
The containerization may seem like overkill, and it is, but I've found that the extra effort pays dividends in more deterministic script runs across machines. You can do all of this without a container, but in my experience setting up all the Selenium drivers / browsers manually is a pain.
Docker Container
The Docker container handles installing dependencies and running our project.
In particular, it handles:
- Setting up Selenium and its required browser / driver dependencies by pulling the official Selenium Chrome image. If you want to use a different browser (Firefox, Edge), you can change that here.
- Building the F# project
- Running the F# project
Dockerfile
# **Build Project**
# https://hub.docker.com/_/microsoft-dotnet
FROM mcr.microsoft.com/dotnet/sdk:7.0 AS build
EXPOSE 80
WORKDIR /source
# Copy fsproj and restore all dependencies
COPY ./*.fsproj ./
RUN dotnet restore
# Copy source code and build / publish app and libraries
COPY . .
RUN dotnet publish -c release -o /app --self-contained -r linux-x64
# **Run project**
# Create new layer with Selenium.Chrome
FROM selenium/standalone-chrome
WORKDIR /app
# Copy and run code
COPY --from=build /app .
ENTRYPOINT ["sudo", "./fetch-webpage-console"]
Highlights:
- We first build our F# project and publish it as a self-contained executable (the .NET runtime is bundled with it), so that we can run it in the selenium/standalone-chrome stage, which has our webscraping dependencies
For more information, read Run F# / .NET in Docker (Console App)
Docker Compose
We use docker-compose to encode more of our infrastructure as code. It's useful here because our project assumes it has access to a filesystem to store its images. We could pass the volume in as part of our Docker CLI command, but that's easy to forget. Formalizing it in our docker-compose file makes our runs more deterministic and less likely to leave out required arguments.
docker-compose.yml
version: "3"
services:
  fetch-nyt-console:
    build:
      context: ./
      dockerfile: ./Dockerfile
    container_name: fetch-nyt-console
    volumes:
      - ./ScreenshotsOut:/usr/src/app/ScreenshotsOut
Highlights:
- We mostly need this for the volumes section, which ensures our screenshots are saved somewhere we can find them later (our local filesystem)
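The script writes its screenshots to the container side of that volume mapping. As a minimal sketch of how the output path could be built on the F# side, here is a small helper that makes sure the directory exists before returning a unique file path. The helper name `newScreenshotPath` and the temp-directory demo are assumptions for illustration and are not part of the original script. The guard mainly matters if you run the script outside the container, where the directory may not exist yet.

```fsharp
open System
open System.IO

// Hypothetical helper (not in the original script): ensure the output
// directory exists, then build a unique .png path inside it.
let newScreenshotPath (outputDir: string) : string =
    // CreateDirectory does nothing if the directory already exists
    Directory.CreateDirectory(outputDir) |> ignore
    Path.Combine(outputDir, $"{Guid.NewGuid()}.png")

// Demo against a temp directory so this runs anywhere.
// Inside the container you would pass "/usr/src/app/ScreenshotsOut" instead.
let demoDir = Path.Combine(Path.GetTempPath(), "ScreenshotsOut")
let path = newScreenshotPath demoDir
printfn "Would save screenshot to: %s" path
```

The returned path could then be handed to the screenshot's SaveAsFile call in place of the hard-coded interpolated string.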
F# Project
Our F# project does a few things:
- Create Driver - Creates a ChromeDriver which allows it to spin up and control a Google Chrome browser
- Go to webpage - Drives the browser to the New York Times website and prints out the title to sanity check it was successful
- Deal with popup - Checks if a compliance overlay is present and clicks continue if so. This is a popup that started appearing on the site, and I thought it was a good example of how to deal with random stuff like this
- Take Screenshot - Takes a screenshot of the webpage and saves it to the directory in the container we've attached our local filesystem volume to
- Parse all titles - Parses all article titles on the page, finding them via h3 tags, and prints them out
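The title-parsing step is essentially a pure transformation over the text of each h3 element, so it can be sketched without a browser. The helper below is a hypothetical variant, not the script's exact logic: the script only drops empty strings, while the trimming and de-duplication here are additions I'm assuming could be useful if a page repeats a heading or pads it with whitespace. The sample input is made up.

```fsharp
// Hypothetical helper: normalize raw heading text into a list of titles.
let cleanTitles (rawTexts: seq<string>) : string list =
    rawTexts
    |> Seq.map (fun t -> t.Trim())          // assumption: strip padding
    |> Seq.filter (fun t -> t.Length > 0)   // same filter the script uses
    |> Seq.distinct                         // assumption: drop repeated headings
    |> Seq.toList

// Made-up sample standing in for the h3 texts returned by the browser
let sample = [ "  First headline "; ""; "Second headline"; "First headline" ]
printfn "%A" (cleanTitles sample)
// prints ["First headline"; "Second headline"]
```

In the real script, the input would be the Text of each element returned by FindElements.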
Program.fs
open System
open OpenQA.Selenium
open OpenQA.Selenium.Chrome
open OpenQA.Selenium.Support

printfn "Running webscraper"

// Create driver (headless Chrome)
let options = new ChromeOptions()
options.AddArguments([
    "--verbose";
    "--headless";
    "--disable-dev-shm-usage"
])
let driver = new ChromeDriver(options)

// Navigate to webpage and print the title as a sanity check
driver.Navigate().GoToUrl("https://www.nytimes.com/")
printfn "Title: %A" driver.Title

// Deal with compliance overlay: click its button only if the overlay exists
let complianceOverlayElements = driver.FindElements(By.Id("complianceOverlay"))
let isComplianceOverlayPresent =
    complianceOverlayElements.Count > 0
match isComplianceOverlayPresent with
| true ->
    complianceOverlayElements[0]
        .FindElement(By.TagName("button"))
        .Click()
| false -> ()

// Take screenshot and save it to the directory backed by our volume
let screenshot = driver.GetScreenshot()
screenshot.SaveAsFile(
    $"/usr/src/app/ScreenshotsOut/{Guid.NewGuid().ToString()}.png",
    ScreenshotImageFormat.Png)

// Get all article titles (non-empty h3 text)
let allArticleTitles =
    driver.FindElements(By.TagName("h3"))
    |> Seq.map (fun e -> e.Text)
    |> Seq.filter (fun t -> t.Length > 0)
    |> Seq.toList
printfn "AllArticleTitles: %A" allArticleTitles
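One thing to be aware of: the overlay check in the script runs once, right after navigation. If a page renders its popup a moment later, a single FindElements call could miss it. Below is a sketch of a more patient variant using Selenium's WebDriverWait. The helper name `dismissOverlayIfPresent` and the 5 second timeout in the usage comment are assumptions for illustration. I'm also assuming WebDriverWait is available from the Selenium.WebDriver package referenced by this project; if it doesn't resolve for you, the Selenium.Support package may be needed.

```fsharp
open System
open OpenQA.Selenium
open OpenQA.Selenium.Support.UI

// Hypothetical helper: wait up to `timeout` for an element with the given id,
// then click the first button inside it. Returns true if a click happened.
let dismissOverlayIfPresent (driver: IWebDriver) (overlayId: string) (timeout: TimeSpan) : bool =
    let wait = WebDriverWait(driver, timeout)
    try
        // Until polls until the lambda returns true or the timeout elapses
        wait.Until(fun (d: IWebDriver) -> d.FindElements(By.Id(overlayId)).Count > 0)
        |> ignore
        driver
            .FindElement(By.Id(overlayId))
            .FindElement(By.TagName("button"))
            .Click()
        true
    with
    | :? WebDriverTimeoutException ->
        // The overlay never appeared within the timeout, so nothing to dismiss
        false

// Possible usage in Program.fs, replacing the match expression:
// dismissOverlayIfPresent driver "complianceOverlay" (TimeSpan.FromSeconds 5.0) |> ignore
```

The tradeoff is that every run where the overlay does not appear now waits out the full timeout, so keep the timeout short if you scrape on a tight interval.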
fetch-webpage-console.fsproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net7.0</TargetFramework>
    <RootNamespace>fetch_webpage_console</RootNamespace>
  </PropertyGroup>
  <ItemGroup>
    <Compile Include="Program.fs" />
  </ItemGroup>
  <ItemGroup>
    <PackageReference Include="Selenium.WebDriver" Version="4.11.0" />
    <PackageReference Include="Selenium.WebDriver.ChromeDriver" Version="115.0.5790.17000" />
  </ItemGroup>
</Project>
Next Steps
That's everything I did to make a 3S webscraper using F# and Selenium!
HAMINIONs members can browse the full webscraping project code.
If you want to learn more about building with F#, check out Build a simple F# web API with Giraffe.