Building “Trafilatura-as-a-Service”: Clean Web Scraping with FastAPI, Docker, and API Keys 14 Nov 2025 • 4 min read

Want to turn messy web pages into clean, AI-ready text? This guide walks you through building a lightweight FastAPI microservice powered by Trafilatura — complete with API key authentication and a fully dockerized deployment. Perfect for .NET, JavaScript, or Python apps that need reliable, language-agnostic content extraction.

🚀 Building “Trafilatura-as-a-Service”: Clean Web Scraping with FastAPI, Docker, and API Keys

If you’ve ever scraped content from across the web, you know the real challenge isn’t downloading HTML — it’s getting the clean text hidden inside.

Navigation menus. Cookie banners. Endless sidebars.
When you're building RAG pipelines, AI agents, or search engines, this boilerplate becomes noise.

That’s why Trafilatura is such a powerful tool:
it performs industry-leading boilerplate removal, turning chaotic DOM structures into clean, meaningful text.
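
The library itself is only a couple of calls. Here's a minimal sketch of using it directly in Python; the URL is a placeholder for whatever page you want to process:

import trafilatura

# Download a page and strip it down to its main text
downloaded = trafilatura.fetch_url("https://example.com/some-article")  # placeholder URL
if downloaded:
    text = trafilatura.extract(downloaded)
    print(text)  # clean article text, no menus, banners, or sidebars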

But what if your backend isn’t Python?
What if your main pipeline is C#, JavaScript, Go, or Rust?

👉 That’s where Trafilatura-as-a-Service comes in.
A tiny FastAPI microservice that:

Accepts URLs or raw HTML

Returns clean text

Works with any language

Is easy to scale, cache, or secure

Runs anywhere — including Docker


Let’s build it.
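
Before writing any code, here's roughly the request/response contract we're aiming for; the values are illustrative, and the exact fields come from the response model defined below:

POST /extract
{ "url": "https://example.com/some-article" }

HTTP 200
{
  "text": "The clean article text...",
  "source_url": "https://example.com/some-article",
  "from_html": false,
  "length": 1234
}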

🧱 Why Build Trafilatura as a Microservice?

This architecture shines when:

Your main application isn’t Python

You want to isolate scraping/extraction in a dedicated service

You’re embedding extracted text into a vector store (pgvector, Pinecone, Qdrant, Chroma)

You're processing lots of URLs and want scalable, stateless workers

You want consistent extraction quality across projects/apps


A clean separation keeps your codebase simple:

Raw URL → Trafilatura API → Clean text → Chunk → Embed → Vector DB → RAG

And your scraper pipeline instantly becomes more maintainable.
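
In code, the consumer side of that pipeline is a single HTTP call plus whatever chunking and embedding you already use. Here's a rough Python sketch, assuming the service runs on localhost and using the requests library; the chunker and the embed step are placeholders:

import requests

TRAF_API = "http://localhost:8000/extract"  # wherever the service is deployed

def extract_clean_text(url: str) -> str:
    # One call to the microservice returns boilerplate-free text
    resp = requests.post(TRAF_API, json={"url": url}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-size chunking; swap in your preferred splitter
    return [text[i:i + size] for i in range(0, len(text), size)]

for url in ["https://example.com/post-1", "https://example.com/post-2"]:  # placeholder URLs
    for piece in chunk(extract_clean_text(url)):
        ...  # embed `piece` and write it to pgvector / Pinecone / Qdrant / Chroma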


🛠️ The Core FastAPI Service

Below is the full FastAPI app that powers the service.
It supports:

✔ URL-based extraction
✔ Raw HTML extraction
✔ Error handling
✔ Extraction options
✔ Threadpool offloading for Trafilatura

main.py

from __future__ import annotations

from typing import Optional

from fastapi import FastAPI, HTTPException, Header
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel, HttpUrl
import trafilatura
import os

app = FastAPI(title="Trafilatura Extraction API")

API_KEY = os.getenv("TRAF_API_KEY")  # optional: if set, requests must supply this key

class ExtractRequest(BaseModel):
    url: Optional[HttpUrl] = None
    html: Optional[str] = None

    favor_recall: bool = True
    include_comments: bool = False
    include_tables: bool = False
    include_images: bool = False
    include_formatting: bool = False

class ExtractResponse(BaseModel):
    text: str
    source_url: Optional[HttpUrl]
    from_html: bool
    length: int

def verify_api_key(key: Optional[str]):
    if API_KEY and key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.get("/health")
async def health(x_api_key: Optional[str] = Header(default=None)):
    verify_api_key(x_api_key)
    return {"status": "ok"}

@app.post("/extract", response_model=ExtractResponse)
async def extract_content(
    req: ExtractRequest,
    x_api_key: Optional[str] = Header(default=None)
):
    verify_api_key(x_api_key)

    if not req.url and not req.html:
        raise HTTPException(
            status_code=400,
            detail="You must provide either `url` or `html`."
        )

    html_source = req.html

    if req.url:
        downloaded = await run_in_threadpool(
            trafilatura.fetch_url, str(req.url)
        )
        if downloaded:
            html_source = downloaded
        elif not html_source:
            raise HTTPException(
                status_code=502,
                detail=f"Failed to download URL: {req.url}"
            )

    if not html_source:
        raise HTTPException(400, "No HTML to extract.")

    text = await run_in_threadpool(
        trafilatura.extract,
        html_source,
        include_comments=req.include_comments,
        include_tables=req.include_tables,
        include_images=req.include_images,
        include_formatting=req.include_formatting,
        favor_recall=req.favor_recall,
    )

    if not text:
        raise HTTPException(422, "No extractable content.")

    text = text.strip()

    return ExtractResponse(
        text=text,
        source_url=req.url,
        from_html=req.html is not None,
        length=len(text),
    )

This service is now complete — but let’s level it up.
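
Before adding anything else, you can run it locally with uvicorn and hit it with curl (the target URL is just a placeholder):

pip install fastapi "uvicorn[standard]" "trafilatura[all]"
uvicorn main:app --reload --port 8000

# in another terminal
curl -X POST http://localhost:8000/extract \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/some-article"}'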

🔒 Adding API Key Authentication

Want to prevent random users from hitting your service?

Just set an environment variable:

export TRAF_API_KEY="supersecret123"

Requests must now include:

X-API-Key: supersecret123

This keeps the service secure in production or when deployed in a private environment.
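
For example, the curl request from earlier just gains one header; without it (or with the wrong key), the service responds with 401:

curl -X POST http://localhost:8000/extract \
  -H "Content-Type: application/json" \
  -H "X-API-Key: supersecret123" \
  -d '{"url": "https://example.com/some-article"}'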

📦 Dockerizing the Service

Here’s a production-ready Dockerfile:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget curl build-essential && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy project files
COPY main.py /app/
COPY requirements.txt /app/

# Install Python deps
RUN pip install --no-cache-dir -r requirements.txt

# Expose FastAPI port
EXPOSE 8000

# Start the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

requirements.txt

fastapi
uvicorn[standard]
trafilatura[all]

Build & run:

docker build -t trafilatura-service .
docker run -p 8000:8000 \
  -e TRAF_API_KEY="supersecret123" \
  trafilatura-service

Your service is now running in a container, secured with an API key.
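
If you prefer Compose, a minimal docker-compose.yml sketch looks like this; the service name is arbitrary and the key should come from your own secret management:

services:
  trafilatura:
    build: .
    ports:
      - "8000:8000"
    environment:
      TRAF_API_KEY: "supersecret123"
    restart: unless-stopped

Then start it with docker compose up -d --build.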


🤝 Example: Calling the Service from C#

Here’s the simplest possible C# client:

using System.Net.Http.Json;

var http = new HttpClient
{
    BaseAddress = new Uri("http://localhost:8000")
};

http.DefaultRequestHeaders.Add("X-API-Key", "supersecret123");

var response = await http.PostAsJsonAsync("/extract", new {
    url = "https://example.com"
});

response.EnsureSuccessStatusCode();

var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);

This integrates cleanly into your scraper → embed → pgvector pipeline.

🚀 Final Thoughts

Trafilatura is one of the best content extractors available today, and wrapping it in FastAPI gives you a powerful, reusable, language-agnostic microservice.

In this post, we built:

A complete extraction API
API key security
A Dockerized service ready for deployment
A C# client example

This architecture is ideal for:

AI pipelines
Search engines
Knowledge ingestion
Vector database indexing
Multi-language backends