Want to turn messy web pages into clean, AI-ready text? This guide walks you through building a lightweight FastAPI microservice powered by Trafilatura — complete with API key authentication and a fully dockerized deployment. Perfect for .NET, JavaScript, or Python apps that need reliable, language-agnostic content extraction.
🚀 Building “Trafilatura-as-a-Service”: Clean Web Scraping with FastAPI, Docker, and API Keys
If you’ve ever scraped content from across the web, you know the real challenge isn’t downloading HTML — it’s getting the clean text hidden inside.
Navigation menus. Cookie banners. Endless sidebars.
When you're building RAG pipelines, AI agents, or search engines, this boilerplate becomes noise.
That’s why Trafilatura is such a powerful tool:
it performs industry-leading boilerplate removal, turning chaotic DOM structures into clean, meaningful text.
But what if your backend isn’t Python?
What if your main pipeline is C#, JavaScript, Go, or Rust?
👉 That’s where Trafilatura-as-a-Service comes in.
A tiny FastAPI microservice that:
- Accepts URLs or raw HTML
- Returns clean text
- Works with any language
- Is easy to scale, cache, or secure
- Runs anywhere, including Docker
Let’s build it.
🧱 Why Build Trafilatura as a Microservice?
This architecture shines when:
- Your main application isn’t Python
- You want to isolate scraping/extraction in a dedicated service
- You’re feeding text into embeddings (pgvector, Pinecone, Qdrant, Chroma)
- You’re processing lots of URLs and want scalable, stateless workers
- You want consistent extraction quality across projects/apps
A clean separation keeps your codebase simple:
Raw URL → Trafilatura API → Clean text → Chunk → Embed → Vector DB → RAG
And your scraper pipeline instantly becomes more maintainable.
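To make the “Chunk” step in that pipeline concrete, here is a minimal sketch of a whitespace-based chunker you might run on the service’s output before embedding. The window and overlap sizes are illustrative assumptions, not part of the service:

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split extracted text into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # advance by window minus overlap each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already covers the tail
    return chunks

# Example: a 500-word document yields three overlapping ~200-word chunks.
doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc)
```

Overlap keeps sentences that straddle a chunk boundary retrievable from both sides.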
🛠️ The Core FastAPI Service
Below is the full FastAPI app that powers the service.
It supports:
✔ URL-based extraction
✔ Raw HTML extraction
✔ Error handling
✔ Extraction options
✔ Threadpool offloading for Trafilatura
main.py

```python
from __future__ import annotations

from typing import Optional

from fastapi import FastAPI, HTTPException, Header
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel, HttpUrl
import trafilatura
import os

app = FastAPI(title="Trafilatura Extraction API")

API_KEY = os.getenv("TRAF_API_KEY")  # for authentication


class ExtractRequest(BaseModel):
    url: Optional[HttpUrl] = None
    html: Optional[str] = None
    favor_recall: bool = True
    include_comments: bool = False
    include_tables: bool = False
    include_images: bool = False
    include_formatting: bool = False


class ExtractResponse(BaseModel):
    text: str
    source_url: Optional[HttpUrl] = None
    from_html: bool
    length: int


def verify_api_key(key: Optional[str]) -> None:
    if API_KEY and key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")


@app.get("/health")
async def health(x_api_key: Optional[str] = Header(default=None)):
    verify_api_key(x_api_key)
    return {"status": "ok"}


@app.post("/extract", response_model=ExtractResponse)
async def extract_content(
    req: ExtractRequest,
    x_api_key: Optional[str] = Header(default=None),
):
    verify_api_key(x_api_key)

    if not req.url and not req.html:
        raise HTTPException(
            status_code=400,
            detail="You must provide either `url` or `html`.",
        )

    html_source = req.html
    if req.url:
        downloaded = await run_in_threadpool(
            trafilatura.fetch_url, str(req.url)
        )
        if downloaded:
            html_source = downloaded
        elif not html_source:
            raise HTTPException(
                status_code=502,
                detail=f"Failed to download URL: {req.url}",
            )

    if not html_source:
        raise HTTPException(400, "No HTML to extract.")

    text = await run_in_threadpool(
        trafilatura.extract,
        html_source,
        include_comments=req.include_comments,
        include_tables=req.include_tables,
        include_images=req.include_images,
        include_formatting=req.include_formatting,
        favor_recall=req.favor_recall,
    )
    if not text:
        raise HTTPException(422, "No extractable content.")

    text = text.strip()
    return ExtractResponse(
        text=text,
        source_url=req.url,
        from_html=req.html is not None,
        length=len(text),
    )
```
This service is now complete — but let’s level it up.
🔒 Adding API Key Authentication
Want to prevent random users from hitting your service?
Just set an environment variable:
```shell
export TRAF_API_KEY="supersecret123"
```
Requests must now include:
```text
X-API-Key: supersecret123
```
This keeps the service secure in production or when deployed in a private environment.
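One small hardening worth considering: the plain `!=` comparison in `verify_api_key` can in principle leak timing information. A hedged sketch of a constant-time check using the standard library’s `hmac.compare_digest` (written here as a standalone helper taking the expected key as a parameter, rather than reading the module-level constant):

```python
import hmac
from typing import Optional


def is_valid_key(candidate: Optional[str], expected: Optional[str]) -> bool:
    """Constant-time API key check; auth is disabled when no key is configured."""
    if not expected:
        return True  # mirrors the service above: no TRAF_API_KEY -> open access
    if candidate is None:
        return False
    # compare_digest takes time proportional to length, not to the first mismatch
    return hmac.compare_digest(candidate.encode(), expected.encode())
```

Swapping this into `verify_api_key` is a one-line change and costs nothing.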
📦 Dockerizing the Service
Here’s a production-ready Dockerfile:
```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
        wget curl build-essential && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy project files
COPY main.py /app/
COPY requirements.txt /app/

# Install Python deps
RUN pip install --no-cache-dir -r requirements.txt

# Expose FastAPI port
EXPOSE 8000

# Start the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
requirements.txt

```text
fastapi
uvicorn[standard]
trafilatura[all]
```
Build & run:
```shell
docker build -t trafilatura-service .

docker run -p 8000:8000 \
  -e TRAF_API_KEY="supersecret123" \
  trafilatura-service
```
Your service is now running in a container, secured with an API key.
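If your pipeline is Python, calling the containerized service needs nothing beyond the standard library. A sketch, assuming the local port and sample key from the `docker run` above:

```python
import json
from urllib import request

SERVICE = "http://localhost:8000"   # assumed local deployment from `docker run` above
API_KEY = "supersecret123"          # the sample key used throughout this post


def build_extract_request(url: str) -> request.Request:
    """Build the POST /extract request with the X-API-Key header set."""
    payload = json.dumps({"url": url}).encode("utf-8")
    return request.Request(
        f"{SERVICE}/extract",
        data=payload,
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_clean_text(url: str) -> str:
    """Call the running service and return the extracted `text` field."""
    with request.urlopen(build_extract_request(url), timeout=30) as resp:
        return json.loads(resp.read())["text"]

# fetch_clean_text("https://example.com")  # requires the container to be running
```

The same request shape works from any language, as the C# example below shows.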
🤝 Example: Calling the Service from C#
Here’s the simplest possible C# client:
```csharp
using System.Net.Http.Json;

var http = new HttpClient
{
    BaseAddress = new Uri("http://localhost:8000")
};
http.DefaultRequestHeaders.Add("X-API-Key", "supersecret123");

var response = await http.PostAsJsonAsync("/extract", new {
    url = "https://example.com"
});

response.EnsureSuccessStatusCode();
var body = await response.Content.ReadAsStringAsync();
Console.WriteLine(body);
```
This integrates cleanly into your scraper → embed → pgvector pipeline.
🚀 Final Thoughts
Trafilatura is one of the best content extractors available today, and wrapping it in FastAPI gives you a powerful, reusable, language-agnostic microservice.
In this post, we built:
- A complete extraction API
- API key security
- A Dockerized service ready for deployment
- A C# client example
This architecture is ideal for:
- AI pipelines
- Search engines
- Knowledge ingestion
- Vector database indexing
- Multi-language backends