Financial Data Extractor
An automated platform that scrapes, classifies, parses, and compiles multi-year financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from European company investor relations websites.
Project Overview
The Financial Data Extractor automates the labor-intensive process of collecting and standardizing financial data from annual reports. It handles:
- Web Scraping: Automated discovery and download of annual reports from investor relations websites
- Document Classification: Intelligent categorization of PDFs (Annual Reports, Presentations, etc.)
- Data Extraction: LLM-powered parsing of financial statements from PDF documents
- Normalization: Fuzzy matching and deduplication of line items across multiple years
- Compilation: Aggregation of 10 years of financial data into unified views
Core Objectives
- Scrape & Classify: Identify and categorize PDFs from investor relations websites
- Parse: Extract financial data from Annual Reports using an LLM (via OpenRouter)
- Compile: Aggregate 10 years of financial data into unified views
- Deduplicate: Align and merge similarly-named line items across years
- Prioritize Latest: Use restated data from newer reports when available
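The "prioritize latest" rule can be sketched as a merge in which newer reports overwrite older ones for the same fiscal year. Field names and shapes below are illustrative, not the actual schema:

```python
# Sketch of restatement precedence: when the same fiscal year appears in
# several annual reports, keep the value from the most recently published
# report (restated figures supersede the originals).

def merge_restated(extractions: list[dict]) -> dict[int, float]:
    """Merge {report_year, values: {fiscal_year: amount}} records,
    letting newer reports overwrite older ones."""
    merged: dict[int, float] = {}
    # Process oldest report first, so later (newer) reports overwrite.
    for report in sorted(extractions, key=lambda r: r["report_year"]):
        merged.update(report["values"])
    return merged

reports = [
    {"report_year": 2022, "values": {2021: 100.0, 2022: 110.0}},
    {"report_year": 2023, "values": {2022: 112.0, 2023: 120.0}},  # 2022 restated
]
print(merge_restated(reports))  # {2021: 100.0, 2022: 112.0, 2023: 120.0}
```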
System Architecture
The Financial Data Extractor follows a modern, layered architecture with clear separation between frontend, API, processing, and data layers:
```mermaid
graph LR
    subgraph "User Interface"
        UI[Next.js Frontend<br/>React 19 + TypeScript]
    end

    subgraph "API & Services"
        API[FastAPI Backend<br/>REST API]
    end

    subgraph "Background Processing"
        QUEUE[Celery Task Queue]
        WORKERS[Celery Workers<br/>Scraping • Extraction • Compilation]
    end

    subgraph "Data Storage"
        DB[(PostgreSQL<br/>Companies, Documents,<br/>Extractions, Statements)]
        CACHE[(Redis<br/>Cache & Message Broker)]
        STORAGE[(MinIO<br/>PDF Storage)]
    end

    subgraph "External Services"
        LLM[OpenRouter<br/>LLM API]
    end

    UI -->|HTTP/REST| API
    API -->|Queue Tasks| QUEUE
    QUEUE --> WORKERS
    API --> DB
    API --> CACHE
    WORKERS --> LLM
    WORKERS --> STORAGE
    WORKERS --> DB
    WORKERS --> CACHE

    classDef ui fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef external fill:#fce4ec,stroke:#880e4f,stroke-width:2px

    class UI ui
    class API api
    class QUEUE,WORKERS processing
    class DB,CACHE,STORAGE data
    class LLM external
```
Key Components:
- Frontend: Next.js 15 application providing user interface and data visualization
- Backend API: FastAPI REST API handling requests and business logic
- Task Processing: Celery workers for async operations (scraping, extraction, compilation)
- Data Layer: PostgreSQL for structured data, Redis for caching, MinIO for file storage
- LLM Integration: OpenRouter API gateway for flexible model selection
For detailed architecture information, see the Architecture Overview documentation.
Target Companies
- Initial Scope: 6 European companies seeded in database migrations
  - AstraZeneca PLC (AZN - LSE, NASDAQ)
  - SAP SE (SAP - XETRA, NYSE)
  - Siemens AG (SIE - XETRA)
  - ASML Holding N.V. (ASML - Euronext Amsterdam, NASDAQ)
  - Unilever PLC (ULVR - LSE, UNA - Euronext Amsterdam, UL - NYSE)
  - Allianz SE (ALV - XETRA)
- Scalable: Architecture supports adding more companies dynamically via API
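Registering an additional company might look like the payload below; the exact endpoint and field names are assumptions, so check the API Reference for the real contract:

```python
# Hypothetical request body for adding a company via the REST API.
import json

new_company = {
    "name": "Nestlé S.A.",                           # illustrative example
    "ticker": "NESN",
    "exchange": "SIX Swiss Exchange",
    "ir_url": "https://www.nestle.com/investors",    # scraping entry point
}
# e.g. POST /api/companies with this JSON body:
print(json.dumps(new_company, ensure_ascii=False, indent=2))
```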
Data Flow
The system processes financial data through three main phases:
1. Scraping & Classification - Discover and categorize PDFs from investor relations websites
2. Parsing & Extraction - Extract financial statements using an LLM
3. Normalization & Compilation - Normalize line items and compile 10 years of data
See the Data Flow documentation for detailed workflow information.
Technology Decisions
Key technology choices include:
- OpenRouter - LLM API gateway for flexible model selection
- PostgreSQL - JSONB support for flexible data structures
- Celery - Distributed task queue for async processing
- FastAPI - High-performance async API framework
- Next.js 15 - Modern React framework with Server Components
- React Query - Data fetching and caching
- MinIO - S3-compatible object storage
See the Technology Decisions documentation for detailed rationale behind each choice.
Technology Stack
Backend
- FastAPI - High-performance async web framework
- Celery - Distributed task queue for background processing
- PostgreSQL - Primary database with JSONB support
- Redis - Caching layer and Celery message broker
- SQLAlchemy - ORM for database operations
- Alembic - Database migrations
Frontend
- Next.js 15 - React framework with App Router
- React - UI library
- TailwindCSS - Utility-first CSS framework
- shadcn/ui - Component library
Processing & AI
- OpenRouter - LLM API gateway for accessing multiple models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
- PyMuPDF - PDF processing and table extraction
- pdfplumber - Alternative PDF text extraction
- rapidfuzz - Fuzzy string matching for line item normalization
- Crawl4AI - LLM-friendly web crawler for investor relations websites
Monitoring & Observability
The platform includes an observability stack covering metrics, log aggregation, and task monitoring:
Monitoring Stack
- Prometheus - Metrics collection and storage (port 9090)
- Grafana - Metrics visualization and dashboards (port 3200)
- Loki - Log aggregation (port 3100)
- Promtail - Log shipper for container logs
- Flower - Celery task monitoring (port 5555)
- PostgreSQL Exporter - Database metrics (port 9187)
- Redis Exporter - Cache and broker metrics (port 9121)
Key Metrics
Business Metrics:
- Total companies processed
- Total PDFs classified
- Statements extracted per day
- Data quality scores
- Extraction success rates
Technical Metrics:
- API latency (p50, p95, p99) - via Prometheus from FastAPI /metrics
- Celery queue depth - monitored via Flower and Redis exporter
- Task success/failure rates - tracked in Flower and Prometheus
- LLM API costs and latency - tracked via custom metrics
- Database query performance - PostgreSQL exporter metrics
- Redis connection pool usage - Redis exporter metrics
- Storage usage (MinIO) - via MinIO console
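The custom LLM cost and latency metrics can be sketched with prometheus_client; metric and label names below are illustrative, not the ones the backend actually registers:

```python
# Sketch of custom LLM metrics exported to Prometheus.
from prometheus_client import Counter, Histogram

LLM_COST_USD = Counter(
    "llm_cost_usd",                     # exported as llm_cost_usd_total
    "Cumulative LLM spend in USD",
    ["model", "task_type"],
)
LLM_LATENCY = Histogram(
    "llm_request_seconds",
    "LLM API request latency in seconds",
    ["model"],
)

def record_llm_call(model: str, task_type: str, cost_usd: float, seconds: float) -> None:
    LLM_COST_USD.labels(model=model, task_type=task_type).inc(cost_usd)
    LLM_LATENCY.labels(model=model).observe(seconds)

record_llm_call("gpt-4o-mini", "extraction", 0.0021, 3.4)
```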
Alerts:
- Task failure rate > 5%
- LLM API errors (429, 500, timeout)
- Queue backlog > 1000 tasks
- Database connection pool exhaustion
- Redis memory usage > 80%
- Disk space < 10% free
Dashboards
Grafana Dashboards (Pre-configured):
- API Performance - Request latency, throughput, error rates from FastAPI
- Database Metrics - PostgreSQL connection pool, query performance, transaction rates
- Redis Metrics - Memory usage, connection count, command rates
- Celery Tasks - Task execution times, success/failure rates, queue depths (via Prometheus)
- Infrastructure - CPU, memory, disk, network usage
Log Aggregation:
- All container logs aggregated via Promtail → Loki
- Query logs via Grafana Explore view
- Structured logging from FastAPI with request IDs
- Celery worker logs with task context
Access
- Grafana: http://localhost:3200 (admin/admin)
- Prometheus: http://localhost:9090
- Flower: http://localhost:5555
- Loki: http://localhost:3100
Security Considerations
- Rate Limiting: Prevent abuse of expensive extraction endpoints
- Authentication: OAuth2 with JWT tokens
- API Keys: Secure storage of OpenRouter API keys (env vars)
- Input Validation: Sanitize company URLs to prevent SSRF
- File Validation: Verify PDFs, scan for malware
- Data Privacy: GDPR compliance for European companies
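The SSRF check on company URLs can be sketched as: allow only http(s), resolve the hostname, and reject anything that lands on an internal address. This is an illustrative sketch; production validation may be stricter:

```python
# Sketch of SSRF-safe validation for user-supplied IR website URLs.
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_ir_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable host
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Block internal targets such as 127.0.0.0/8, 10.0.0.0/8, 169.254.0.0/16.
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_ir_url("http://127.0.0.1/admin"))  # False
print(is_safe_ir_url("ftp://example.com"))       # False
```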
Quick Start
Get started quickly with the Getting Started Guide which includes:
- Installation - Complete setup instructions
- First Steps - Tutorial for your first extraction
For detailed setup information, see the Infrastructure Development Setup.
Documentation
This documentation site provides comprehensive guides organized by category:
Getting Started
- Getting Started Guide - Quick start and installation
- Installation - Detailed setup instructions
- First Steps - Tutorial for your first extraction
Architecture
- Architecture Overview - System design and architecture
- Data Flow - Detailed workflow from scraping to compilation
- Technology Decisions - Rationale behind technology choices
Backend
- Backend Overview - FastAPI backend architecture, database, services, and testing
- Backend Architecture - Connection pool management, dependency injection, repository pattern, and exception handling
- Backend Testing - pytest setup, unit tests, integration tests with testcontainers
Frontend
- Frontend Overview - Next.js 15 frontend architecture, components, React Query, and testing
- Frontend Architecture - Next.js 15 architecture, React Query, components, and development guide
- Frontend Testing - Vitest unit testing, React Testing Library, and testing strategies
- Frontend DevTools - React Query DevTools, ESLint plugin, and frontend debugging tools
API
- API Overview - REST API documentation and reference
- API Reference - Complete REST API documentation with all endpoints, request/response formats, and examples
Database
- Database Overview - Database schema, migrations, queries, and operations
- Database Schema - Table structures, relationships, and JSONB formats
- Database Migrations - Alembic migration commands and workflows
- Database Queries - Useful SQL queries for data inspection
Infrastructure
- Infrastructure Overview - Docker setup, development environment, and task processing
- Development Setup - Docker Compose setup, service management, and monitoring stack
- Task Processing - Celery task system, workers, Flower monitoring
- Object Storage - MinIO object storage setup and usage
Testing
- Testing Overview - Overview of testing strategies for backend and frontend
- Backend Testing - pytest guide for FastAPI backend
- Frontend Testing - Vitest guide for Next.js frontend
Development Tools
- Development Tools - IDE configuration and development environment
- Cursor IDE Configuration - Cursor rules, VS Code settings, debug configurations, and tasks
Frequently Asked Questions
What financial statements does the Financial Data Extractor support?
The platform extracts three core financial statements from annual reports: Income Statement (also called P&L or Statement of Operations), Balance Sheet (Statement of Financial Position), and Cash Flow Statement. It preserves the exact line item hierarchy and supports multi-year data extraction from a single document.
Which LLM models are supported?
The platform uses OpenRouter as an API gateway, providing access to models including OpenAI GPT-4o, GPT-4o-mini, and Anthropic Claude. The model is configurable per task type — you can use a lighter model for web scraping and a more capable model for financial statement extraction.
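Per-task model selection can be sketched as a simple mapping of task type to OpenRouter model slug; the configuration keys below are assumptions, not the actual settings:

```python
# Illustrative per-task model configuration using OpenRouter model slugs.
MODEL_BY_TASK = {
    "scraping_classification": "openai/gpt-4o-mini",        # cheap and fast
    "statement_extraction": "anthropic/claude-3.5-sonnet",  # higher accuracy
}

def model_for(task_type: str) -> str:
    """Pick the configured model, falling back to the cheap default."""
    return MODEL_BY_TASK.get(task_type, "openai/gpt-4o-mini")
```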
How does it handle large PDF documents?
Financial annual reports can be 200+ pages. The extractor uses a smart page selection approach: PyMuPDF quickly extracts text from all pages, identifies pages containing financial keywords, then runs targeted table extraction (via camelot-py) only on those relevant pages. This reduces processing time from hours to minutes per document.
Can I add companies beyond the initial 6 European companies?
Yes. While the initial seed data includes AstraZeneca, SAP, Siemens, ASML, Unilever, and Allianz, any company with an investor relations website can be added dynamically via the REST API. The scraping pipeline will automatically discover and download their annual reports.
Is the Financial Data Extractor open source?
Yes. The entire platform is released under the Apache 2.0 License. The source code is available on GitHub.
License
Financial Data Extractor is released under the Apache 2.0 License. See the LICENSE file for more details.