Financial Data Extractor

An automated platform that scrapes, classifies, parses, and compiles multi-year financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from European company investor relations websites.

Project Overview

The Financial Data Extractor automates the labor-intensive process of collecting and standardizing financial data from annual reports. It handles:

  • Web Scraping: Automated discovery and download of annual reports from investor relations websites
  • Document Classification: Intelligent categorization of PDFs (Annual Reports, Presentations, etc.)
  • Data Extraction: LLM-powered parsing of financial statements from PDF documents
  • Normalization: Fuzzy matching and deduplication of line items across multiple years
  • Compilation: Aggregation of 10 years of financial data into unified views

Core Objectives

  1. Scrape & Classify: Identify and categorize PDFs from investor relations websites
  2. Parse: Extract financial data from Annual Reports using LLM (via OpenRouter)
  3. Compile: Aggregate 10 years of financial data into unified views
  4. Deduplicate: Align and merge similarly named line items across years
  5. Prioritize Latest: Use restated data from newer reports when available (sketched below)
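
Objectives 4 and 5 reduce to a merge policy: line items are aligned across years by fuzzy name matching, and when several reports carry a value for the same line item and fiscal year, the most recent report wins so that restatements take effect. A minimal sketch of that policy using rapidfuzz (function and field names here are illustrative, not the pipeline's actual API):

from rapidfuzz import fuzz, process

# Canonical labels the compiler aligns against (illustrative subset).
CANONICAL = ["Revenue", "Cost of sales", "Gross profit", "Operating profit"]

def canonical_name(raw: str, cutoff: int = 85) -> str:
    # Map a raw line item label onto a canonical label, if one is close enough.
    match = process.extractOne(raw, CANONICAL, scorer=fuzz.token_sort_ratio, score_cutoff=cutoff)
    return match[0] if match else raw

def compile_values(rows: list[dict]) -> dict:
    # Each row: {"report_year", "fiscal_year", "line_item", "value"}.
    merged: dict[tuple[str, int], float] = {}
    for row in sorted(rows, key=lambda r: r["report_year"]):  # oldest report first
        key = (canonical_name(row["line_item"]), row["fiscal_year"])
        merged[key] = row["value"]  # newer reports overwrite older (restated) values
    return merged

Under this policy, a figure restated in the 2023 report silently replaces the 2021 original for the same fiscal year.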

System Architecture

The Financial Data Extractor follows a layered architecture with clear separation between the frontend, API, background-processing, and data layers:

graph LR
    subgraph "User Interface"
        UI[Next.js Frontend<br/>React 19 + TypeScript]
    end

    subgraph "API & Services"
        API[FastAPI Backend<br/>REST API]
    end

    subgraph "Background Processing"
        QUEUE[Celery Task Queue]
        WORKERS[Celery Workers<br/>Scraping • Extraction • Compilation]
    end

    subgraph "Data Storage"
        DB[(PostgreSQL<br/>Companies, Documents,<br/>Extractions, Statements)]
        CACHE[(Redis<br/>Cache & Message Broker)]
        STORAGE[(MinIO<br/>PDF Storage)]
    end

    subgraph "External Services"
        LLM[OpenRouter<br/>LLM API]
    end

    UI -->|HTTP/REST| API
    API -->|Queue Tasks| QUEUE
    QUEUE --> WORKERS
    API --> DB
    API --> CACHE
    WORKERS --> LLM
    WORKERS --> STORAGE
    WORKERS --> DB
    WORKERS --> CACHE

    classDef ui fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef external fill:#fce4ec,stroke:#880e4f,stroke-width:2px

    class UI ui
    class API api
    class QUEUE,WORKERS processing
    class DB,CACHE,STORAGE data
    class LLM external

Key Components:

  • Frontend: Next.js 15 application providing the user interface and data visualization
  • Backend API: FastAPI REST API handling requests and business logic
  • Task Processing: Celery workers for async operations (scraping, extraction, compilation)
  • Data Layer: PostgreSQL for structured data, Redis for caching, MinIO for file storage
  • LLM Integration: OpenRouter API gateway for flexible model selection

For detailed architecture information, see the Architecture Overview documentation.

Target Companies

  • Initial Scope: 6 European companies seeded in database migrations
    • AstraZeneca PLC (AZN - LSE, NASDAQ)
    • SAP SE (SAP - XETRA, NYSE)
    • Siemens AG (SIE - XETRA)
    • ASML Holding N.V. (ASML - Euronext Amsterdam, NASDAQ)
    • Unilever PLC (ULVR - LSE, UNA - Euronext Amsterdam, UL - NYSE)
    • Allianz SE (ALV - XETRA)
  • Scalable: Architecture supports adding more companies dynamically via API
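
Adding a company is a single REST call. A sketch in Python with httpx; the endpoint path and payload fields are assumptions, so check the API Reference for the real contract:

import httpx

# Hypothetical endpoint and fields; see the API Reference for the real ones.
resp = httpx.post(
    "http://localhost:8000/api/v1/companies",
    json={
        "name": "Example Company PLC",
        "ticker": "EXMP",
        "ir_url": "https://example.com/investors",
    },
    timeout=30.0,
)
resp.raise_for_status()
print(resp.json())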

Data Flow

The system processes financial data through three main phases:

  1. Scraping & Classification - Discover and categorize PDFs from investor relations websites
  2. Parsing & Extraction - Extract financial statements using LLM
  3. Normalization & Compilation - Align line items across years and compile 10 years of data into unified views
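
In Celery terms the three phases map naturally onto a chain, with each task passing its result to the next. A sketch under assumed task names (the real signatures live in the worker code):

from celery import Celery, chain

app = Celery("extractor", broker="redis://localhost:6379/0")

@app.task
def scrape_and_classify(company_id: int) -> list[int]:
    """Phase 1: discover, download, and classify PDFs; return document IDs."""
    ...

@app.task
def parse_documents(document_ids: list[int]) -> list[int]:
    """Phase 2: LLM extraction of financial statements; return extraction IDs."""
    ...

@app.task
def normalize_and_compile(extraction_ids: list[int]) -> int:
    """Phase 3: align line items and compile the multi-year view."""
    ...

# Queue the full pipeline for one company:
chain(scrape_and_classify.s(42), parse_documents.s(), normalize_and_compile.s()).apply_async()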

See the Data Flow documentation for detailed workflow information.

Technology Decisions

Key technology choices include:

  • OpenRouter - LLM API gateway for flexible model selection
  • PostgreSQL - JSONB support for flexible data structures
  • Celery - Distributed task queue for async processing
  • FastAPI - High-performance async API framework
  • Next.js 15 - Modern React framework with Server Components
  • React Query - Data fetching and caching
  • MinIO - S3-compatible object storage

See the Technology Decisions documentation for detailed rationale behind each choice.

Technology Stack

Backend

  • FastAPI - High-performance async web framework
  • Celery - Distributed task queue for background processing
  • PostgreSQL - Primary database with JSONB support (see the model sketch below)
  • Redis - Caching layer and Celery message broker
  • SQLAlchemy - ORM for database operations
  • Alembic - Database migrations
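
The JSONB support called out above is what lets each statement keep its raw line item hierarchy without a rigid schema. A sketch of what such a model could look like in SQLAlchemy 2.0 style; the table and column names are assumptions, not the shipped schema:

from sqlalchemy import ForeignKey, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class FinancialStatement(Base):
    __tablename__ = "financial_statements"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    company_id: Mapped[int] = mapped_column(ForeignKey("companies.id"))
    statement_type: Mapped[str] = mapped_column(String(32))  # income / balance / cash_flow
    fiscal_year: Mapped[int] = mapped_column(Integer)
    # Raw extracted line items, nested exactly as they appear in the report.
    line_items: Mapped[dict] = mapped_column(JSONB)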

Frontend

  • Next.js 15 - React framework with App Router
  • React 19 - UI library
  • TypeScript - Static typing across the frontend
  • React Query - Data fetching and caching
  • TailwindCSS - Utility-first CSS framework
  • shadcn/ui - Component library

Processing & AI

  • OpenRouter - LLM API gateway for accessing multiple models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
  • PyMuPDF - Fast PDF text extraction and page analysis
  • camelot-py - Table extraction from pages selected by keyword search
  • pdfplumber - Alternative PDF text extraction
  • rapidfuzz - Fuzzy string matching for line item normalization
  • Crawl4AI - LLM-friendly web crawler for investor relations websites
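
For the scraping step, Crawl4AI renders an investor relations page and hands back clean HTML from which PDF links can be harvested. A minimal sketch, assuming Crawl4AI's AsyncWebCrawler interface and an illustrative URL:

import asyncio
import re

from crawl4ai import AsyncWebCrawler

async def find_pdf_links(url: str) -> list[str]:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
    # Pull anything that looks like a PDF link out of the rendered HTML.
    return re.findall(r'href="([^"]+\.pdf)"', result.html, flags=re.IGNORECASE)

print(asyncio.run(find_pdf_links("https://example.com/investors")))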

Monitoring & Observability

The platform includes a comprehensive observability stack covering metrics, logs, and task-level monitoring:

Monitoring Stack

  • Prometheus - Metrics collection and storage (port 9090)
  • Grafana - Metrics visualization and dashboards (port 3200)
  • Loki - Log aggregation (port 3100)
  • Promtail - Log shipper for container logs
  • Flower - Celery task monitoring (port 5555)
  • PostgreSQL Exporter - Database metrics (port 9187)
  • Redis Exporter - Cache and broker metrics (port 9121)

Key Metrics

Business Metrics:

  • Total companies processed
  • Total PDFs classified
  • Statements extracted per day
  • Data quality scores
  • Extraction success rates

Technical Metrics:

  • API latency (p50, p95, p99) - via Prometheus from FastAPI /metrics
  • Celery queue depth - monitored via Flower and Redis exporter
  • Task success/failure rates - tracked in Flower and Prometheus
  • LLM API costs and latency - tracked via custom metrics (sketched after this list)
  • Database query performance - PostgreSQL exporter metrics
  • Redis connection pool usage - Redis exporter metrics
  • Storage usage (MinIO) - via MinIO console
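
The custom LLM metrics above are plain prometheus_client instruments; a sketch (the metric names are assumptions, not the ones the platform ships):

from prometheus_client import Counter, Histogram

LLM_COST_USD = Counter(
    "llm_api_cost_usd_total",
    "Cumulative OpenRouter spend in USD",
    ["model", "task_type"],
)
LLM_LATENCY_SECONDS = Histogram(
    "llm_api_request_seconds",
    "OpenRouter request latency in seconds",
    ["model", "task_type"],
)

def record_llm_call(model: str, task_type: str, seconds: float, cost_usd: float) -> None:
    # Called after every OpenRouter request, success or failure.
    LLM_LATENCY_SECONDS.labels(model=model, task_type=task_type).observe(seconds)
    LLM_COST_USD.labels(model=model, task_type=task_type).inc(cost_usd)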

Alerts:

  • Task failure rate > 5%
  • LLM API errors (429, 500, timeout)
  • Queue backlog > 1000 tasks
  • Database connection pool exhaustion
  • Redis memory usage > 80%
  • Disk space < 10% free

Dashboards

Grafana Dashboards (Pre-configured):

  1. API Performance - Request latency, throughput, error rates from FastAPI
  2. Database Metrics - PostgreSQL connection pool, query performance, transaction rates
  3. Redis Metrics - Memory usage, connection count, command rates
  4. Celery Tasks - Task execution times, success/failure rates, queue depths (via Prometheus)
  5. Infrastructure - CPU, memory, disk, network usage

Log Aggregation:

  • All container logs aggregated via Promtail → Loki
  • Query logs via Grafana Explore view
  • Structured logging from FastAPI with request IDs
  • Celery worker logs with task context

Access

  • Grafana: http://localhost:3200 (admin/admin)
  • Prometheus: http://localhost:9090
  • Flower: http://localhost:5555
  • Loki: http://localhost:3100

Security Considerations

  1. Rate Limiting: Prevent abuse of expensive extraction endpoints
  2. Authentication: OAuth2 with JWT tokens
  3. API Keys: Secure storage of OpenRouter API keys (env vars)
  4. Input Validation: Sanitize company URLs to prevent SSRF (see the sketch below)
  5. File Validation: Verify PDFs, scan for malware
  6. Data Privacy: GDPR compliance for European companies
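
The SSRF check in item 4 deserves a concrete shape: resolve the user-supplied hostname and refuse anything private, loopback, or link-local before scraping. A minimal sketch (a hardened version would also pin the resolved IP for the actual request):

import ipaddress
import socket
from urllib.parse import urlparse

def assert_safe_url(url: str) -> None:
    """Basic SSRF guard: reject URLs that resolve to internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise ValueError(f"{parsed.hostname} resolves to blocked address {addr}")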

Quick Start

Get started quickly with the Getting Started Guide.

For detailed setup information, see the Infrastructure Development Setup.

Documentation

This documentation site provides comprehensive guides organized by category:

Getting Started

Architecture

Backend

  • Backend Overview - FastAPI backend architecture, database, services, and testing
    • Backend Architecture - Connection pool management, dependency injection, repository pattern, and exception handling
    • Backend Testing - pytest setup, unit tests, integration tests with testcontainers

Frontend

  • Frontend Overview - Next.js 15 frontend architecture, components, React Query, and testing

API

  • API Overview - REST API documentation and reference
    • API Reference - Complete REST API documentation with all endpoints, request/response formats, and examples

Database

Infrastructure

Testing

Development Tools

Frequently Asked Questions

What financial statements does the Financial Data Extractor support?

The platform extracts three core financial statements from annual reports: Income Statement (also called P&L or Statement of Operations), Balance Sheet (Statement of Financial Position), and Cash Flow Statement. It preserves the exact line item hierarchy and supports multi-year data extraction from a single document.
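
Concretely, an extracted statement might be shaped like the structure below; field names and values are placeholders, not the exact schema:

extracted = {
    "statement_type": "income_statement",
    "fiscal_year": 2023,
    "currency": "EUR",
    "line_items": [
        {"label": "Revenue", "value": 1000},
        {"label": "Cost of sales", "value": -400},
        {
            "label": "Operating expenses",  # hierarchy is preserved as nesting
            "children": [
                {"label": "Research and development", "value": -150},
                {"label": "Selling and marketing", "value": -120},
            ],
        },
    ],
}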

Which LLM models are supported?

The platform uses OpenRouter as an API gateway, providing access to models including OpenAI GPT-4o, GPT-4o-mini, and Anthropic Claude. The model is configurable per task type — you can use a lighter model for web scraping and a more capable model for financial statement extraction.
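
OpenRouter exposes an OpenAI-compatible chat API, so per-task model selection can be a lookup table in front of a single client. A sketch; the task-to-model mapping is an assumption, not the shipped configuration:

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Lighter model for classification, stronger model for statement extraction.
MODEL_BY_TASK = {
    "classification": "openai/gpt-4o-mini",
    "extraction": "openai/gpt-4o",
}

def complete(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_BY_TASK[task_type],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Swapping models then becomes a configuration change rather than a code change.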

How does it handle large PDF documents?

Financial annual reports can be 200+ pages. The extractor uses a smart page selection approach: PyMuPDF quickly extracts text from all pages, identifies pages containing financial keywords, then runs targeted table extraction (via camelot-py) only on those relevant pages. This reduces processing time from hours to minutes per document.
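
That page selection step is cheap because PyMuPDF's text extraction is fast; a sketch of the keyword filter (the keywords are illustrative):

import fitz  # PyMuPDF

KEYWORDS = ("income statement", "balance sheet",
            "cash flow", "statement of financial position")

def candidate_pages(pdf_path: str) -> list[int]:
    """Return page numbers likely to contain financial statements."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text().lower()
            if any(kw in text for kw in KEYWORDS):
                pages.append(page.number)
    return pages

Only the returned pages are then handed to camelot-py for table extraction.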

Can I add companies beyond the initial 6 European companies?

Yes. While the initial seed data includes AstraZeneca, SAP, Siemens, ASML, Unilever, and Allianz, any company with an investor relations website can be added dynamically via the REST API. The scraping pipeline will automatically discover and download their annual reports.

Is the Financial Data Extractor open source?

Yes. The entire platform is released under the Apache 2.0 License. The source code is available on GitHub.

License

Financial Data Extractor is released under the Apache 2.0 License. See the LICENSE file for more details.