Financial Data Extractor
An automated platform that scrapes, classifies, parses, and compiles multi-year financial statements (Income Statement, Balance Sheet, Cash Flow Statement) from European company investor relations websites.
Project Overview
The Financial Data Extractor automates the labor-intensive process of collecting and standardizing financial data from annual reports. It handles:
- Web Scraping: Automated discovery and download of annual reports from investor relations websites
- Document Classification: Intelligent categorization of PDFs (Annual Reports, Presentations, etc.)
- Data Extraction: LLM-powered parsing of financial statements from PDF documents
- Normalization: Fuzzy matching and deduplication of line items across multiple years
- Compilation: Aggregation of 10 years of financial data into unified views
Core Objectives
- Scrape & Classify: Identify and categorize PDFs from investor relations websites
- Parse: Extract financial data from Annual Reports using an LLM (via OpenRouter)
- Compile: Aggregate 10 years of financial data into unified views
- Deduplicate: Align and merge similarly-named line items across years
- Prioritize Latest: Use restated data from newer reports when available
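The "prioritize latest" rule can be sketched as a merge in which newer reports overwrite older ones for the same fiscal year. Field names and shapes below are illustrative, not the actual schema:

```python
# Sketch of restatement precedence: when the same fiscal year appears in
# several annual reports, keep the value from the most recently published
# report (restated figures supersede the originals).

def merge_restated(extractions: list[dict]) -> dict[int, float]:
    """Merge {report_year, values: {fiscal_year: amount}} records,
    letting newer reports overwrite older ones."""
    merged: dict[int, float] = {}
    # Process oldest report first, so later (newer) reports overwrite.
    for report in sorted(extractions, key=lambda r: r["report_year"]):
        merged.update(report["values"])
    return merged

reports = [
    {"report_year": 2022, "values": {2021: 100.0, 2022: 110.0}},
    {"report_year": 2023, "values": {2022: 112.0, 2023: 120.0}},  # 2022 restated
]
print(merge_restated(reports))  # {2021: 100.0, 2022: 112.0, 2023: 120.0}
```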
System Architecture
The Financial Data Extractor follows a modern, layered architecture with clear separation between frontend, API, processing, and data layers:
```mermaid
graph LR
    subgraph "User Interface"
        UI[Next.js Frontend<br/>React 19 + TypeScript]
    end

    subgraph "API & Services"
        API[FastAPI Backend<br/>REST API]
    end

    subgraph "Background Processing"
        QUEUE[Celery Task Queue]
        WORKERS[Celery Workers<br/>Scraping • Extraction • Compilation]
    end

    subgraph "Data Storage"
        DB[(PostgreSQL<br/>Companies, Documents,<br/>Extractions, Statements)]
        CACHE[(Redis<br/>Cache & Message Broker)]
        STORAGE[(MinIO<br/>PDF Storage)]
    end

    subgraph "External Services"
        LLM[OpenRouter<br/>LLM API]
    end

    UI -->|HTTP/REST| API
    API -->|Queue Tasks| QUEUE
    QUEUE --> WORKERS
    API --> DB
    API --> CACHE
    WORKERS --> LLM
    WORKERS --> STORAGE
    WORKERS --> DB
    WORKERS --> CACHE

    classDef ui fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef data fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    classDef external fill:#fce4ec,stroke:#880e4f,stroke-width:2px

    class UI ui
    class API api
    class QUEUE,WORKERS processing
    class DB,CACHE,STORAGE data
    class LLM external
```
Key Components:
- Frontend: Next.js 15 application providing user interface and data visualization
- Backend API: FastAPI REST API handling requests and business logic
- Task Processing: Celery workers for async operations (scraping, extraction, compilation)
- Data Layer: PostgreSQL for structured data, Redis for caching, MinIO for file storage
- LLM Integration: OpenRouter API gateway for flexible model selection
For detailed architecture information, see the Architecture Overview documentation.
Target Companies
- Initial Scope: 6 European companies seeded in database migrations
  - AstraZeneca PLC (AZN - LSE, NASDAQ)
  - SAP SE (SAP - XETRA, NYSE)
  - Siemens AG (SIE - XETRA)
  - ASML Holding N.V. (ASML - Euronext Amsterdam, NASDAQ)
  - Unilever PLC (ULVR - LSE, UNA - Euronext Amsterdam, UL - NYSE)
  - Allianz SE (ALV - XETRA)
- Scalable: Architecture supports adding more companies dynamically via API
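Registering an additional company might look like the payload below; the exact endpoint and field names are assumptions, so check the API Reference for the real contract:

```python
# Hypothetical request body for adding a company via the REST API.
import json

new_company = {
    "name": "Nestlé S.A.",                           # illustrative example
    "ticker": "NESN",
    "exchange": "SIX Swiss Exchange",
    "ir_url": "https://www.nestle.com/investors",    # scraping entry point
}
# e.g. POST /api/companies with this JSON body:
print(json.dumps(new_company, ensure_ascii=False, indent=2))
```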
Data Flow
The system processes financial data through three main phases:
1. Scraping & Classification - Discover and categorize PDFs from investor relations websites
2. Parsing & Extraction - Extract financial statements using an LLM
3. Normalization & Compilation - Normalize line items and compile 10 years of data
See the Data Flow documentation for detailed workflow information.
Technology Decisions
Key technology choices include:
- OpenRouter - LLM API gateway for flexible model selection
- PostgreSQL - JSONB support for flexible data structures
- Celery - Distributed task queue for async processing
- FastAPI - High-performance async API framework
- Next.js 15 - Modern React framework with Server Components
- React Query - Data fetching and caching
- MinIO - S3-compatible object storage
See the Technology Decisions documentation for detailed rationale behind each choice.
Technology Stack
Backend
- FastAPI - High-performance async web framework
- Celery - Distributed task queue for background processing
- PostgreSQL - Primary database with JSONB support
- Redis - Caching layer and Celery message broker
- SQLAlchemy - ORM for database operations
- Alembic - Database migrations
Frontend
- Next.js 15 - React framework with App Router
- React - UI library
- TailwindCSS - Utility-first CSS framework
- shadcn/ui - Component library
Processing & AI
- OpenRouter - LLM API gateway for accessing multiple models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
- PyMuPDF - PDF processing and table extraction
- pdfplumber - Alternative PDF text extraction
- rapidfuzz - Fuzzy string matching for line item normalization
- Crawl4AI - LLM-friendly web crawler for investor relations websites
Monitoring & Observability
The platform includes an observability stack covering metrics, log aggregation, and task monitoring:
Monitoring Stack
- Prometheus - Metrics collection and storage (port 9090)
- Grafana - Metrics visualization and dashboards (port 3200)
- Loki - Log aggregation (port 3100)
- Promtail - Log shipper for container logs
- Flower - Celery task monitoring (port 5555)
- PostgreSQL Exporter - Database metrics (port 9187)
- Redis Exporter - Cache and broker metrics (port 9121)
Key Metrics
Business Metrics:
- Total companies processed
- Total PDFs classified
- Statements extracted per day
- Data quality scores
- Extraction success rates
Technical Metrics:
- API latency (p50, p95, p99) - via Prometheus from FastAPI /metrics
- Celery queue depth - monitored via Flower and Redis exporter
- Task success/failure rates - tracked in Flower and Prometheus
- LLM API costs and latency - tracked via custom metrics
- Database query performance - PostgreSQL exporter metrics
- Redis connection pool usage - Redis exporter metrics
- Storage usage (MinIO) - via MinIO console
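The custom LLM cost and latency metrics can be sketched with prometheus_client; metric and label names below are illustrative, not the ones the backend actually registers:

```python
# Sketch of custom LLM metrics exported to Prometheus.
from prometheus_client import Counter, Histogram

LLM_COST_USD = Counter(
    "llm_cost_usd",                     # exported as llm_cost_usd_total
    "Cumulative LLM spend in USD",
    ["model", "task_type"],
)
LLM_LATENCY = Histogram(
    "llm_request_seconds",
    "LLM API request latency in seconds",
    ["model"],
)

def record_llm_call(model: str, task_type: str, cost_usd: float, seconds: float) -> None:
    LLM_COST_USD.labels(model=model, task_type=task_type).inc(cost_usd)
    LLM_LATENCY.labels(model=model).observe(seconds)

record_llm_call("gpt-4o-mini", "extraction", 0.0021, 3.4)
```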
Alerts:
- Task failure rate > 5%
- LLM API errors (429, 500, timeout)
- Queue backlog > 1000 tasks
- Database connection pool exhaustion
- Redis memory usage > 80%
- Disk space < 10% free
Dashboards
Grafana Dashboards (Pre-configured):
- API Performance - Request latency, throughput, error rates from FastAPI
- Database Metrics - PostgreSQL connection pool, query performance, transaction rates
- Redis Metrics - Memory usage, connection count, command rates
- Celery Tasks - Task execution times, success/failure rates, queue depths (via Prometheus)
- Infrastructure - CPU, memory, disk, network usage
Log Aggregation:
- All container logs aggregated via Promtail → Loki
- Query logs via Grafana Explore view
- Structured logging from FastAPI with request IDs
- Celery worker logs with task context
Access
- Grafana: http://localhost:3200 (admin/admin)
- Prometheus: http://localhost:9090
- Flower: http://localhost:5555
- Loki: http://localhost:3100
Security Considerations
- Rate Limiting: Prevent abuse of expensive extraction endpoints
- Authentication: OAuth2 with JWT tokens
- API Keys: Secure storage of OpenRouter API keys (env vars)
- Input Validation: Sanitize company URLs to prevent SSRF
- File Validation: Verify PDFs, scan for malware
- Data Privacy: GDPR compliance for European companies
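The SSRF check on company URLs can be sketched as: allow only http(s), resolve the hostname, and reject anything that lands on an internal address. This is an illustrative sketch; production validation may be stricter:

```python
# Sketch of SSRF-safe validation for user-supplied IR website URLs.
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_ir_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable host
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Block internal targets such as 127.0.0.0/8, 10.0.0.0/8, 169.254.0.0/16.
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_safe_ir_url("http://127.0.0.1/admin"))  # False
print(is_safe_ir_url("ftp://example.com"))       # False
```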
Quick Start
Get started quickly with the Getting Started Guide which includes:
- Installation - Complete setup instructions
- First Steps - Tutorial for your first extraction
For detailed setup information, see the Infrastructure Development Setup.
Documentation
This documentation site provides comprehensive guides organized by category:
Getting Started
- Getting Started Guide - Quick start and installation
- Installation - Detailed setup instructions
- First Steps - Tutorial for your first extraction
Architecture
- Architecture Overview - System design and architecture
- Data Flow - Detailed workflow from scraping to compilation
- Technology Decisions - Rationale behind technology choices
Backend
- Backend Overview - FastAPI backend architecture, database, services, and testing
- Backend Architecture - Connection pool management, dependency injection, repository pattern, and exception handling
- Backend Testing - pytest setup, unit tests, integration tests with testcontainers
Frontend
- Frontend Overview - Next.js 15 frontend architecture, components, React Query, and testing
- Frontend Architecture - Next.js 15 architecture, React Query, components, and development guide
- Frontend Testing - Vitest unit testing, React Testing Library, and testing strategies
- Frontend DevTools - React Query DevTools, ESLint plugin, and frontend debugging tools
API
- API Overview - REST API documentation and reference
- API Reference - Complete REST API documentation with all endpoints, request/response formats, and examples
Database
- Database Overview - Database schema, migrations, queries, and operations
- Database Schema - Table structures, relationships, and JSONB formats
- Database Migrations - Alembic migration commands and workflows
- Database Queries - Useful SQL queries for data inspection
Infrastructure
- Infrastructure Overview - Docker setup, development environment, and task processing
- Development Setup - Docker Compose setup, service management, and monitoring stack
- Task Processing - Celery task system, workers, Flower monitoring
- Object Storage - MinIO object storage setup and usage
Testing
- Testing Overview - Overview of testing strategies for backend and frontend
- Backend Testing - pytest guide for FastAPI backend
- Frontend Testing - Vitest guide for Next.js frontend
Development Tools
- Development Tools - IDE configuration and development environment
- Cursor IDE Configuration - Cursor rules, VS Code settings, debug configurations, and tasks
Frequently Asked Questions
What financial statements does the Financial Data Extractor support?
The platform extracts three core financial statements from annual reports: Income Statement (also called P&L or Statement of Operations), Balance Sheet (Statement of Financial Position), and Cash Flow Statement. It preserves the exact line item hierarchy and supports multi-year data extraction from a single document.
Which LLM models are supported?
The platform uses OpenRouter as an API gateway, providing access to models including OpenAI GPT-4o, GPT-4o-mini, and Anthropic Claude. The model is configurable per task type — you can use a lighter model for web scraping and a more capable model for financial statement extraction.
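Per-task model selection can be sketched as a simple mapping of task type to OpenRouter model slug; the configuration keys below are assumptions, not the actual settings:

```python
# Illustrative per-task model configuration using OpenRouter model slugs.
MODEL_BY_TASK = {
    "scraping_classification": "openai/gpt-4o-mini",        # cheap and fast
    "statement_extraction": "anthropic/claude-3.5-sonnet",  # higher accuracy
}

def model_for(task_type: str) -> str:
    """Pick the configured model, falling back to the cheap default."""
    return MODEL_BY_TASK.get(task_type, "openai/gpt-4o-mini")
```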
How does it handle large PDF documents?
Financial annual reports can be 200+ pages. The extractor uses a smart page selection approach: PyMuPDF quickly extracts text from all pages, identifies pages containing financial keywords, then runs targeted table extraction (via camelot-py) only on those relevant pages. This reduces processing time from hours to minutes per document.
Can I add companies beyond the initial 6 European companies?
Yes. While the initial seed data includes AstraZeneca, SAP, Siemens, ASML, Unilever, and Allianz, any company with an investor relations website can be added dynamically via the REST API. The scraping pipeline will automatically discover and download their annual reports.
Is the Financial Data Extractor open source?
Yes. The entire platform is released under the Apache 2.0 License. The source code is available on GitHub.
License
Financial Data Extractor is released under the Apache 2.0 License. See the LICENSE file for more details.