mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

Files

LearningCircuit 37a87297c3 docs: fix stale pool-size comments and NullPool references after #3441 (#3462 )

PR #3441 removed per-thread NullPool engines and changed pool_size
from 10→20 / max_overflow from 20→40, but several comments and docs
still referenced the old values and removed infrastructure.

- Update pool_size/max_overflow numbers in encrypted_db.py comments
- Remove dead ADR-0004 path reference (file never existed)
- Remove redundant has_per_database_salt() warning that fired on
  every cache-hit call (open_user_database already covers cache miss)
- Fix NullPool reference in processor_v2.py comment
- Fix stale thread_engines metric doc in connection_cleanup.py
- Fix stale dead-thread engine sweep comment in connection_cleanup.py
- Update architecture.md flowchart, FD budget table math (21→41,
  81→121), and key files table roles
- Update troubleshooting.md sweep description

2026-04-14 00:01:25 +02:00

20 KiB

Raw Blame History

Architecture Overview

This document provides detailed technical diagrams of Local Deep Research's architecture.

System Architecture

flowchart TB
    subgraph USER["👤 User Interface"]
        WEB[Web Browser<br/>localhost:5000]
    end

    subgraph FLASK["🌐 Flask Backend"]
        API[REST API + WebSocket]
        AUTH[Authentication<br/>CSRF Protection]
        ROUTES[Research Routes]
    end

    subgraph RESEARCH["🔬 Research Engine"]
        STRAT[Strategy Selector<br/>20+ strategies]
        QGEN[Question Generator]
        EXEC[Search Executor]
        SYNTH[Report Synthesizer]
    end

    subgraph LLM["🤖 LLM Providers"]
        direction TB
        LOCAL_LLM[/"🏠 Local LLMs"\]
        OLLAMA[Ollama]
        LMSTUDIO[LM Studio]
        CLOUD_LLM[/"☁️ Cloud LLMs"\]
        OPENAI[OpenAI]
        ANTHROPIC[Anthropic]
        GEMINI[Google Gemini]
        OPENROUTER[OpenRouter<br/>100+ models]
    end

    subgraph SEARCH["🔍 Search Engines"]
        direction TB
        LOCAL_SEARCH[/"🏠 Local Search"\]
        SEARXNG[SearXNG]
        ELASTIC[Elasticsearch]
        LIBRARY[Document Library]
        WEB_SEARCH[/"🌐 Web Search"\]
        TAVILY[Tavily]
        BRAVE[Brave]
        DDG[DuckDuckGo]
        ACADEMIC[/"📚 Academic"\]
        ARXIV[ArXiv]
        PUBMED[PubMed]
        SEMANTIC[Semantic Scholar]
    end

    subgraph STORAGE["💾 Storage Layer"]
        SQLCIPHER[(SQLCipher DB<br/>AES-256 Encrypted)]
        VECTORS[(Vector Store<br/>Embeddings)]
        FILES[(File Storage<br/>PDFs & Docs)]
    end

    subgraph OUTPUT["📄 Output"]
        MD[Markdown]
        PDF[PDF Export]
        LATEX[LaTeX]
        QUARTO[Quarto]
        RIS[RIS/BibTeX]
    end

    WEB <--> API
    API --> AUTH
    AUTH --> ROUTES
    ROUTES --> STRAT

    STRAT --> QGEN
    QGEN --> EXEC
    EXEC --> SYNTH

    EXEC <--> OLLAMA & LMSTUDIO
    EXEC <--> OPENAI & ANTHROPIC & GEMINI & OPENROUTER

    EXEC <--> SEARXNG & ELASTIC & LIBRARY
    EXEC <--> TAVILY & BRAVE & DDG
    EXEC <--> ARXIV & PUBMED & SEMANTIC

    SYNTH --> SQLCIPHER
    LIBRARY <--> VECTORS
    LIBRARY <--> FILES

    SYNTH --> MD & PDF & LATEX & QUARTO & RIS

    LOCAL_LLM ~~~ OLLAMA
    CLOUD_LLM ~~~ OPENAI
    LOCAL_SEARCH ~~~ SEARXNG
    WEB_SEARCH ~~~ TAVILY
    ACADEMIC ~~~ ARXIV

    style USER fill:#e1f5fe
    style FLASK fill:#fff3e0
    style RESEARCH fill:#f3e5f5
    style LLM fill:#e8f5e9
    style SEARCH fill:#fce4ec
    style STORAGE fill:#fff8e1
    style OUTPUT fill:#e0f2f1

Research Flow

flowchart LR
    subgraph INPUT["1️⃣ Input"]
        Q[Research Query]
    end

    subgraph ITERATE["2️⃣ Iterative Research"]
        direction TB
        GEN[Generate<br/>Questions]
        SEARCH[Parallel<br/>Search]
        ANALYZE[Analyze<br/>Results]
        GEN --> SEARCH --> ANALYZE
        ANALYZE -.->|"Need more info?"| GEN
    end

    subgraph SYNTHESIZE["3️⃣ Synthesis"]
        REPORT[Generate<br/>Report]
        CITE[Add<br/>Citations]
        REPORT --> CITE
    end

    subgraph OUTPUT["4️⃣ Output"]
        RESULT[Final Report<br/>with Sources]
    end

    Q --> GEN
    ANALYZE --> REPORT
    CITE --> RESULT

    style INPUT fill:#e3f2fd
    style ITERATE fill:#f3e5f5
    style SYNTHESIZE fill:#e8f5e9
    style OUTPUT fill:#fff3e0

Deployment Options

flowchart TB
    subgraph FULL_LOCAL["🏠 Fully Local (Maximum Privacy)"]
        direction LR
        L_LDR[LDR] <--> L_OLLAMA[Ollama]
        L_LDR <--> L_SEARX[SearXNG]
        L_LDR <--> L_DB[(Encrypted DB)]
    end

    subgraph HYBRID["⚡ Hybrid (Balanced)"]
        direction LR
        H_LDR[LDR] <--> H_OLLAMA[Ollama]
        H_LDR <-->|"Web Search"| H_CLOUD[Cloud APIs]
        H_LDR <--> H_DB[(Encrypted DB)]
    end

    subgraph CLOUD["☁️ Cloud-Powered (Maximum Speed)"]
        direction LR
        C_LDR[LDR] <--> C_OPENAI[OpenAI/Claude]
        C_LDR <--> C_TAVILY[Tavily/Brave]
        C_LDR <--> C_DB[(Encrypted DB)]
    end

    style FULL_LOCAL fill:#e8f5e9
    style HYBRID fill:#fff3e0
    style CLOUD fill:#e3f2fd

Feature Map

mindmap
  root((Local Deep<br/>Research))
    Research
      Quick Summary
      Detailed Reports
      Follow-up Questions
      20+ Strategies
    Search Sources
      Web Search
        SearXNG
        Tavily
        Brave
        DuckDuckGo
      Academic
        ArXiv
        PubMed
        Semantic Scholar
      Local
        Document Library
        Collections
        Elasticsearch
    LLM Support
      Local
        Ollama
        LM Studio
      Cloud
        OpenAI
        Anthropic
        Google
        OpenRouter
    Output
      Markdown
      PDF
      LaTeX
      Quarto
      RIS/BibTeX
    Features
      News Subscriptions
      Cost Analytics
      Benchmarking
      RAG Search
      Per-User Encryption

Component Details

LLM Providers

Provider	Type	Description
Ollama	Local	Self-hosted open-source models
LM Studio	Local	Desktop app for local models
OpenAI	Cloud	GPT-4, GPT-3.5
Anthropic	Cloud	Claude 3 family
Google	Cloud	Gemini models
OpenRouter	Cloud	100+ models via single API

Search Engines

Engine	Type	Best For
SearXNG	Local/Self-hosted	Privacy, aggregated results
Tavily	Cloud	AI-optimized search
ArXiv	Academic	Physics, CS, Math papers
PubMed	Academic	Biomedical research
Semantic Scholar	Academic	Cross-discipline papers
Wikipedia	Knowledge	General knowledge
Your Documents	Local	Private document search

Output Formats

Format	Use Case
Markdown	Default, web display
PDF	Sharing, printing
LaTeX	Academic papers
Quarto	Reproducible documents
RIS/BibTeX	Reference managers

Knowledge Loop: Research → Library → Future Research

One of LDR's powerful features is the ability to build a personal knowledge base that improves future research.

flowchart TB
    subgraph RESEARCH["1️⃣ Research"]
        Q[Your Question] --> Engine[Research Engine]
        Engine --> Results[Results + Sources]
    end

    subgraph DOWNLOAD["2️⃣ Download"]
        Results --> Track[Track Sources]
        Track --> Queue[Download Queue]
        Queue --> Extract[Download & Extract Text]
    end

    subgraph LIBRARY["3️⃣ Library"]
        Extract --> Store[(Document Storage)]
        Store --> Organize[Organize into Collections]
    end

    subgraph INDEX["4️⃣ Index"]
        Organize --> Chunk[Chunk Documents]
        Chunk --> Embed[Generate Embeddings]
        Embed --> FAISS[(Vector Index)]
    end

    subgraph REUSE["5️⃣ Reuse"]
        FAISS --> SearchEngine[Collection as Search Engine]
        SearchEngine --> Q
    end

    style RESEARCH fill:#e3f2fd
    style DOWNLOAD fill:#fff3e0
    style LIBRARY fill:#e8f5e9
    style INDEX fill:#f3e5f5
    style REUSE fill:#fce4ec

How It Works

Research Completes → Sources are tracked in ResearchResource table
Download Sources → Click "Get All Research PDFs" to queue downloads
- Smart downloaders for ArXiv, PubMed, Semantic Scholar, etc.
- Automatic text extraction from PDFs
Build Library → Documents stored in encrypted database
- Deduplication via content hash
- Multiple storage modes: database (encrypted), filesystem, text-only
Create Collections → Organize documents by topic/project
- Each collection can have different embedding settings
- Documents can belong to multiple collections
Index for Search → Generate vector embeddings
- Configurable chunk size and overlap
- FAISS index for fast similarity search
Use in Future Research → Select collection as search engine
- RAG search finds relevant passages
- Results cite back to your documents

Key Components

Component	Purpose
`DownloadService`	Manages PDF downloads with source-specific strategies
`LibraryService`	Queries and manages document library
`LibraryRAGService`	Creates vector indices for semantic search
`CollectionSearchEngine`	Searches collections using RAG
`Document`	Stores text content, metadata, file references
`DocumentChunk`	Stores indexed text chunks with embeddings
`Collection`	Groups documents with shared embedding settings

Storage Options

Mode	Security	Use Case
Database	AES-256 encrypted	Default, maximum security
Filesystem	Unencrypted	Need external tool access
Text Only	Encrypted text, no PDFs	Minimal storage

Technical Analysis & Project Health

Last updated: December 2024

This section provides a comprehensive technical analysis of the codebase, including quality metrics, architecture patterns, and project health indicators.

Project Statistics

Metric	Count
Test Classes	809+
Search Engine Implementations	25
LLM Provider Implementations	9
Search Strategies	20+
Abstract Base Classes	26
CI/CD Workflows	57
Security Scanners in CI	22+
Core Dependencies	63

Architecture Patterns

Extensibility Design

The codebase follows a consistent pattern for extensibility:

┌─────────────────────────────────────────────────────────────┐
│                    Abstract Base Classes                     │
├─────────────────────────────────────────────────────────────┤
│ BaseSearchEngine      │ Common interface for 25+ engines    │
│ BaseSearchStrategy    │ Strategy pattern for research       │
│ BaseCitationHandler   │ Citation processing abstraction     │
│ BaseQuestionGenerator │ Question generation interface       │
│ BaseExporter          │ Export format abstraction           │
└─────────────────────────────────────────────────────────────┘

Search Engine Plugin System

New search engines can be added by:

Creating a class inheriting from BaseSearchEngine
Placing it in web_search_engines/engines/
Auto-discovery handles registration

# Example: Adding a new search engine
class SearchEngineCustom(BaseSearchEngine):
    def run(self, query: str) -> List[Dict]:
        # Implementation
        pass

LLM Provider Integration

Supports 9 LLM providers with auto-discovery:

Ollama (local)
LM Studio (local)
OpenAI
Anthropic
Google Gemini
OpenRouter (100+ models)
DeepSeek
Mistral
Groq

Quality Ratings

pie title Project Health Ratings
    "Code Quality (86)" : 86
    "Security (93)" : 93
    "Extensibility (93)" : 93
    "Performance (85)" : 85
    "UX/DX (83)" : 83
    "Maintenance (89)" : 89

Detailed Ratings

Category	Score	Highlights
Code Quality	86/100	809+ test classes, ruff/mypy enforcement, comprehensive pre-commit hooks
Security	93/100	SQLCipher AES-256 encryption, full SHA pinning in CI, 22+ security scanners
Extensibility	93/100	26 abstract base classes, plugin architecture, strategy pattern throughout
Performance	85/100	Adaptive rate limiting, cache stampede protection, parallel search execution
UX/Developer Experience	83/100	Real-time WebSocket updates, in-tool documentation, comprehensive error handling
Maintenance	89/100	57 CI/CD workflows, automated security scanning, structured changelog
Overall	88/100	Production-ready with excellent security and extensibility

Security Architecture

flowchart TB
    subgraph SECURITY["🔐 Security Layers"]
        direction TB
        AUTH[Flask Session Auth<br/>CSRF Protection]
        ENCRYPT[SQLCipher<br/>AES-256 Encryption]
        SCAN[22+ Security Scanners<br/>CodeQL, Semgrep, Bandit...]
        PIN[Full SHA Pinning<br/>All 57 CI Workflows]
    end

    AUTH --> ENCRYPT
    ENCRYPT --> SCAN
    SCAN --> PIN

    style SECURITY fill:#e8f5e9

Security Features:

Per-user encrypted databases (SQLCipher with AES-256)
Full GitHub Action SHA pinning (not tag-based)
Comprehensive CI security scanning:
- CodeQL (Python, JavaScript)
- Semgrep (custom rulesets)
- Bandit (Python security)
- Trivy (container scanning)
- Dependency review
- Secret scanning

Performance Optimizations

Feature	Implementation
Parallel Search	`concurrent.futures.ThreadPoolExecutor` for multi-question search
Rate Limiting	Adaptive system with `learning_rate=0.3`
Cache Protection	`fetch_events` + `fetch_locks` for stampede prevention
Progress Streaming	SocketIO for real-time UI updates
Cross-Engine Filtering	LLM-powered relevance scoring and deduplication

Thread & Resource Lifecycle

Unlike typical web apps that share a single database connection pool, this application maintains separate database engines per user because each user has their own encrypted SQLite file with a unique SQLCipher key (SQLCipher is an encrypted extension of SQLite). This creates a threading challenge: every background thread needs its own engine for the user it serves.

A single QueuePool engine is kept per user (pool_size=20, max_overflow=40, pool_timeout=10) in DatabaseManager.connections[username]. It is created with check_same_thread=False, so it is shared safely across Flask request-handler threads AND background threads (research workers, scheduler jobs, metric writers). All of these acquire sessions from the same pool, which keeps FD usage bounded by pool_size + max_overflow per user.

Earlier revisions of the code maintained a second, parallel system of per-(username, thread_id) NullPool engines for background metric writes. That system was removed because orphaned entries leaked SQLCipher + WAL file handles (3 FDs per active connection) under load, eventually exhausting the 1024 FD soft limit. Routing all work through the single per-user QueuePool made FD usage bounded and let us delete ~200 lines of thread-engine bookkeeping.

Resource Cleanup Layers

Cleanup is defense-in-depth with multiple layers:

Layer	Trigger	What it cleans
`@thread_cleanup` decorator	Thread function exit (normal or exception)	DB session (returned to per-user pool), settings context, search context
`finally` blocks	Per-research in `run_research_process()`	Search engine HTTP sessions, strategy thread pool executors
`teardown_appcontext`	After each HTTP request	QueuePool session, thread-local session, triggers dead-thread credential sweep
Periodic pool dispose	Every 30 min	Calls `engine.dispose()` on per-user QueuePool engines to release SQLCipher+WAL file handles that accumulate from out-of-order connection closes
Logout cascade	User logout	Scheduler unregister (removes password) → DB close (`engine.dispose()`) → session destroy
Stale session cleanup	`before_request` (~1% of requests, sampled)	Clears Flask sessions for users whose DB connection is gone

Research Thread Lifecycle

flowchart TD
    A["POST /api/start_research"] --> B{"Slots available?"}
    B -- "Yes (direct)" --> C["start_research_process()"]
    B -- "No (queued)" --> D["Database queue"]
    D --> E["_process_queue_loop<br/>(daemon thread, 10s poll)"]
    E --> C

    C --> F["Research thread<br/>(daemon, semaphore-gated)"]
    F --> G["@thread_cleanup wraps<br/>run_research_process()"]

    G --> H["Strategy.find_relevant_information()"]
    H --> I["ThreadPoolExecutor sub-tasks<br/>(within strategies only)"]
    I --> J["Worker @thread_cleanup:<br/>close worker DB session"]

    J --> K["finally: search_engine.close()<br/>system.close() → strategy.close()"]
    K --> L["Main @thread_cleanup:<br/>close DB session, return connection to pool"]

    subgraph sweep ["Defense-in-depth (independent timers)"]
        M["Dead-thread credential sweep<br/>(~60s via processor_v2)"] -.-> N["Remove stale credentials<br/>for dead thread IDs"]
        O["Pool dispose<br/>(30 min via connection_cleanup)"] -.-> P["engine.dispose() on all<br/>QueuePool engines"]
    end

    style sweep fill:#f0f0f0,stroke:#999

FD Budget

Each SQLCipher connection in WAL (Write-Ahead Logging) mode uses 2 file descriptors (main db + WAL file). All connections to the same database within a process share 1 SHM (shared-memory) file descriptor. The formula per user database is: connections × 2 + 1.

Component	FD Formula	With defaults
QueuePool (steady state)	`logged_in_users × (pool_size × 2 + 1)`	`users × 41` FDs
QueuePool (peak)	`logged_in_users × ((pool_size + max_overflow) × 2 + 1)`	`users × 121` FDs

Default Linux ulimit is 1024 soft (bare metal), which is tight for multi-user deployments. Docker's daemon default (typically 1M+) is adequate. QueuePool engines are created at login and disposed at logout, so only active users consume FDs.

For more on diagnosing FD exhaustion, see Troubleshooting - Resource Exhaustion.

Key Files

File	Role
`src/local_deep_research/database/encrypted_db.py`	`DatabaseManager`, engine lifecycle, pool management
`src/local_deep_research/database/thread_local_session.py`	`@thread_cleanup` decorator, thread-local sessions, credential cleanup
`src/local_deep_research/web/app_factory.py`	`teardown_appcontext` handler, cleanup orchestration
`src/local_deep_research/web/services/research_service.py`	Research thread creation, `run_research_process()`
`src/local_deep_research/web/queue/processor_v2.py`	Queue processing, credential cleanup trigger

Areas for Improvement

While the project scores highly overall, these areas have room for growth:

Integration Testing - More end-to-end tests for full research workflows
API Documentation - OpenAPI/Swagger spec for REST endpoints
Metrics Dashboard - Prometheus/Grafana integration for monitoring
Resource Observability - Expose FD count, thread count, and connection pool stats in /api/v1/health; add periodic sweep logging
Async Architecture - Migration to async/await for I/O-bound operations

Key Source Files

Component	Location	Purpose
Research Engine	`src/local_deep_research/search_system.py`	Main `AdvancedSearchSystem` class
Strategies	`src/local_deep_research/advanced_search_system/strategies/`	20+ research strategies
Search Engines	`src/local_deep_research/web_search_engines/engines/`	25 search engine implementations
Report Generation	`src/local_deep_research/report_generator.py`	`IntegratedReportGenerator`
Web API	`src/local_deep_research/web/routes/`	Flask routes and WebSocket handlers
Database	`src/local_deep_research/web/database/`	SQLCipher models and migrations
Encrypted DB	`src/local_deep_research/database/encrypted_db.py`	Per-user SQLCipher engine lifecycle
Thread Sessions	`src/local_deep_research/database/thread_local_session.py`	Thread-safe session management and cleanup
Settings	`src/local_deep_research/config/`	Configuration and LLM setup

Contributing to Architecture

When extending the system:

Adding Search Engines: Inherit from BaseSearchEngine, implement run() method
Adding Strategies: Inherit from BaseSearchStrategy, implement analyze_topic() method
Adding LLM Providers: Add to config/llm_config.py with proper initialization
Adding Export Formats: Inherit from base exporter pattern in utilities/

See CONTRIBUTING.md for detailed guidelines.

20 KiB Raw Blame History Unescape Escape