mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-16 20:10:39 +03:00
7d8e02a7e2d184d1e823cbf5dfd24ccb4ed4c548
8 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e79a9fb76a |
docs(resource-cleanup) + fix: Round 9 audit results + per-user lock-dict cleanup (#4077)
* docs(resource-cleanup): Round 9 audit results + conditional deferred fixes
Capture the Round 9 broader-resource-audit results so:
- Future contributors don't re-audit the same paths
- If a relevant production symptom ever appears, the doc points
directly at a pre-thought-through conditional fix
Round 9 ran two passes of three parallel agents looking for resource
leaks BEYOND FDs (memory/cache growth, thread/lock lifecycle, DB state
hygiene). Round 1 produced six HIGH-confidence findings; Round 2
verification refuted four of them and downgraded one.
Added to the audit ledger (next to the existing Wave 7 entry):
- Refuted findings with WHY they were refuted:
- @cache on get_available_providers (called with None — hashable,
cardinality 1; dicts would raise TypeError, not cache silently)
- ThreadLocalSession identity-map growth (expire_on_commit=True
default clears the map on every commit)
- token_usage table unbounded growth (design-intentional permanent
audit table; time-series compound indexes; /api/context-overflow
queries historical windows)
- search_calls table unbounded growth (same shape and verdict)
- Three per-user lock dicts (_user_init_locks, _user_locks,
_user_critical_locks): technically correct that they never clean
up on user delete, but at ~296 bytes per entry × 3 dicts ≈ 900 bytes
per user, the ceiling is ~900 KB at 1000 users. Practically negligible.
- app_logs (ResearchLog) table — the one finding that survived
verification as a real but small concern. No auto-retention; only
cleaned by cascade-delete when parent Research row is manually
removed. For users keeping all research, logs accumulate.
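The first refuted finding above rests on how ``functools.cache`` treats its arguments. A minimal sketch of that behavior follows; the function body and return value are placeholders, not the project's real implementation, and only the decorator semantics are the point.

```python
from functools import cache

calls = 0


@cache
def get_available_providers(settings_snapshot=None):
    """Stand-in for the audited function; the body is hypothetical."""
    global calls
    calls += 1
    return ("provider-a", "provider-b")


# None is hashable, so repeated calls share a single cache entry.
get_available_providers(None)
get_available_providers(None)
entries = get_available_providers.cache_info().currsize

# A dict argument is unhashable: the wrapper raises instead of caching.
try:
    get_available_providers({"key": "value"})
    dict_raised = False
except TypeError:
    dict_raised = True
```

With only ``None`` ever passed, the cache holds one entry, which is why the audit treated its growth as bounded.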
Added to "Intentionally not done (deferred)":
- app_logs retention setting + scheduled cleanup job. Includes the
trigger conditions that would justify the work and the
implementation sketch (settings key, daily APScheduler job,
regression test, news fragment).
- Per-user lock dict cleanup on user delete. Cosmetic; included with
trigger conditions and one-line-per-file sketch so it's actionable
if multi-user deployments ever see it.
No code changes. Documentation only.
* fix(resource-cleanup): pop per-user lock-dict entries on user close
Three module-level per-user lock dicts had no removal hook, so each
accumulated one ``threading.Lock`` entry per username over the
process lifetime:
- ``_user_init_locks`` in ``database/library_init.py`` (serializes
collection-init check-then-insert)
- ``_user_locks`` in ``database/backup/backup_service.py`` (per-user
backup serialization)
- ``_user_critical_locks`` on ``QueueProcessorV2`` (per-user
count-then-start critical section)
The cost was ~296 bytes/entry × 3 dicts ≈ 900 bytes per user
across all three — bounded by total user count, microscopic relative
to the eventpoll FD leak that motivated the original investigation,
but real for long-lived multi-user instances with user-account
churn. Identified in Round 9 of the broader resource-leak audit
(see docs/developing/resource-cleanup.md).
The fix:
- Each module now exposes a ``pop_user_*_lock(username)`` function
(or method for the QueueProcessor instance dict) that pops the
entry under the existing per-dict lock.
- ``connection_cleanup._pop_per_user_locks(username)`` is a shared
helper that lazy-imports and calls all three with individual
try/except blocks so a failure in one doesn't skip the others.
- The helper is invoked from both user-close paths:
- ``cleanup_idle_connections`` (the 5-minute sweeper) after
``db_manager.close_user_database``
- ``web/auth/routes.py`` after the logout and password-change
``close_user_database`` calls
Pattern mirrors the existing per-user cleanup in those code paths
(scheduler.unregister_user, session_password_store.clear_*).
Tests in ``tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks``:
- Unit test: populate all three dicts, call the helper, assert all
three entries are gone.
- Idempotency: pop on a never-registered user must not raise.
- Integration: ``cleanup_idle_connections`` actually invokes the
helper for each user it closes (verified via the library-init dict
for "alice").
Doc updated: the entry that R9A2 identified as "technically correct,
practically negligible" is moved from the audit ledger's findings
list into a "Fixed in this PR" subsection; the matching
deferred-fix entry in "Intentionally not done" is removed.
Adds a towncrier bugfix fragment.
* review: address self-review findings on PR #4077
Four fixes from a multi-round agent review of this PR:
1. Move ``_pop_per_user_locks`` outside the ``close_user_database``
try/except in ``connection_cleanup.py``. Previously the pop was
inside the same try block, so a DB-close failure — the path that
``test_close_failure_does_not_abort_loop`` already exercises —
skipped the pop and leaked the lock-dict entry. New test
``test_pop_runs_even_when_close_user_database_fails`` pins this.
2. Bump three ``logger.debug`` calls to ``logger.warning`` in
``_pop_per_user_locks``. Matches the sibling scheduler-unregister
error handler in the same module; debug-level silently masked
lock-dict accumulation across cycles.
3. Doc accuracy fix in ``docs/developing/resource-cleanup.md``.
The entry called all three dicts "module-level" but
``_user_critical_locks`` is an instance attribute on
``QueueProcessorV2``. Rewrote to distinguish module-level dicts
from the instance attribute and to note the pop now runs outside
the close try/except.
4. Integration test pre-populates all three lock-dict entries and
asserts all three are absent post-cleanup, not just
``_user_init_locks``. Switched the test username from "alice"
(used by other tests in the module) to a dedicated sentinel.
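Item 1 above is a control-flow change, sketched here with stand-in callables. ``close_users`` and ``flaky_close`` are hypothetical names used only for illustration; the real sweeper is ``cleanup_idle_connections``, whose body is not shown in this message.

```python
def close_users(usernames, close_user_database, pop_per_user_locks):
    """Close each user's DB, then always drop that user's lock entries."""
    failed = []
    for username in usernames:
        try:
            close_user_database(username)
        except Exception:
            # A close failure is recorded but must not abort the loop.
            failed.append(username)
        # Outside the try block: runs whether or not the close succeeded.
        pop_per_user_locks(username)
    return failed


def flaky_close(username):
    """Simulated close that fails for one user."""
    if username == "b":
        raise OSError("simulated close failure")


popped = []
failed = close_users(["a", "b", "c"], flaky_close, popped.append)
```

Had the pop stayed inside the ``try`` after the close call, user ``b`` would have been skipped, which is the leak the new test pins.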
Tests: 23/23 in ``tests/web/auth/test_connection_cleanup.py``.
Race-condition concerns flagged in Round 1 (TOCTOU between pop and
``_get_user_*_lock``) were verified in Round 2 to be either guarded
by Python's reference-counted lock semantics (``with lock:`` keeps
the original lock object alive after dict pop) or bounded to a +1
race window across multiple browser sessions — not ship-blocking.
The narrow ``library_init`` race surfaces as a propagated
``IntegrityError`` on collection insert, not silent corruption.
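The reference-semantics argument above can be checked with a small sketch. The dict and key here are illustrative, not the project's structures; the point is only that popping a dict entry does not invalidate a lock another caller already holds.

```python
import threading

locks = {"user": threading.Lock()}

lock = locks["user"]  # local reference, as a caller of `with lock:` would hold
with lock:
    locks.pop("user", None)     # cleanup removes the dict entry mid-section
    still_held = lock.locked()  # the original lock object is alive and held

# A later caller finds no entry and creates a fresh, distinct lock.
new_lock = locks.setdefault("user", threading.Lock())
distinct = new_lock is not lock
```

The second half shows the bounded race: a caller arriving after the pop gets a different lock object, so it is not serialized against the holder of the old one.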
|
||
|
|
6f18a711d2 |
docs(resource-cleanup): expand Wave 7 with full audit ledger (#4054)
* docs(resource-cleanup): expand Wave 7 with full audit ledger
Replaces the brief "follow-up gaps" bullet list with the full ledger of what the broader audit during #4047 actually examined, split into four scannable subsections:
- Checked and confirmed clean: non-Ollama LLM providers, HTTP session lifecycle, subprocess/pidfd, asyncio loops, file handles, SocketIO connect/disconnect.
- Flagged then verified NOT a real FD leak: OllamaEmbeddings (uses the deprecated langchain_community class with no httpx client), auth_db + journal_quality engines escaping shutdown_databases (bounded pools, not growing), LibraryRAGService in three RAG SSE endpoints (RAM churn, no FDs: FAISS uses pickle.load, embeddings hold no FDs per the item above, SentenceTransformer mmaps are process-wide singletons).
- Minor findings: daemon threads without explicit shutdown, abandoned-research cleanup on socket disconnect. Both are reaped at process exit, not steady-state leaks.
- Future-proofing note: ``langchain_community.embeddings.OllamaEmbeddings`` is deprecated; the replacement ``langchain_ollama.OllamaEmbeddings`` DOES carry ``_client`` and ``_async_client`` (verified by direct introspection), so when LDR migrates the in-running-loop eventpoll leak class will reappear for embeddings unless ``_close_base_llm`` is generalized.
Direct introspection done at audit time confirms each verdict: ``[a for a in dir(e) if 'client' in a.lower()]`` returned ``[]`` for the deprecated class and a non-empty list for the new class.
This ledger saves the next contributor from re-running the same agent sweep when investigating a future FD spike.
No code changes.
* docs(resource-cleanup): add Round-8 pidfd finding (fixed by #3971)
The Wave 7 ledger covered the eventpoll-FD investigation but didn't mention the residual pidfd accumulation we discovered post-merge.
A follow-up Round-8 investigation (8 parallel agents, 2 rounds + direct /proc inspection on a live prerelease container) traced ~3.6 pidfds/hour, steady-state ~29, to:
_check_subscription
→ quick_summary
→ FullSearchResults.batch_fetch_and_extract
→ AutoHTMLDownloader fallback
→ PlaywrightHTMLDownloader._fetch_with_playwright
→ sync_playwright().start()
→ asyncio.create_subprocess_exec(node-driver)  # opens pidfd
→ driver fails (Chromium not installed in production ldr stage)
→ pidfd not closed on the failed-child exit
CPython 3.14 ruled out as a confounder: subprocess.py uses waitpid(WNOHANG) polling, never opens pidfds. Only asyncio.create_subprocess_* and multiprocessing.Process can open them on Linux + Python 3.9+ via PidfdChildWatcher.
PR #3971 (already merged) addresses this from a different angle: it makes web.enable_javascript_rendering default false, so AutoHTMLDownloader short-circuits before invoking Playwright. No subprocess spawned → no pidfd opened. Original motivation for #3971 was the confusing tracebacks reported in #3826; the FD-leak finding is the second motivation, captured here so a future reader sees both.
The new bullet sits in Section B (flagged-then-verified-then-fixed) because the leak was real but is now resolved upstream.
* docs(resource-cleanup): add FD-leak debugging playbook + CI considerations
Add a new "Debugging FD leaks — playbook for the next one" section between the History (Waves 1-7) and "Intentionally not done" parts of the doc, capturing the diagnostic flow we developed across Waves 6 and 7 so future contributors don't re-derive it from scratch.
Includes:
- Symptoms that justify treating an issue as an FD leak (OSError 24, static-asset MIME errors, High FD count warnings, healthcheck hangs).
- Host-side and inside-container snapshot scripts that work even when the container is too FD-starved for docker exec (host-side via sudo + /proc/$P/fd) and through the entrypoint's UID drop (--user 0 to docker exec).
- Lookup table mapping each anon_inode / socket / pipe / REG flavor to its likely Python-level source and the path to deep-dive (e.g. /proc/PID/fdinfo/N's Pid: line for pidfds).
- A pinpointing recipe per FD type: eventpoll (asyncio/httpx), pidfd (asyncio.create_subprocess / multiprocessing.Process), WAL/SHM (SQLCipher engine.dispose).
- Pointer to the existing in-codebase instrumentation: _count_open_fds, the periodic Resource monitor log, fd_monitor.py, and the RUN_MANUAL_SMOKE-gated tests/manual_smoke/test_fd_smoke.py harness.
- Honest discussion of why an automated per-PR FD-growth assertion is hard (transient FDs, CI-environment subprocess noise, namespace differences, slow-drip leaks needing hours of uptime) and what a nightly long-run job would look like if the team chooses to invest in one.
- A "which Wave fixed which leak class" reference table so the next reporter can recognize a class and skip to the relevant precedent.
No code changes. Pure documentation.
* docs(resource-cleanup): add development-time detection + bpftrace recipes
Extend the FD-leak debugging playbook with two industry-standard techniques that would have caught Waves 6 and 7 earlier, drawn from upstream Python docs and the wider production-tracing literature:
1. **bpftrace syscall-level pinpointing** (in the per-FD-type section). Trace pidfd_open / epoll_create1 / etc. on the host targeting the container's host PID; produces a histogram of every user stack that triggered the syscall, ranked by frequency. The hot stacks are the culprits. Would have caught the Playwright pidfd leak in seconds.
2. **Development-time detection (new subsection 4a)**, catching leaks at test time before they ship:
- PYTHONASYNCIODEBUG=1 + -W default::ResourceWarning. Per the asyncio dev docs, unclosed transports emit ResourceWarning at GC time; the filter actually displays them. Would have surfaced the Wave 7 in-running-loop skip in any test that exercised ainvoke + safe_close on ChatOllama.
- python -X dev for a one-flag local dev mode bundling ResourceWarning + asyncio debug + warnings as default.
- pyproject.toml [tool.pytest.ini_options] examples for both "display" and "error" filter modes (with a caveat that error mode needs a targeted subset, not the whole suite, because third-party libs also emit ResourceWarning).
- psutil's num_fds / open_files / connections as the cross-platform alternative to /proc/self/fd for unit tests on macOS dev environments.
- tracemalloc + objgraph as the next-level tool when a leak is reproducible: diff allocations before/after, then render the reference chain holding the leaked wrapper alive.
No code changes. The new tooling is recommendations only; no mandatory pytest config change in this commit. Future work could enable PYTHONASYNCIODEBUG=1 in the CI test environment if the overhead is acceptable.
Citations to docs.python.org are inline for the load-bearing ResourceWarning claim.
* test(fd-canary): pin asyncio.create_subprocess pidfd lifecycle in CI
Add ``TestAsyncioSubprocessFDBaseline`` to ``tests/utilities/test_close_base_llm.py`` with two regression tests that run on every PR:
1. ``test_no_fd_growth_across_asyncio_subprocess_cycles``: spawns ``/bin/true`` via ``asyncio.create_subprocess_exec`` 10 times and asserts total FD count delta ≤ +2. Pins the pidfd FD class against the child-watcher leak shape.
2. ``test_no_fd_growth_when_subprocess_fails_to_exec``: same shape but with a deliberately-missing binary, mirroring the *exact* Wave-7 production failure mode (Playwright's Node.js driver being spawned, kernel returning ENOENT because Chromium wasn't installed, child watcher still expected to clean up the pidfd it opened *before* the failed exec).
Why this is the right level
---------------------------
LDR's own code does NOT call ``asyncio.create_subprocess_*`` (verified in R8C1). The production leak came from a transitive dependency (Playwright). So we cannot test LDR's call sites directly; there are none.
Instead these tests pin the *platform baseline*: on this Python version, repeated asyncio subprocess cycles must not leak FDs. If a future Python upgrade, a child-watcher change, or a new direct asyncio.create_subprocess call in LDR breaks the close semantics, the next PR's CI fails on these tests, which is the canary signal we want.
Linux-only via ``sys.platform != "linux"`` skip. pidfd_open is a Linux syscall; macOS uses a different watcher and Windows uses ProactorEventLoop. Both 'pass by virtue of nothing to leak', so restricting to Linux keeps the signal sharp (a failure on Linux is actionable; a pass on macOS is uninformative).
Same +2 FD slack we use for the eventpoll canary above. A real 1-FD-per-iter leak across 10 iterations would land at delta=10, well past the threshold.
Doc reference
-------------
Updated ``docs/developing/resource-cleanup.md`` "Existing instrumentation" section to enumerate all four in-CI FD-growth canaries (two eventpoll, two pidfd) so future contributors see at a glance what's already guarded and where to extend coverage when a new leak class is found. |
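The first canary in the test(fd-canary) commit above can be sketched roughly as below. This is not the repository's test: it spawns ``sys.executable -c pass`` instead of ``/bin/true`` for portability, the helper names are hypothetical, and the warm-up spawn is an added assumption to keep one-time setup out of the measurement.

```python
import asyncio
import os
import sys


def count_open_fds() -> int:
    """Count this process's open FDs via /proc (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


async def fd_delta_across_subprocess_cycles(cycles: int = 10) -> int:
    """Spawn a trivial child `cycles` times and return the FD-count change."""
    # Warm-up spawn so one-time setup is not counted as growth.
    warmup = await asyncio.create_subprocess_exec(sys.executable, "-c", "pass")
    await warmup.wait()

    before = count_open_fds()
    for _ in range(cycles):
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", "pass"
        )
        await proc.wait()
    return count_open_fds() - before


# pidfd_open is Linux-specific, so the check is skipped elsewhere.
if sys.platform == "linux":
    delta = asyncio.run(fd_delta_across_subprocess_cycles())
    assert delta <= 2, f"FD count grew by {delta}"
```

The +2 slack follows the message above: a leak of one FD per iteration would show up as a delta near 10.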
||
|
|
3d0b7bb5f9 |
review: hoist asyncio+threading imports to module level + Wave 7 doc (#4048)
Addresses the AI Code Review nit on #4047: ``import threading`` (and the sibling ``import asyncio``) lived inside the ``_close_base_llm`` function body. There's no circular-import or optional-dependency reason to defer them; moving them to the top of the module improves readability and static analysis.
Also extends ``docs/developing/resource-cleanup.md`` with a Wave 7 entry documenting:
- The in-running-loop ``aclose`` skip bug (this PR's fix).
- The healthcheck ``pidfd`` leak (Dockerfile change in the same PR).
- The three gaps the broader audit during this PR surfaced as follow-up rather than in-scope work: ``OllamaEmbeddings`` httpx (same FD class as ChatOllama, no close path in langchain wrappers), ``auth_db`` / ``journal_quality`` engines escaping ``shutdown_databases``, and three RAG SSE endpoints constructing ``LibraryRAGService`` before the generator without a ``finally`` close.
Also captures the negative results from the audit (non-Ollama providers safe via shared lru_cache, no subprocess pidfd risk, no raw event-loop creation, all ``open()`` calls inside ``with``) so a future contributor reading the history sees what was checked and ruled out. |
||
|
|
5ede95d3b4 |
docs(developing): add resource-cleanup.md capturing the FD-leak campaign (#3856)
Adds a single contributor-facing doc that explains how LDR manages process-level resources (DB sessions, LLM HTTP clients, search engines, threads) and the reasoning trail behind the current model.
Why this isn't an ADR: ADRs (`docs/decisions/`) are for single-decision records. This doc is wider: it captures current architecture, a how-to cookbook, anti-patterns specific to this codebase, the chronological history of the FD-leak fix campaign (#1832 through #3855), and the deferred work list with reasoning.
The history section consolidates ~14 weeks of iterative work across 15+ PRs into a single archive so future contributors hitting FD-shaped issues can see what's been tried, what worked, and what was ruled out without reconstructing it from `git log`. The "intentionally not done" section preempts re-discovery of deferred work as missing work.
Related to #3816. The companion code fix is #3855.
Co-authored-by: Daniel Petti <djpetti@gmail.com>
Co-authored-by: r69 <143521130+r69shabh@users.noreply.github.com>
Co-authored-by: Chris Dzombak <chris@dzombak.com> |
||
|
|
061cd83dd4 |
feat: add is_lexical flag to auto-enable LLM relevance filtering for keyword-based engines (#3403)
* feat: add needs_reranking flag to auto-enable LLM relevance filtering for keyword-based engines
Engines with poor native relevance ranking (arXiv, PubMed, Wikipedia, GitHub, Mojeek, etc.) now auto-enable LLM-based result filtering via a new `needs_reranking` class attribute. This fixes the priority bug where the global `skip_relevance_filter=True` incorrectly overrode auto-detection for engines that genuinely need filtering.
Priority is now: per-engine setting > needs_reranking > global skip. The global skip only affects unclassified engines.
Closes #2297
* fix: address 7 code-review issues on needs_reranking branch
1. Rename needs_reranking → needs_llm_relevance_filter for consistency with enable_llm_relevance_filter and skip_relevance_filter naming
2. Fix Paperless dead code: replace non-existent _apply_content_filters with proper _filter_for_relevance() call in custom run() override
3. Fix misleading skip_relevance_filter description to accurately reflect checkbox behavior and keyword engine exceptions
4. Delete 4 vacuously-true inline tests that duplicated factory logic instead of calling the real factory (coverage tests already exist)
5. Add needs_llm_relevance_filter to EXTENDING.md and OVERVIEW.md
6. Clarify is_generic comment: generic does not imply good ranking
7. Upgrade no-LLM log from debug to warning when filtering was requested but no LLM is available (with should_filter guard)
* fix: remove Paperless fallback that overrode valid empty LLM filter results
Replace the fallback that restored all previews when the LLM filter returned empty with an info log. The base class _filter_for_relevance() already handles errors internally (returns previews[:5] on exception or JSON parse failure). An empty result means the LLM legitimately found nothing relevant, so trust it, don't override it.
* refactor: rename needs_llm_relevance_filter → is_lexical
The flag describes what the engine IS (lexical/keyword-based search) rather than what it needs.
This is a general classification that can drive multiple behaviors beyond just the relevance filter, e.g. query optimization strategies, result deduplication, or UI hints. Matches the existing is_* naming pattern (is_scientific, is_generic).
* Revert "refactor: rename needs_llm_relevance_filter → is_lexical"
This reverts commit |
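The priority order stated in the first commit of this PR (per-engine setting > class flag > global skip) can be sketched as a small resolver. The function and parameter names are hypothetical, and the default for unclassified engines (filter unless the global skip is set) is an assumption, since the message does not spell it out.

```python
from typing import Optional


def should_filter_for_relevance(
    per_engine_setting: Optional[bool],
    engine_is_keyword_based: bool,
    global_skip: bool,
) -> bool:
    """Resolve the three inputs in the stated priority order."""
    if per_engine_setting is not None:
        # 1. An explicit per-engine setting always wins.
        return per_engine_setting
    if engine_is_keyword_based:
        # 2. The class-level flag beats the global skip.
        return True
    # 3. The global skip applies only to unclassified engines (assumed default).
    return not global_skip
```

For example, a keyword-based engine with no per-engine setting still filters when the global skip is on, which is the bug the first commit describes fixing.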
||
|
|
05b96fbe3f |
refactor: move engine module paths from settings DB to hardcoded registry (#2843)
* refactor: move engine module paths from settings DB to hardcoded registry
Engine implementation details (module_path, class_name, full_search_module, full_search_class) are internal wiring, not user configuration. Storing them in the settings DB created a security attack surface requiring blocklist validation and route blocking.
Changes:
- New engine_registry.py with frozen dataclass entries for all 24 engines
- search_engines_config.py injects registry data after loading DB settings
- search_engine_factory.py passes engine_config to full search wrapper
- Remove ~52 module/class entries from 9 JSON defaults files
- Remove BLOCKED_SETTING_PATTERNS, is_blocked_setting(), and 4 call sites
- Remove absolute→relative normalization from module_whitelist.py
- Update docs, tests, and golden master
* fix: remove TestGetBlockedSettingsError that references removed function
The get_blocked_settings_error() function was removed as part of the engine registry refactor. This test class was added on main after the PR was created and wasn't caught by conflict resolution.
* fix: remove TestSaveSettingsPostBlockedSetting that tests removed blocking logic
BLOCKED_SETTING_PATTERNS and is_blocked_setting() were removed as part of the engine registry refactor. This test was added on main and references the now-removed blocking behavior.
* fix: inject ENGINE_REGISTRY into parallel/meta engine _get_search_config()
Both ParallelSearchEngine and MetaSearchEngine manually extract config from settings_snapshot without going through search_config(). Since module_path/class_name are no longer in the settings DB (they live in the hardcoded registry), these engines would silently fail to discover sub-engines on fresh installations.
Fix: inject ENGINE_REGISTRY values after extraction, matching the pattern used in search_config().
Also fixes MetaSearchEngine's stale check for "search.engine.auto.class_name" in settings_snapshot. This key no longer exists in settings DB, so auto engine config would be skipped.
* fix: update tests for engine registry refactor
- test_whitelist_config_consistency: check ENGINE_REGISTRY instead of JSON defaults (module_path/class_name no longer in defaults)
- test_meta_search_engine_high_value: expect registry-injected module_path/class_name in _get_search_config() output
- test_meta_search_engine_extended: registry overwrites snapshot values
- test_settings_routes_coverage: remove blocked setting tests (blocking logic removed; registry is now the security mechanism)
- test_settings_routes_deep_coverage2: same as above
* fix: add 5 missing engines to registry, strip module_path from their settings
Add gutenberg, openlibrary, pubchem, stackexchange, and zenodo to ENGINE_REGISTRY (were added to main in #1540 after this branch diverged). Remove module_path/class_name from their settings JSON files and golden master, matching the pattern established for all other engines.
Expand test_engine_registry.py to scan per-engine settings_*.json files and verify no settings files still contain module_path/class_name.
* fix: inject full_search_module/class in meta/parallel engine _get_search_config()
The registry injection in MetaSearchEngine and ParallelSearchEngine was missing full_search_module and full_search_class fields, making it inconsistent with the main search_config() injection. This would cause full-search wrappers to fail when created through meta/parallel engines.
* fix: resolve pre-commit formatting issues and sync pdm.lock after merge with main |
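The registry-injection pattern described in this PR can be illustrated with a minimal sketch. The four field names come from the message; everything else (``EngineEntry``, ``inject_registry``, the ``example`` entry and its paths) is hypothetical and not taken from engine_registry.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineEntry:
    """Immutable wiring for one engine; instances cannot be mutated."""

    module_path: str
    class_name: str
    full_search_module: Optional[str] = None
    full_search_class: Optional[str] = None


# Hypothetical entry; real module paths and class names are not shown above.
ENGINE_REGISTRY = {
    "example": EngineEntry(
        module_path=".engines.example", class_name="ExampleEngine"
    ),
}


def inject_registry(name: str, db_config: dict) -> dict:
    """Return a copy of the DB-loaded config with registry wiring applied."""
    entry = ENGINE_REGISTRY[name]
    merged = dict(db_config)
    # Registry values overwrite whatever the settings snapshot supplied.
    merged["module_path"] = entry.module_path
    merged["class_name"] = entry.class_name
    if entry.full_search_module and entry.full_search_class:
        merged["full_search_module"] = entry.full_search_module
        merged["full_search_class"] = entry.full_search_class
    return merged


config = inject_registry(
    "example", {"module_path": "untrusted.module", "api_key": "k"}
)
```

Because the registry is applied after the DB values are read, a stale or tampered ``module_path`` in settings has no effect, while ordinary user settings such as ``api_key`` pass through.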
||
|
|
890c84e534 |
docs: link auto-generated Configuration Reference across docs & fix stale env var docs (#2472)
- Add "Config Reference" link to Settings page "Learn & Get Help" bar
- Overhaul docs/env_configuration.md: remove stale Dynaconf references, fix wrong double-underscore env var format, remove documented-as-fixed bug, replace duplicate tables with links to CONFIGURATION.md
- Fix broken case-sensitive link in docs/deployment/unraid.md
- Add CONFIGURATION.md cross-references to 12 docs' "See Also" sections
- Update .env.template with correct LDR_-prefixed variable names
- Add config reference comment to docker-compose.yml environment block |
||
|
|
465b0f3e9e |
docs: Add architecture, extension guide, and troubleshooting documentation
Add comprehensive documentation for contributors and users:
- docs/architecture/OVERVIEW.md: System architecture with Mermaid diagrams covering components, research flow, threading model, and configuration
- docs/architecture/DATABASE_SCHEMA.md: Complete database schema with ER diagram documenting all 40+ models
- docs/developing/EXTENDING.md: Extension guide for adding custom search engines, strategies, LLM providers, and LangChain retrievers
- docs/troubleshooting.md: Common issues and solutions for LLM, search, database, WebSocket, Docker, and API problems |