mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-16 20:10:39 +03:00

Files

LearningCircuit e79a9fb76a docs(resource-cleanup) + fix: Round 9 audit results + per-user lock-dict cleanup (#4077 )

* docs(resource-cleanup): Round 9 audit results + conditional deferred fixes

Capture the Round 9 broader-resource-audit results so:
- Future contributors don't re-audit the same paths
- If a relevant production symptom ever appears, the doc points
  directly at a pre-thought-through conditional fix

Round 9 ran two passes of three parallel agents looking for resource
leaks BEYOND FDs (memory/cache growth, thread/lock lifecycle, DB state
hygiene). Round 1 produced six HIGH-confidence findings; Round 2
verification refuted four of them and downgraded one.

Added to the audit ledger (next to the existing Wave 7 entry):

- Refuted findings with WHY they were refuted:
  - @cache on get_available_providers (called with None — hashable,
    cardinality 1; dicts would raise TypeError, not cache silently)
  - ThreadLocalSession identity-map growth (expire_on_commit=True
    default clears the map on every commit)
  - token_usage table unbounded growth (design-intentional permanent
    audit table; time-series compound indexes; /api/context-overflow
    queries historical windows)
  - search_calls table unbounded growth (same shape and verdict)

- Three per-user lock dicts (_user_init_locks, _user_locks,
  _user_critical_locks): technically correct that they never clean
  up on user delete, but ~296 bytes per user × 3 dicts = ~900 KB
  ceiling at 1000 users. Practically negligible.

- app_logs (ResearchLog) table — the one finding that survived
  verification as a real but small concern. No auto-retention; only
  cleaned by cascade-delete when parent Research row is manually
  removed. For users keeping all research, logs accumulate.

Added to "Intentionally not done (deferred)":

- app_logs retention setting + scheduled cleanup job. Includes the
  trigger conditions that would justify the work and the
  implementation sketch (settings key, daily APScheduler job,
  regression test, news fragment).

- Per-user lock dict cleanup on user delete. Cosmetic; included with
  trigger conditions and one-line-per-file sketch so it's actionable
  if multi-user deployments ever see it.

No code changes. Documentation only.

* fix(resource-cleanup): pop per-user lock-dict entries on user close

Three module-level per-user lock dicts had no removal hook, so each
accumulated one ``threading.Lock`` entry per username over the
process lifetime:

- ``_user_init_locks`` in ``database/library_init.py`` (serializes
  collection-init check-then-insert)
- ``_user_locks`` in ``database/backup/backup_service.py`` (per-user
  backup serialization)
- ``_user_critical_locks`` on ``QueueProcessorV2`` (per-user
  count-then-start critical section)

The ceiling was ~296 bytes/entry × 3 dicts ≈ ~900 bytes per user
across all three — bounded by total user count, microscopic relative
to the eventpoll FD leak that motivated the original investigation,
but real for long-lived multi-user instances with user-account
churn. Identified in Round 9 of the broader resource-leak audit
(see docs/developing/resource-cleanup.md).

The fix:

- Each module now exposes a ``pop_user_*_lock(username)`` function
  (or method for the QueueProcessor instance dict) that pops the
  entry under the existing per-dict lock.
- ``connection_cleanup._pop_per_user_locks(username)`` is a shared
  helper that lazy-imports and calls all three with individual
  try/except blocks so a failure in one doesn't skip the others.
- The helper is invoked from both user-close paths:
  - ``cleanup_idle_connections`` (the 5-minute sweeper) after
    ``db_manager.close_user_database``
  - ``web/auth/routes.py`` after the logout and password-change
    ``close_user_database`` calls

Pattern mirrors the existing per-user cleanup in those code paths
(scheduler.unregister_user, session_password_store.clear_*).

Tests in ``tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks``:

- Unit test: populate all three dicts, call the helper, assert all
  three entries are gone.
- Idempotency: pop on a never-registered user must not raise.
- Integration: ``cleanup_idle_connections`` actually invokes the
  helper for each user it closes (verified via the library-init dict
  for "alice").

Doc updated: the entry that R9A2 identified as "technically correct,
practically negligible" is moved from the audit ledger's findings
list into a "Fixed in this PR" subsection; the matching
deferred-fix entry in "Intentionally not done" is removed.

Adds a towncrier bugfix fragment.

* review: address self-review findings on PR #4077

Four fixes from a multi-round agent review of this PR:

1. Move ``_pop_per_user_locks`` outside the ``close_user_database``
   try/except in ``connection_cleanup.py``. Previously the pop was
   inside the same try block, so a DB-close failure — the path that
   ``test_close_failure_does_not_abort_loop`` already exercises —
   skipped the pop and leaked the lock-dict entry. New test
   ``test_pop_runs_even_when_close_user_database_fails`` pins this.

2. Bump three ``logger.debug`` to ``logger.warning`` in
   ``_pop_per_user_locks``. Matches the sibling scheduler-unregister
   error handler in the same module; debug-level silently masked
   lock-dict accumulation across cycles.

3. Doc accuracy fix in ``docs/developing/resource-cleanup.md``.
   The entry called all three dicts "module-level" but
   ``_user_critical_locks`` is an instance attribute on
   ``QueueProcessorV2``. Rewrote to distinguish module-level dicts
   from the instance attribute and to note the pop now runs outside
   the close try/except.

4. Integration test pre-populates all three lock-dict entries and
   asserts all three are absent post-cleanup, not just
   ``_user_init_locks``. Switched the test username from "alice"
   (used by other tests in the module) to a dedicated sentinel.

Tests: 23/23 in ``tests/web/auth/test_connection_cleanup.py``.

Race-condition concerns flagged in Round 1 (TOCTOU between pop and
``_get_user_*_lock``) were verified in Round 2 to be either guarded
by Python's reference-counted lock semantics (``with lock:`` keeps
the original lock object alive after dict pop) or bounded to a +1
race window across multiple browser sessions — not ship-blocking.
The narrow ``library_init`` race surfaces as a propagated
``IntegrityError`` on collection insert, not silent corruption.

2026-05-17 15:40:11 +02:00

44 KiB

Raw Blame History

Resource cleanup in LDR

This document captures how LDR manages process-level resources (DB connections, HTTP clients, file descriptors, threads) and the reasoning trail behind the current model. It exists because file-descriptor exhaustion has been a recurring class of bug in LDR, and the journey of fixing it — what's been tried, what worked, what was ruled out — is not reconstructable from git log alone.

If you're contributing code that holds a network connection, a database session, an LLM client, or a thread, read this before adding __del__, weakref.finalize, or a context manager.

Current model

Database connections

One shared per-user QueuePool. No per-thread engines. Pool sizing: pool_size=20, max_overflow=40, with periodic dispose() every 30 minutes.
SQLCipher is decrypted once per connection-open. PRAGMA key takes ~0.2 ms; pool reuse keeps that off the hot path.
Engines are created at login, closed at logout (or process exit via the registered atexit shutdown).
Background threads (research workers, metric writers, news scheduler jobs) use the same per-user pool — they no longer maintain a separate thread-engine system.

See ADR-0004 for the QueuePool-vs-NullPool decision and PR #3441 for the per-thread-engine removal.

LLM wrappers

LDR wraps every LLM in ProcessingLLMWrapper (and optionally RateLimitedLLMWrapper) so that callers see a uniform interface and the project owns the close path:

caller -> ProcessingLLMWrapper.close()
       -> _close_base_llm(base_llm) in utilities/llm_utils.py
       -> for ChatOllama:
            sync httpx client (ollama.Client._client) .close()
            async httpx client (ollama.AsyncClient._client) .aclose()
       -> for ChatOpenAI / ChatAnthropic:
            no close (those use @lru_cache'd shared httpx clients)

Key invariants:

ChatOllama is the only provider where _close_base_llm() actually closes anything. ChatOpenAI and ChatAnthropic share LRU-cached httpx clients across instances; closing them would break other live LLMs.
Both _client (sync) and _async_client (async) are released — the async side is exercised by every ainvoke() call (langgraph agents, modular strategies). Closing only the sync side leaks the async transport per call (root cause of #3816).
The function is idempotent via an _ldr_closed sentinel on the inner httpx clients.
The async close uses asyncio.run(client.aclose()) only when no event loop is currently running. When called from inside async code it skips and leaves the close to the loop's owner.

Search engines

BaseSearchEngine.close() is the single entry point and cascades into _preview_filters and _content_filters. That cascade is what releases per-engine LLMs (e.g., JournalReputationFilter.model), SearXNG sessions, and other filter-held resources.
Search-engine cleanup happens at the per-research finally block in web/services/research_service.py:run_research_process() and at the programmatic API entry points in api/research_functions.py.
The _owns_llm flag pattern (introduced in #2712) tracks whether a filter or engine constructed its own LLM (and thus owns it) versus borrowed one from a caller (and must not close it).

Thread lifecycle

@thread_cleanup (decorator on run_research_process and similar workers) ensures thread-local DB sessions are released even on abnormal exits.
cleanup_current_thread() is called from Flask teardown, the queue processor, the auth flow, and the RAG routes — six tier-1 paths in total.
Background threads are daemon threads; the process exit handles any thread that did not clean up gracefully.

Conventions

Use safe_close(resource, "human name") from utilities/resource_utils.py for every cleanup. Never bare .close() in a finally (it can mask the original exception).
Prefer try/finally over __del__. Python doesn't guarantee finalization order at interpreter exit; __del__ interacts subtly with reference cycles and weakref.
Track ownership explicitly with _owns_llm (or analogous flag) when a class accepts an injected resource that may or may not be its own.
News fragments (changelog.d/<id>.bugfix.md) are required for any user-visible cleanup behavior change — see changelog.d/README.md.

How to close X correctly

You're holding	Do this
A `ChatOllama` (raw or wrapped)	Call `wrapper.close()` in a `finally`, or pass to `safe_close(wrapper, "...")`. The wrapper chain handles both sync and async httpx clients.
A search engine you constructed	`safe_close(engine, "...")` in `finally`. The engine's `close()` cascades into preview/content filters.
A holder class with an LLM	Add a `close()` method, gate the LLM close on `self._owns_llm`, document who calls it. Don't add `__del__`.
A long-lived service holder (news scheduler, etc.)	Wrap construction in `try/finally` at the cycle boundary. Don't store the LLM if you can recreate it cheaply.
A DB session	Use `with get_user_db_session(username) as session:`. Don't bypass via `get_settings_manager(username=...)` without `owns_session=False` (see #3023).
An asyncio event loop	Use the existing one. If you genuinely need a new one (background thread fallback), call `loop.close()` in a `finally` — see `news_strategy.py` for the reference pattern (post-#3018).

Anti-patterns

These look reasonable but break specific things in this codebase:

Adding __del__ to a class with close(). At interpreter exit the logger, httpx, and event-loop modules may already be torn down. __del__ can run after them and raise. Use explicit close in a finally instead.
Closing a shared httpx client. ChatOpenAI / ChatAnthropic share one httpx pool across instances via @lru_cache. Closing it kills every other live LLM in the same process. The Ollama check in _close_base_llm exists exactly to gate this.
Truthy idempotency sentinels on Mock objects. Mock() without a spec auto-generates child Mocks for any attribute access, so getattr(client, "_ldr_closed", False) returns a truthy Mock and short-circuits the close. Always use is True / is None checks for sentinels — see the pattern in _close_base_llm.
Skipping super().close() in a search-engine subclass. BaseSearchEngine.close() is what cascades into preview/content filters. Override it without calling super and you leak every filter's resources (this was a Copilot finding on #3818).
Treating asyncio.run() as safe inside an event loop. It raises RuntimeError if called from a thread that already has a running loop. The pattern in _close_base_llm is: detect a running loop with get_running_loop(), skip the async close in that branch (the loop owner will close), only call asyncio.run in the no-loop case.

History

The FD-leak campaign spans roughly four months of iterative work. Each fix narrowed the remaining surface; each subsequent issue was found in a corner the previous wave hadn't touched.

Wave 1 — initial leak inventory (Jan 2026)

#1832, #1849, #1856, #1860 — first comprehensive sweep. Identified seven distinct leak sources: auth_db engine, download_management DB, search cache, subprocess zombies, HTTP sessions in SemanticScholarSearchEngine and BaseDownloader, Socket.IO threads. Established context-manager + try/finally patterns. Added a pre-commit hook to catch missing cleanup at commit time.

Wave 2 — thread-local engine accumulation (Mar 2026)

#2495 — diagnosed that Flask's teardown only cleaned the request-scoped g.db_session while a separate _thread_engines dict accumulated NullPool engines per thread, leaking ~3 FDs per request. Added cleanup_current_thread() across six tier-1 paths.
#2591 — dead-thread engines (when threads crashed they left engines behind) plus stream=True socket holds in the generic downloader. Added a throttled dead-thread sweep, removed stream=True, raised the Docker ulimit from 1024 to 65536.

Wave 3 — LLM wrapper lifecycle (Mar 2026)

#2708 — diagnosed ChatOllama → httpx.Client chains with no __del__. With the news scheduler triggering 50–300 quick_summary() calls per hour, a 1024-FD container exhausted in 3–4 hours. Wrapped four programmatic API entry points in try/finally with explicit close.
#2712 — extracted close_llm() to a shared utility. Added close() and _owns_llm to NewsAnalyzer, HeadlineGenerator, TopicGenerator, JournalReputationFilter, DomainClassifier, GitHubSearchEngine, IntegratedReportGenerator, ElasticsearchSearchEngine, and the benchmark graders.
#2756 — wrapped bare .close() calls in finally blocks with safe_close() to prevent masking the original exception.
#2732 — moved close() into ProcessingLLMWrapper and RateLimitedLLMWrapper directly; eliminated the standalone close_llm() free function.

Wave 4 — DB session leaks + per-call patterns (late Mar / early Apr 2026)

#3018 — get_settings_manager(username=...) was bypassing g.db_session and creating QueuePool sessions per-thread; live diagnostics showed 321 sockets allocated, only 66 in use. DownloadService.close() leaked the inner SettingsManager session. Also fixed TopicBasedRecommender._create_recommendation_card() (per-call LLM with no cleanup) and an asyncio.new_event_loop() in news_strategy.py that never closed.
#3204 — test fixtures using return instead of yield left engines un-disposed. Migrated 8 test files to yield + engine.dispose().

Wave 5 — DB pool architecture (Apr 2026)

#3340 — kept QueuePool but minimized FDs (pool_size=1, max_overflow=2, periodic dispose() every 30 min).
#3337 (closed) — proposed switching SQLCipher engines to NullPool for zero persistent FDs. Superseded by #3441.
#3441 — removed per-thread NullPool engines entirely (~2,100 lines of sweep logic deleted) and routed metrics through a single shared per-user QueuePool with bounded sizing (pool_size=20, max_overflow=40).
#3477 — created ADR-0004 capturing the final pool model and updated stale FD calculations across docs.

Wave 6 — async client close (May 2026)

#3818 (open, declined for merge) — proposed session-pooling around safe_get/safe_post to address #3816. The session refactor is reasonable in isolation, but the lsof in #3816 showed ~72% of leaked FDs as a_inode [eventpoll] selectors, not HTTP request sockets — pointing at async-client transports rather than safe_get callers (whose response bodies were already consumed). See the PR comment for the full reasoning.
#3855 — extended _close_base_llm() to also close ChatOllama._async_client (the actual gap the lsof pointed to). Added the IntegratedReportGenerator close that was missing from the per-research finally block. Idempotency via _ldr_closed sentinels on the inner httpx clients.

Wave 7 — async close inside a running loop (May 2026)

#4047 — _close_base_llm's async branch had a documented "skip if a loop is running; loop owner closes" path. No loop-owner cleanup code existed anywhere in the project, so when the close was called inside an active asyncio loop the inner httpx.AsyncClient (and its epoll_create FD) was silently abandoned. Reproduced in production: a v1.6.10 single-host Ollama container reached 1024 FDs with the /proc histogram showing 929 anon_inode:[eventpoll] (91%) — the same FD class as #3816 but in a code path #3855's fix didn't cover. The fix runs the async close in a brief daemon thread that owns its own loop, so asyncio.run(aclose()) works regardless of the caller's loop state. A bounded 5-second join keeps the cleanup from blocking shutdown when the Ollama server is unresponsive; on timeout _ldr_closed is left unset so a later call retries, and a WARNING surfaces so the situation is observable instead of silent.
Healthcheck pidfd leak (same PR). Dockerfile's HEALTHCHECK CMD python -c "... urllib.request.urlopen(...)" had no timeout= argument; Docker's 10s timeout SIGKILL'd the sh -c parent but the python child was reparented to PID 1 and hung forever, each surviving child holding a pidfd + TCP socket against the app. Same /proc dump showed 64 anon_inode:[pidfd] (6%) from this. Adding timeout=8 lets the child return/raise inside Docker's budget so it exits cleanly and gets reaped.

Audit ledger — what the broader sweep checked

The PR included a wide audit (50+ parallel exploration agents across seven rounds plus direct /proc inspection) to catch any other latent FD leak. To save the next contributor from re-running the same checks, here is the full ledger:

Checked and confirmed clean (no action needed)

Non-Ollama LLM providers. xAI, Google Gemini, OpenRouter, IONOS, LM Studio, llama.cpp HTTP, DeepSeek, OpenAI-compatible endpoint, plus OpenAI and Anthropic themselves. All extend ChatOpenAI or ChatAnthropic, which use @lru_cache'd shared httpx clients. _close_base_llm's short-circuit on these classes is correct by design — closing them would brick every other live LLM in the process.
HTTP session lifecycle. Six instantiation sites checked (PricingFetcher aiohttp, LDRClient SafeSession, BaseDownloader, SemanticScholarSearchEngine, MCPClient, CostCalculator). All context-managed via with or owned by a class with a paired close() and __exit__.
subprocess / pidfd. Three call sites, all subprocess.run() (blocking). No subprocess.Popen paths anywhere in src/. No ProcessPoolExecutor. No FD leak surface beyond the healthcheck child, already addressed by the Dockerfile timeout=8 change.
asyncio event loops. Zero raw asyncio.new_event_loop() outside safe asyncio.run() patterns. The historical leak in news_strategy.py (#3018) is still fixed.
File handles. All 37 open() call sites are inside with. Zero bare opens. tempfile.NamedTemporaryFile / TemporaryDirectory all context-managed.
SocketIO connect/disconnect. Non-disconnect handlers (subscribe, unsubscribe, connect) do not acquire DB sessions (an early-round agent claim that they did was refuted on re-read). The __socket_subscriptions dict is cleaned on disconnect. The PID-1 FD breakdown showed only 3 sockets out of 1024 — socket accumulation is not a contributor.

Flagged by audit, then verified NOT a real FD leak

OllamaEmbeddings httpx. LDR imports the deprecated langchain_community.embeddings.OllamaEmbeddings, which uses requests.post() per call — no persistent httpx client, no _client / _async_client attribute. Direct introspection: [a for a in dir(e) if 'client' in a.lower()] returns []. Zero FDs per call. An audit agent confused this class with ChatOllama, which is a different class. See the migration note in the next subsection — this changes when langchain forces the move.
auth_db and journal_quality engines escaping shutdown_databases(). auth_db uses QueuePool(pool_size=10, max_overflow=20) and journal_quality uses StaticPool with immutable=1. Both are bounded and do not grow at runtime. Live /proc on the affected container showed only 21 SQLite-related FDs total on PID 1 — well below the ~91-FD ceiling these unmanaged engines could theoretically reach. The kernel reclaims FDs at process exit regardless of engine.dispose(), and SQLite WAL files auto-checkpoint on next open. Missing dispose at exit is hygiene, not a leak.
LibraryRAGService in three RAG SSE endpoints. rag_routes.py:693, 1054, 1827 do construct the service outside the generator and never close it, but LibraryRAGService.close() only sets references to None — it releases no FDs. FAISS uses pickle.load() (not mmap); OllamaEmbeddings holds no FDs per the item above; the SentenceTransformer model+tokenizer mmaps are process-wide singletons. What gets delayed is ~50–200 MB of embedding-model RAM until GC. A memory-pressure question, not the eventpoll FD class this Wave addressed.
Residual pidfd accumulation via Playwright fallback — identified in a Round-8 follow-up after the eventpoll fix landed. Live /proc on the prerelease container showed ~29 pidfds steady state, growing ~3.6/hour, all targeting Pid: -1 (children that had exited). Rate was stable during active benchmark execution, ruling out a per-task source. Eight parallel agents converged on the same chain: _check_subscription → quick_summary → FullSearchResults.batch_fetch_and_extract → AutoHTMLDownloader fallback to PlaywrightHTMLDownloader._fetch_with_playwright. Each sync_playwright().start() invokes asyncio.create_subprocess_exec() for the Node.js driver (opens a pidfd via Linux's PidfdChildWatcher); the driver then fails because Chromium is not installed in the production ldr Dockerfile stage (only ldr-test runs playwright install --with-deps chromium), and the asyncio child watcher does not promptly close the pidfd on the failed-child exit. CPython 3.14 was confirmed to not use pidfd in subprocess.py at all (subprocess.run/Popen use waitpid(WNOHANG) polling), so subprocess-based hypotheses were ruled out. Fixed by PR #3971 (default web.enable_javascript_rendering=false): the fallback short-circuits before any subprocess is spawned, so no pidfd is opened. The PR was motivated by issue #3826 (confusing tracebacks); the FD-leak finding is the second motivation, surfaced here.

Minor findings (not steady-state leaks; worth knowing)

Daemon threads without explicit shutdown. search_cache.py cache-cleanup thread, journal_reputation_filter.py background fetcher, log_utils.py queue processor, parallel_search_engine.py global executor. All daemonized — reaped by the OS at process exit. Not steady-state leaks; no per-request growth.
Abandoned-research thread on socket disconnect. If a client closes the tab mid-research, the socket subscription is removed but the research thread keeps running until completion; _active_research[research_id] is not cleared on disconnect. Not an FD leak; potentially compute/memory waste if the user wanted the research to stop. Out of scope for the FD-leak story.

Future-proofing note — `langchain_ollama.OllamaEmbeddings` migration

langchain_community.embeddings.OllamaEmbeddings is deprecated ("will be removed in langchain 1.0.0", per the import warning). Its replacement, langchain_ollama.OllamaEmbeddings, does carry _client and _async_client attributes — same shape as ChatOllama. Verified by direct introspection:

langchain_ollama.OllamaEmbeddings client attrs:
  ['_set_clients', 'async_client_kwargs', 'client_kwargs',
   'sync_client_kwargs']
Has _client?       True
Has _async_client? True

So: today's LDR is FD-safe on the embeddings side by accident (using the deprecated class). The day the deprecated class is removed upstream and LDR migrates, the in-running-loop eventpoll FD leak class will reappear for embeddings unless _close_base_llm is generalized to introspect either a chat model or an embeddings instance with the same _client / _async_client shape. The extension is a small type-check broadening, not a redesign.

Round 9 — broader resource audit (May 2026)

Once the FD-leak classes were closed, a follow-up audit looked for other slow-growth patterns that wouldn't trip the FD counters but could still degrade a long-running container: memory and cache growth, thread / asyncio Task / lock lifecycle, DB state hygiene beyond connections. Three parallel agents per round, two rounds (Round 1 hypothesis generation, Round 2 fact-check), captured here in verified form so the next contributor doesn't re-derive the same conclusions.

Refuted (false positives from Round 1, verified in Round 2)

@cache on get_available_providers (config/llm_config.py:158). Round 1 claimed unbounded cache growth if the function were called with differing settings_snapshot dicts. Round 2 verified: dicts are unhashable, so @cache would raise TypeError on them, not silently grow. In practice the call sites pass settings_snapshot=None (hashable, cardinality 1). Not a leak.
Thread-local Session identity-map growth (database/thread_local_session.py). Round 1 claimed long-running research threads would accumulate ORM objects in the per-thread Session's identity map. Round 2 verified: SQLAlchemy's default expire_on_commit=True clears the identity map at every commit; the codebase commits periodically. Bounded by typical query volume, not unbounded by uptime.
token_usage table unbounded growth. Append-only per LLM call with no TTL or retention job. Round 2 verified: feature by design. Schema has compound time-series indexes (idx_token_research_timestamp, etc.); /api/context-overflow and /metrics/api/metrics explicitly query historical windows for cost analysis. The table is a permanent audit trail by intent. Adding retention would break the metrics dashboards.
search_calls table unbounded growth. Same shape and same verdict — compound time-series indexes confirm intentional design as a permanent search-analytics record.

Fixed in this PR — three per-user lock dicts

Three per-user lock dicts — _user_init_locks and _user_locks are module-level dicts in database/library_init.py and database/backup/backup_service.py respectively; _user_critical_locks is an instance attribute on the QueueProcessorV2 singleton in web/queue/processor_v2.py. Each stored one threading.Lock per username with no removal hook. Bounded ceiling (~296 bytes/entry × 3 dicts at 1000 users = ~900 KB), so not urgent — but easy to fix cleanly. The two module-level dicts now expose pop_user_init_lock / pop_user_lock functions; the queue processor exposes the equivalent as an instance method queue_processor.pop_user_critical_lock. A shared _pop_per_user_locks(username) helper in connection_cleanup.py calls all three with lazy imports and individual try/except (WARNING-level so dict accumulation is observable, matching the sibling scheduler-unregister error path). The helper is invoked unconditionally — outside the close_user_database try/except so it still runs when the DB close itself fails — in both the idle-connection sweeper (connection_cleanup.py:cleanup_idle_connections) and the logout / password-change paths (web/auth/routes.py). Tests in tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks cover the helper directly and through the idle-close path.

Real but small (survives verification)

app_logs (ResearchLog) table — no automatic retention. Grows by ~100s-1000s of rows per research. Cleaned only via cascade-delete when the parent Research row is deleted manually. Unlike token_usage / search_calls, this table has no UI dashboard or time-series API consuming it — it's debug context for a specific research session, not an analytics record. For users who keep all research, logs accumulate indefinitely. See "Intentionally not done (deferred)" for the retention design when a symptom report justifies it.

Debugging FD leaks — playbook for the next one

When the next FD leak shows up (and there will be one, eventually), this section is the shortcut. It captures the actual diagnostic flow that worked across Waves 6 and 7 so a future contributor doesn't have to re-derive it from the symptom.

0. Symptoms that mean "investigate this as an FD leak"

Tracebacks like OSError: [Errno 24] Too many open files, typically from selectors.DefaultSelector() in werkzeug or send_from_directory in Flask. These are usually the first visible failure.
Browser-side MIME-type errors on static assets (text/html instead of text/css / application/javascript). These are downstream of FD exhaustion — Flask can't open the static file, returns an HTML 500, and the browser refuses to apply it because of X-Content-Type-Options: nosniff.
High FD count (N) — approaching system limit warnings from web/auth/connection_cleanup.py (fires at FD > 800 every 5-minute cleanup tick).
Container health turns unhealthy because the healthcheck urlopen hangs on a process that no longer has FDs to accept connections.

1. Capture diagnostic state BEFORE restarting

The single most important rule: the snapshot does not survive a container restart. Every minute spent on the live broken container is worth an hour of after-the-fact agent guessing. Save the diagnostic output to a host-side file first.

One-shot host-side snapshot (works even when the container is

FD-starved enough that docker exec can't fork)

# Run on the Docker host. No docker exec required.
P=$(docker inspect -f '{{.State.Pid}}' <container-name>)
sudo bash -c "
  echo '=== Total FDs ==='
  ls /proc/$P/fd | wc -l
  echo '=== FD-type histogram (digits collapsed) ==='
  ls -l /proc/$P/fd | awk '{print \$NF}' \
    | sed -E 's/\[[0-9]+\]/[N]/g; s/[0-9]{4,}/NUM/g' \
    | sort | uniq -c | sort -rn | head -30
  echo '=== Counts by category ==='
  printf 'socket:     %s\n' \$(find /proc/$P/fd -lname 'socket:*'     | wc -l)
  printf 'pipe:       %s\n' \$(find /proc/$P/fd -lname 'pipe:*'       | wc -l)
  printf 'eventpoll:  %s\n' \$(find /proc/$P/fd -lname '*eventpoll*'  | wc -l)
  printf 'pidfd:      %s\n' \$(find /proc/$P/fd -lname '*pidfd*'      | wc -l)
  printf 'WAL files:  %s\n' \$(find /proc/$P/fd -lname '*-wal'        | wc -l)
  printf 'SHM files:  %s\n' \$(find /proc/$P/fd -lname '*-shm'        | wc -l)
  printf '.db files:  %s\n' \$(find /proc/$P/fd -lname '*.db'         | wc -l)
" | tee /tmp/ldr-fd-snapshot.txt

Why host-side: reading the container's PID 1 FDs from inside the container requires the same UID that started PID 1. The Dockerfile entrypoint runs as root then setprivs to ldruser, so the docker exec shell (ldruser) cannot readlink PID 1's FDs even though it can count them. Host root via sudo sidesteps the UID check.

Inside-container alternative (if the host is locked down)

docker exec --user 0 <container-name> sh -c '...same body...'

--user 0 runs the exec'd shell as root inside the container, sidestepping the same UID restriction.

2. The lookup table — FD type → likely source

Dominant FD type	Likely source	Diagnostic deep-dive
`anon_inode:[eventpoll]`	`asyncio` event loop or `httpx.AsyncClient` selector. Each leaked async client = +1.	Grep `asyncio.create_subprocess`, `httpx.AsyncClient`, `_async_client`, `ainvoke`. See Wave 6, Wave 7.
`anon_inode:[pidfd]`	`asyncio.create_subprocess_*` or `multiprocessing.Process` (uses `pidfd_open` on Linux).	Read `/proc/PID/fdinfo/N` for each pidfd; the `Pid:` line shows the target (`-1` = child already exited).
`socket:*` (lots)	HTTP keep-alive, SSE streams, SocketIO connections.	Cross-reference with `/proc/PID/net/tcp` states; check Round 7 R7A8 patterns.
`pipe:*` (lots)	`subprocess.run`/`Popen` with `stdout=PIPE`, multiprocessing IPC, loguru queue.	Check `subprocess.run` sites and APScheduler executor type.
`REG` `-wal` / `-shm`	SQLCipher in WAL mode. Each pooled connection holds ~3 FDs.	See ADR-0004. If growing without bound, the periodic `engine.dispose()` is silently failing.
`REG /data/*.db` (lots)	Plain SQLite connections from an engine without bounded pool.	Audit `create_engine` sites (R7A6 caught two unmanaged ones).
`REG /home/...mmap...`	Memory-mapped model weights or FAISS indexes — usually process-wide singletons (not leaks).	Check whether the count grows per request. If yes → real leak.

3. Pinpointing the source for a specific FD type

Eventpoll

anon_inode:[eventpoll] always comes from EpollSelector — created by every asyncio loop and every httpx.AsyncClient. Grep:

grep -rn 'asyncio.create_subprocess\|httpx.AsyncClient\|_async_client' src/

Then check whether each site explicitly closes the client. The Wave 7 fix to _close_base_llm is the reference pattern for "close async httpx even when called inside a running loop."

Pidfd

Pidfds expose their target PID via fdinfo:

# Run inside the container (or via docker exec --user 0):
for fd in $(ls /proc/1/fd 2>/dev/null); do
  link=$(readlink /proc/1/fd/$fd 2>/dev/null)
  case "$link" in
    *pidfd*)
      tpid=$(awk '/^Pid:/ {print $2}' /proc/1/fdinfo/$fd 2>/dev/null)
      if [ "$tpid" -gt 0 ] 2>/dev/null; then
        cmd=$(tr '\0' ' ' < /proc/$tpid/cmdline 2>/dev/null | cut -c1-80)
        echo "fd=$fd alive pid=$tpid : $cmd"
      else
        echo "fd=$fd ORPHAN (child exited; pidfd not closed)"
      fi
      ;;
  esac
done

A high "ORPHAN" count = something called asyncio.create_subprocess_* or multiprocessing.Process, the child exited, but the pidfd in the parent was never closed. Common in Round-8: Playwright's Node.js driver subprocess failing because Chromium isn't installed in the production image.

Note: CPython 3.14's subprocess.py does not use pidfd at all (waitpid(WNOHANG) polling instead). So pidfds in a 3.14 process necessarily come from asyncio or multiprocessing, not from subprocess.run / Popen.

Syscall-level pinpointing with bpftrace (mysterious cases)

When the source isn't obvious from the FD type, bpftrace can record the Python stack of every relevant syscall on the live process. This would have caught the Playwright leak in seconds instead of two rounds of agent exploration. Requires kernel headers and bpftrace installed on the host (NOT the container — bpftrace runs in host kernel space and can target a host PID by number):

# Find host-side PID of container's PID 1
P=$(docker inspect -f '{{.State.Pid}}' <container>)

# Trace every pidfd_open syscall, grouped by user-stack:
sudo bpftrace -e "tracepoint:syscalls:sys_enter_pidfd_open
                  /pid == $P/ { @[ustack(perf)] = count(); }"

# Same idea for epoll_create / epoll_create1 (eventpoll FDs):
sudo bpftrace -e "tracepoint:syscalls:sys_enter_epoll_create1
                  /pid == $P/ { @[ustack(perf)] = count(); }"

Let it run for a minute, then Ctrl-C; you get a histogram of every unique stack that triggered the syscall, ranked by frequency. The hot stacks are your culprits. Works for any syscall — useful future candidates: socket, inotify_init1, timerfd_create, memfd_create.

WAL/SHM

engine.dispose() is expected to release these. If the count climbs across the periodic 30-minute dispose cycles, the dispose is silently failing. The observability commit (f86c3f7af) elevates dispose failures to WARNING — check the logs for Error disposing engine for <user>.

4. Existing instrumentation already in the codebase

_count_open_fds() at src/local_deep_research/web/auth/connection_cleanup.py:50 — fast /proc/self/fd-based counter with macOS fallback. Reusable.
Resource monitor: open_fds=… debug log line at connection_cleanup.py:184, fires every 5-minute cleanup tick.
High FD count (N) WARNING at connection_cleanup.py:190 when FDs exceed 800. The single most useful production signal.
fd_monitor.py (PR #3036) — cross-platform helper used by diagnostic endpoints.
tests/manual_smoke/test_fd_smoke.py (PR #3930, the test/manual-fd-smoke-suite branch) — opt-in pytest harness that runs the close cycle N times and asserts the FD count stays flat. Gated by RUN_MANUAL_SMOKE=1; not part of the default CI run because it needs a live Ollama and produces noise. Extend this suite when you ship a leak fix.
In-CI FD-growth canaries in tests/utilities/test_close_base_llm.py. These run on every PR:
- TestCloseBaseLLMRealHttpxAsync::test_no_fd_growth_across_repeated_close_cycles — guards the eventpoll FD class against Wave-6-shaped regressions.
- TestCloseBaseLLMRealHttpxAsync::test_no_fd_growth_when_closed_inside_running_loop — guards the Wave-7-shaped in-running-loop skip regression.
- TestAsyncioSubprocessFDBaseline::test_no_fd_growth_across_asyncio_subprocess_cycles — guards the pidfd FD class against the child-watcher leak shape.
- TestAsyncioSubprocessFDBaseline::test_no_fd_growth_when_subprocess_fails_to_exec — pins the exact Wave-7-pidfd shape (failed exec, child watcher must still clean up). Catches platform-level regressions in Python's asyncio child watcher.
All four use _open_fd_count() (also in that file) which reads /proc/self/fd on Linux with an RLIMIT_NOFILE fallback on macOS. Slack is +2 FDs across 5–10 iterations. A real per-cycle leak would blow past that.

4a. Development-time detection (catch leaks at test time)

Production /proc inspection catches leaks after they ship. The cheapest catch is to make Python itself complain at test time. Three Python features cooperate to surface unclosed resources during a normal test run — none of them were on by default during Waves 6 and 7, which is part of why those leaks made it to production.

PYTHONASYNCIODEBUG=1 plus -W default::ResourceWarning. When asyncio debug mode is on, unclosed transports/coroutines emit a ResourceWarning at GC time. The -W filter makes Python actually display them. Together they would have caught the Wave 7 in-running-loop skip: every leaked httpx.AsyncClient produces a visible warning the first time the GC sweeps after the test fixture exits. From the asyncio dev docs:

When a transport is no longer needed, call its close() method to release resources. ... If a transport or an event loop is not closed explicitly, a ResourceWarning warning will be emitted in its destructor.

To enable in pyproject.toml [tool.pytest.ini_options]:

filterwarnings = [
    "default::ResourceWarning",
]
env = [
    "PYTHONASYNCIODEBUG=1",
]

Or in CI for a one-off check:

PYTHONASYNCIODEBUG=1 python -W default::ResourceWarning -m pytest tests/

For a CI gate that fails on any leak (more aggressive — use only on a targeted subset of tests, not the whole suite, because third-party libraries also emit ResourceWarning):

filterwarnings = [
    "error::ResourceWarning",
]

python -X dev. Enables Python's dev mode, which turns on a bundle of safety checks including ResourceWarning display, asyncio debug mode, and warnings as default. Cheap one-flag alternative for local development; not recommended in production (overhead).

python -X dev -m pytest tests/

psutil for portable FD counting in tests. Our in-codebase _count_open_fds uses /proc/self/fd (Linux-fast path, macOS fallback). psutil is the cross-platform alternative many other projects use:

psutil.Process().num_fds() — Linux/BSD only; same number as our helper.
psutil.Process().open_files() — list of named files; gives the paths for REG-type FDs (e.g., /data/*.db-wal).
psutil.Process().connections(kind='all') — sockets visible to the process, with state and remote address.

These are useful in unit tests when you want to assert "no new file of pattern X is open after the close path runs," and they work on the macOS dev environments without /proc.

**For tracking which Python object holds a leaked FD: tracemalloc

objgraph.** Not FD tools per se, but when a leak is reproducible, take a tracemalloc snapshot before and after the suspect operation and diff — the new allocation is usually the wrapper holding the FD. objgraph.show_backrefs([leaked_obj]) then renders the reference chain keeping it alive. Both are pure-Python and zero-dependency.

5. Why we don't have an automated FD-growth test in CI

Several reasons, weighed during Wave 6 and Wave 7:

Per-request FD growth is hard to assert. Many legitimate request paths transiently open and close FDs; a noisy delta is the norm. Distinguishing "leak" from "in-flight" requires a stable quiescent state, which a CI test doesn't naturally provide.
The CI environment spawns its own subprocesses. pytest, coverage, gunicorn workers (for some test variants), gh-runner cleanups — all add their own FDs that pollute the count.
PID-namespace differences between CI and prod. Counts you observe in a CI container's /proc are not directly comparable to a production container's /proc; the subprocess sources differ.
The actual leaks have been "slow drip" patterns that need hours of uptime to surface. Wave 6's eventpoll leak took multiple hours of ainvoke calls to reach the 1024 cap. CI can't run for hours per PR.

What works instead:

Per-leak unit-level regression tests. Each fix in Waves 1-7 landed with a targeted test that exercises the specific close path (e.g. tests/utilities/test_close_base_llm.py::test_no_fd_growth_when_closed_inside_running_loop). These are fast, deterministic, and run on every PR.
Opt-in manual smoke suite (RUN_MANUAL_SMOKE=1) for the end-to-end "run-the-cycle-N-times-and-count" pattern, used during investigation but not on every CI run.
Production /proc inspection when a leak is suspected — the playbook above. Faster than CI for the long-drip patterns.

If you want to add a long-run CI job, the right shape would be a nightly workflow (not per-PR) that:

Builds the production Docker image.
Starts it with a synthetic user account and ~5 news subscriptions.
Lets it idle for 20-30 minutes.
Runs the host-side snapshot script above.
Asserts total FDs < N and eventpoll < M and pidfd < K, where the thresholds are tuned for the steady-state ceilings the codebase intentionally permits (auth_db pool, etc.).

That would have caught Waves 6, 7 in a single nightly cycle instead of through a user crash report. The reason it doesn't exist yet is cost (a half-hour idle job per night per platform) and the lack of a clear baseline; the Round-8 finding is the moment to consider adding one if you want to invest the maintenance time.

6. Lookup: which Wave fixed which leak class

FD class	Wave / PR	Root mechanism
`eventpoll`	Wave 6 #3855 + Wave 7 #4047	ChatOllama `_async_client` not closed (Wave 6) → also not closed when called inside a running loop (Wave 7).
`pidfd` from healthcheck	Wave 7 #4047	`urlopen` no `timeout=` → child hangs → reparented to PID 1 with pidfd held.
`pidfd` from Playwright fallback	Round 8 / #3971	Production image lacks Chromium binary; Playwright invocation opens pidfd then fails.
WAL/SHM accumulation	Wave 5 / ADR-0004	SQLCipher+WAL leaks handles on out-of-order close; periodic `engine.dispose()` resets the pool.
Per-thread engine FDs	Wave 5 #3441	Removed per-thread `NullPool` engines entirely; shared per-user `QueuePool`.
HTTP session sockets	Wave 1 / Wave 3	`SafeSession` / `BaseDownloader` close-in-`finally` discipline.
`asyncio.new_event_loop`	Wave 4 #3018	Replaced manual loop creation with `asyncio.run()` in `news_strategy.py`.

Use this table to skip the rediscovery step the next time a specific FD type dominates a snapshot.

Intentionally not done (deferred)

These showed up during planning and were deliberately not done. If they get rediscovered as "missing work" by future contributors, please reference this section first.

weakref.finalize defense-in-depth on the LLM wrappers. Designed and verified safe (no __del__ conflicts, __getattr__ doesn't intercept _finalizer, no reference cycles). Deferred until a fourth wave of "missed close" leaks justifies adding a new pattern that future contributors must understand. Current explicit-close discipline has held since #2712 / #2732 / #3018.
LLM caching in get_llm(). Bounding total ChatOllama instances to N=distinct configs would make leak shapes architecturally impossible. Orthogonal optimization, deferred — adds complexity around settings invalidation and multi-tenant isolation.
Pre-commit hook flagging get_llm() callers without close(). Useful in principle, deferred — high false-positive risk (caller-passed LLMs, lazy-init holders, factory-returned LLMs all legitimately don't close). Needs a careful design.
Dedicated /api/v1/health/fd diagnostic with eventpoll-inode dedupe. PR #3033 stalled at a basic version (Windows + RLIM_INFINITY bugs); PR #3036 added utilities/fd_monitor.py for cross-platform FD reading. A type/inode-breakdown extension is feasible but deferred until an active leak hunt actually needs it.
app_logs (ResearchLog) retention setting + scheduled cleanup job. Identified in Round 9; the only audit finding that wasn't refuted but also isn't impactful enough today. Trigger to do this work: a user reports the SQLCipher DB growing >100 MB and complains about query slowdown, OR a self-hosted instance keeping research logs for >1 year sees DB bloat, OR the metrics dashboard starts noting research-detail page load slowdown traced to app_logs joins. Implementation sketch: add logs.research_log_retention_days to defaults/default_settings.json (default 0 = disabled, preserves current behavior; e.g. 30 to keep last 30 days). Extend the existing BackgroundJobScheduler in scheduler/background.py (which already runs cleanup_inactive_users hourly and _reload_config every 30 min) with a daily _cleanup_old_research_logs job that deletes ResearchLog rows older than the retention window. Skip rows belonging to favorited / starred researches if a flag exists. ~30 LOC + a regression test that inserts old rows, triggers the job, asserts old rows are deleted and recent ones survive. Add changelog.d/<id>.feature.md.

Glossary

_owns_llm — instance flag set in __init__ to True when the class fetched its own LLM via get_llm(), False when an LLM was injected by the caller. Gates whether close() actually closes the LLM.
safe_close(resource, name) — helper in utilities/resource_utils.py that calls resource.close() inside a try/except, logging on failure. Never raises. Used in every finally block.
_ldr_closed — sentinel attribute set on inner httpx clients by _close_base_llm to make the function idempotent. Checked with is True (not truthy) so Mock objects without a spec don't trip the guard.
eventpoll FD — Linux a_inode file descriptor type for epoll_create'd kernel objects. Each asyncio event loop registers one. Leaked AsyncClients hold them via the loop's selector.

44 KiB Raw Blame History Unescape Escape