* docs(resource-cleanup): Round 9 audit results + conditional deferred fixes
Capture the Round 9 broader-resource-audit results so:
- Future contributors don't re-audit the same paths
- If a relevant production symptom ever appears, the doc points
directly at a pre-thought-through conditional fix
Round 9 ran two passes of three parallel agents looking for resource
leaks BEYOND FDs (memory/cache growth, thread/lock lifecycle, DB state
hygiene). Round 1 produced six HIGH-confidence findings; Round 2
verification refuted four of them and downgraded one.
Added to the audit ledger (next to the existing Wave 7 entry):
- Refuted findings with WHY they were refuted:
  - @cache on get_available_providers (called with None: hashable,
    cardinality 1; dicts would raise TypeError, not cache silently)
  - ThreadLocalSession identity-map growth (expire_on_commit=True
    default clears the map on every commit)
  - token_usage table unbounded growth (design-intentional permanent
    audit table; time-series compound indexes; /api/context-overflow
    queries historical windows)
  - search_calls table unbounded growth (same shape and verdict)
- Three per-user lock dicts (_user_init_locks, _user_locks,
  _user_critical_locks): technically correct that they never clean
  up on user delete, but ~296 bytes per user × 3 dicts = ~900 KB
  ceiling at 1000 users. Practically negligible.
- app_logs (ResearchLog) table: the one finding that survived
  verification as a real but small concern. No auto-retention; only
  cleaned by cascade-delete when parent Research row is manually
  removed. For users keeping all research, logs accumulate.
Added to "Intentionally not done (deferred)":
- app_logs retention setting + scheduled cleanup job. Includes the
trigger conditions that would justify the work and the
implementation sketch (settings key, daily APScheduler job,
regression test, news fragment).
- Per-user lock dict cleanup on user delete. Cosmetic; included with
trigger conditions and one-line-per-file sketch so it's actionable
if multi-user deployments ever see it.
No code changes. Documentation only.
* fix(resource-cleanup): pop per-user lock-dict entries on user close
Three module-level per-user lock dicts had no removal hook, so each
accumulated one ``threading.Lock`` entry per username over the
process lifetime:
- ``_user_init_locks`` in ``database/library_init.py`` (serializes
collection-init check-then-insert)
- ``_user_locks`` in ``database/backup/backup_service.py`` (per-user
backup serialization)
- ``_user_critical_locks`` on ``QueueProcessorV2`` (per-user
count-then-start critical section)
The ceiling was ~296 bytes/entry × 3 dicts ≈ ~900 bytes per user
across all three — bounded by total user count, microscopic relative
to the eventpoll FD leak that motivated the original investigation,
but real for long-lived multi-user instances with user-account
churn. Identified in Round 9 of the broader resource-leak audit
(see docs/developing/resource-cleanup.md).
The fix:
- Each module now exposes a ``pop_user_*_lock(username)`` function
(or method for the QueueProcessor instance dict) that pops the
entry under the existing per-dict lock.
- ``connection_cleanup._pop_per_user_locks(username)`` is a shared
helper that lazy-imports and calls all three with individual
try/except blocks so a failure in one doesn't skip the others.
- The helper is invoked from both user-close paths:
- ``cleanup_idle_connections`` (the 5-minute sweeper) after
``db_manager.close_user_database``
- ``web/auth/routes.py`` after the logout and password-change
``close_user_database`` calls
Pattern mirrors the existing per-user cleanup in those code paths
(scheduler.unregister_user, session_password_store.clear_*).
Tests in ``tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks``:
- Unit test: populate all three dicts, call the helper, assert all
three entries are gone.
- Idempotency: pop on a never-registered user must not raise.
- Integration: ``cleanup_idle_connections`` actually invokes the
helper for each user it closes (verified via the library-init dict
for "alice").
Doc updated: the entry that R9A2 identified as "technically correct,
practically negligible" is moved from the audit ledger's findings
list into a "Fixed in this PR" subsection; the matching
deferred-fix entry in "Intentionally not done" is removed.
Adds a towncrier bugfix fragment.
* review: address self-review findings on PR #4077
Four fixes from a multi-round agent review of this PR:
1. Move ``_pop_per_user_locks`` outside the ``close_user_database``
try/except in ``connection_cleanup.py``. Previously the pop was
inside the same try block, so a DB-close failure — the path that
``test_close_failure_does_not_abort_loop`` already exercises —
skipped the pop and leaked the lock-dict entry. New test
``test_pop_runs_even_when_close_user_database_fails`` pins this.
2. Bump three ``logger.debug`` to ``logger.warning`` in
``_pop_per_user_locks``. Matches the sibling scheduler-unregister
error handler in the same module; debug-level silently masked
lock-dict accumulation across cycles.
3. Doc accuracy fix in ``docs/developing/resource-cleanup.md``.
The entry called all three dicts "module-level" but
``_user_critical_locks`` is an instance attribute on
``QueueProcessorV2``. Rewrote to distinguish module-level dicts
from the instance attribute and to note the pop now runs outside
the close try/except.
4. Integration test pre-populates all three lock-dict entries and
asserts all three are absent post-cleanup, not just
``_user_init_locks``. Switched the test username from "alice"
(used by other tests in the module) to a dedicated sentinel.
Tests: 23/23 in ``tests/web/auth/test_connection_cleanup.py``.
Race-condition concerns flagged in Round 1 (TOCTOU between pop and
``_get_user_*_lock``) were verified in Round 2 to be either guarded
by Python's reference-counted lock semantics (``with lock:`` keeps
the original lock object alive after dict pop) or bounded to a +1
race window across multiple browser sessions — not ship-blocking.
The narrow ``library_init`` race surfaces as a propagated
``IntegrityError`` on collection insert, not silent corruption.
Resource cleanup in LDR
This document captures how LDR manages process-level resources (DB
connections, HTTP clients, file descriptors, threads) and the reasoning
trail behind the current model. It exists because file-descriptor
exhaustion has been a recurring class of bug in LDR, and the journey
of fixing it — what's been tried, what worked, what was ruled out — is
not reconstructable from git log alone.
If you're contributing code that holds a network connection, a database
session, an LLM client, or a thread, read this before adding __del__,
weakref.finalize, or a context manager.
Current model
Database connections
- One shared per-user QueuePool. No per-thread engines. Pool sizing: pool_size=20, max_overflow=40, with periodic dispose() every 30 minutes.
- SQLCipher is decrypted once per connection-open. PRAGMA key takes ~0.2 ms; pool reuse keeps that off the hot path.
- Engines are created at login, closed at logout (or process exit via the registered atexit shutdown).
- Background threads (research workers, metric writers, news scheduler jobs) use the same per-user pool; they no longer maintain a separate thread-engine system.
See ADR-0004 for the QueuePool-vs-NullPool decision and PR #3441 for the per-thread-engine removal.
LLM wrappers
LDR wraps every LLM in ProcessingLLMWrapper (and optionally
RateLimitedLLMWrapper) so that callers see a uniform interface and
the project owns the close path:
caller -> ProcessingLLMWrapper.close()
    -> _close_base_llm(base_llm) in utilities/llm_utils.py
        -> for ChatOllama:
            sync httpx client (ollama.Client._client) .close()
            async httpx client (ollama.AsyncClient._client) .aclose()
        -> for ChatOpenAI / ChatAnthropic:
            no close (those use @lru_cache'd shared httpx clients)
Key invariants:
- ChatOllama is the only provider where _close_base_llm() actually closes anything. ChatOpenAI and ChatAnthropic share LRU-cached httpx clients across instances; closing them would break other live LLMs.
- Both _client (sync) and _async_client (async) are released. The async side is exercised by every ainvoke() call (langgraph agents, modular strategies), so closing only the sync side leaks the async transport per call (root cause of #3816).
- The function is idempotent via an _ldr_closed sentinel on the inner httpx clients.
- The async close calls asyncio.run(client.aclose()) directly only when no event loop is currently running. When called from inside async code, it runs the same close in a short-lived daemon thread that owns its own loop (see Wave 7, #4047). The earlier "skip and leave the close to the loop's owner" behavior leaked the client, because no loop-owner cleanup existed.
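The sentinel-based idempotency described above can be sketched as follows. This is a minimal illustration, not the project's `_close_base_llm`: the `FakeSyncClient` / `FakeAsyncClient` classes and the `close_clients` function are hypothetical stand-ins, and only the `_ldr_closed` attribute name comes from this document. The running-loop case is deliberately left out here (Wave 7 in the History section covers it).

```python
import asyncio


class FakeSyncClient:
    """Hypothetical stand-in for the inner sync httpx client."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1


class FakeAsyncClient:
    """Hypothetical stand-in for the inner async httpx client."""

    def __init__(self):
        self.close_calls = 0

    async def aclose(self):
        self.close_calls += 1


def _already_closed(client):
    # Strict identity check: only a literal True counts as "closed".
    return getattr(client, "_ldr_closed", False) is True


def close_clients(sync_client, async_client):
    """Close both transports at most once; repeat calls are no-ops."""
    if sync_client is not None and not _already_closed(sync_client):
        sync_client.close()
        sync_client._ldr_closed = True
    if async_client is not None and not _already_closed(async_client):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop is running in this thread, so asyncio.run() is legal.
            asyncio.run(async_client.aclose())
            async_client._ldr_closed = True
        # If a loop IS running, this sketch does nothing and leaves the
        # sentinel unset so that a later call can retry.


sync_c, async_c = FakeSyncClient(), FakeAsyncClient()
close_clients(sync_c, async_c)
close_clients(sync_c, async_c)  # second call must not close again
```

Setting the sentinel only after a successful close means a failed or skipped close is retried on the next call rather than being silently recorded as done.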
Search engines
- BaseSearchEngine.close() is the single entry point and cascades into _preview_filters and _content_filters. That cascade is what releases per-engine LLMs (e.g., JournalReputationFilter.model), SearXNG sessions, and other filter-held resources.
- Search-engine cleanup happens at the per-research finally block in web/services/research_service.py:run_research_process() and at the programmatic API entry points in api/research_functions.py.
- The _owns_llm flag pattern (introduced in #2712) tracks whether a filter or engine constructed its own LLM (and thus owns it) versus borrowed one from a caller (and must not close it).
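A rough sketch of how the ownership flag and the close cascade fit together. The class names `FakeLLM`, `ExampleFilter`, and `ExampleEngine` are hypothetical; only `_owns_llm`, `_preview_filters`, and `_content_filters` are names taken from this document, and the real `BaseSearchEngine.close()` may differ in detail.

```python
import logging

logger = logging.getLogger(__name__)


class FakeLLM:
    """Hypothetical LLM stand-in that records whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ExampleFilter:
    """Hypothetical filter that either builds or borrows its LLM."""

    def __init__(self, llm=None):
        # We own the LLM only if we had to construct it ourselves.
        self._owns_llm = llm is None
        self.model = llm if llm is not None else FakeLLM()

    def close(self):
        if self._owns_llm and self.model is not None:
            self.model.close()
        self.model = None  # drop the reference either way


class ExampleEngine:
    """Hypothetical engine whose close() cascades into its filters."""

    def __init__(self, preview_filters=(), content_filters=()):
        self._preview_filters = list(preview_filters)
        self._content_filters = list(content_filters)

    def close(self):
        for flt in self._preview_filters + self._content_filters:
            try:
                flt.close()
            except Exception:
                # One failing filter must not stop the cascade.
                logger.warning("filter close failed", exc_info=True)


shared_llm = FakeLLM()
borrowing = ExampleFilter(llm=shared_llm)  # borrowed: must not close it
owning = ExampleFilter()                   # constructed its own LLM
owned_llm = owning.model
ExampleEngine([borrowing], [owning]).close()
```

After the cascade, the borrowed LLM is still usable by its original owner while the filter-constructed one has been released.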
Thread lifecycle
- @thread_cleanup (decorator on run_research_process and similar workers) ensures thread-local DB sessions are released even on abnormal exits.
- cleanup_current_thread() is called from Flask teardown, the queue processor, the auth flow, and the RAG routes (six tier-1 paths in total).
- Background threads are daemon threads; the process exit handles any thread that did not clean up gracefully.
Conventions
- Use safe_close(resource, "human name") from utilities/resource_utils.py for every cleanup. Never bare .close() in a finally (it can mask the original exception).
- Prefer try/finally over __del__. Python doesn't guarantee finalization order at interpreter exit; __del__ interacts subtly with reference cycles and weakref.
- Track ownership explicitly with _owns_llm (or analogous flag) when a class accepts an injected resource that may or may not be its own.
- News fragments (changelog.d/<id>.bugfix.md) are required for any user-visible cleanup behavior change; see changelog.d/README.md.
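To show why a bare `.close()` in a `finally` is discouraged, here is a small sketch. The `safe_close` body below is an assumption written for illustration (only its call shape, `safe_close(resource, "human name")`, comes from this document); `FlakyResource` and `do_work` are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def safe_close(resource, name):
    """Close resource; log a failure instead of raising it (assumed body)."""
    if resource is None:
        return
    try:
        resource.close()
    except Exception:
        logger.warning("Failed to close %s", name, exc_info=True)


class FlakyResource:
    """Hypothetical resource whose close() always fails."""

    def close(self):
        raise OSError("close failed")


def do_work():
    res = FlakyResource()
    try:
        raise ValueError("original failure")
    finally:
        # A bare res.close() here would raise OSError and replace the
        # ValueError as the exception the caller sees.
        safe_close(res, "flaky resource")


try:
    do_work()
except Exception as exc:
    caught = exc
```

With `safe_close`, the caller still sees the original `ValueError`; the close failure is only logged.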
How to close X correctly
| You're holding | Do this |
|---|---|
| A ChatOllama (raw or wrapped) | Call wrapper.close() in a finally, or pass to safe_close(wrapper, "..."). The wrapper chain handles both sync and async httpx clients. |
| A search engine you constructed | safe_close(engine, "...") in finally. The engine's close() cascades into preview/content filters. |
| A holder class with an LLM | Add a close() method, gate the LLM close on self._owns_llm, document who calls it. Don't add __del__. |
| A long-lived service holder (news scheduler, etc.) | Wrap construction in try/finally at the cycle boundary. Don't store the LLM if you can recreate it cheaply. |
| A DB session | Use with get_user_db_session(username) as session:. Don't bypass via get_settings_manager(username=...) without owns_session=False (see #3023). |
| An asyncio event loop | Use the existing one. If you genuinely need a new one (background thread fallback), call loop.close() in a finally — see news_strategy.py for the reference pattern (post-#3018). |
Anti-patterns
These look reasonable but break specific things in this codebase:
- Adding __del__ to a class with close(). At interpreter exit the logger, httpx, and event-loop modules may already be torn down. __del__ can run after them and raise. Use explicit close in a finally instead.
- Closing a shared httpx client. ChatOpenAI / ChatAnthropic share one httpx pool across instances via @lru_cache. Closing it kills every other live LLM in the same process. The Ollama check in _close_base_llm exists exactly to gate this.
- Truthy idempotency sentinels on Mock objects. Mock() without a spec auto-generates child Mocks for any attribute access, so getattr(client, "_ldr_closed", False) returns a truthy Mock and short-circuits the close. Always use is True / is None checks for sentinels; see the pattern in _close_base_llm.
- Skipping super().close() in a search-engine subclass. BaseSearchEngine.close() is what cascades into preview/content filters. Override it without calling super and you leak every filter's resources (this was a Copilot finding on #3818).
- Treating asyncio.run() as safe inside an event loop. It raises RuntimeError if called from a thread that already has a running loop. The pattern in _close_base_llm is: detect a running loop with get_running_loop(), call asyncio.run directly only in the no-loop case, and in the running-loop case run the close in a short-lived thread that owns its own loop (Wave 7, #4047). Simply skipping the close in that branch leaks the client, because no loop owner closes it.
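The Mock sentinel pitfall is easy to reproduce with the standard library alone. The snippet below is a small demonstration; `client` stands in for an inner httpx client in a test, and `_ldr_closed` is the sentinel name used in this document.

```python
from unittest.mock import Mock

client = Mock()  # no spec: any attribute access returns a child Mock

# Truthiness check: the auto-generated child Mock is truthy, so the code
# would wrongly conclude "already closed" and skip the real close.
loose_says_closed = bool(getattr(client, "_ldr_closed", False))

# Identity check: a child Mock is not the literal True, so the close runs.
strict_says_closed = getattr(client, "_ldr_closed", False) is True

# Once the sentinel is really set, the strict check recognises it.
client._ldr_closed = True
strict_after_mark = getattr(client, "_ldr_closed", False) is True
```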
History
The FD-leak campaign spans roughly four months of iterative work. Each fix narrowed the remaining surface; each subsequent issue was found in a corner the previous wave hadn't touched.
Wave 1 — initial leak inventory (Jan 2026)
- #1832, #1849, #1856, #1860: first comprehensive sweep. Identified seven distinct leak sources: auth_db engine, download_management DB, search cache, subprocess zombies, HTTP sessions in SemanticScholarSearchEngine and BaseDownloader, Socket.IO threads. Established context-manager + try/finally patterns. Added a pre-commit hook to catch missing cleanup at commit time.
Wave 2 — thread-local engine accumulation (Mar 2026)
- #2495: diagnosed that Flask's teardown only cleaned the request-scoped g.db_session while a separate _thread_engines dict accumulated NullPool engines per thread, leaking ~3 FDs per request. Added cleanup_current_thread() across six tier-1 paths.
- #2591: dead-thread engines (when threads crashed they left engines behind) plus stream=True socket holds in the generic downloader. Added a throttled dead-thread sweep, removed stream=True, raised the Docker ulimit from 1024 to 65536.
Wave 3 — LLM wrapper lifecycle (Mar 2026)
- #2708: diagnosed ChatOllama → httpx.Client chains with no __del__. With the news scheduler triggering 50–300 quick_summary() calls per hour, a 1024-FD container exhausted in 3–4 hours. Wrapped four programmatic API entry points in try/finally with explicit close.
- #2712: extracted close_llm() to a shared utility. Added close() and _owns_llm to NewsAnalyzer, HeadlineGenerator, TopicGenerator, JournalReputationFilter, DomainClassifier, GitHubSearchEngine, IntegratedReportGenerator, ElasticsearchSearchEngine, and the benchmark graders.
- #2756: wrapped bare .close() calls in finally blocks with safe_close() to prevent masking the original exception.
- #2732: moved close() into ProcessingLLMWrapper and RateLimitedLLMWrapper directly; eliminated the standalone close_llm() free function.
Wave 4 — DB session leaks + per-call patterns (late Mar / early Apr 2026)
- #3018: get_settings_manager(username=...) was bypassing g.db_session and creating QueuePool sessions per-thread; live diagnostics showed 321 sockets allocated, only 66 in use. DownloadService.close() leaked the inner SettingsManager session. Also fixed TopicBasedRecommender._create_recommendation_card() (per-call LLM with no cleanup) and an asyncio.new_event_loop() in news_strategy.py that never closed.
- #3204: test fixtures using return instead of yield left engines un-disposed. Migrated 8 test files to yield + engine.dispose().
Wave 5 — DB pool architecture (Apr 2026)
- #3340: kept QueuePool but minimized FDs (pool_size=1, max_overflow=2, periodic dispose() every 30 min).
- #3337 (closed): proposed switching SQLCipher engines to NullPool for zero persistent FDs. Superseded by #3441.
- #3441: removed per-thread NullPool engines entirely (~2,100 lines of sweep logic deleted) and routed metrics through a single shared per-user QueuePool with bounded sizing (pool_size=20, max_overflow=40).
- #3477: created ADR-0004 capturing the final pool model and updated stale FD calculations across docs.
Wave 6 — async client close (May 2026)
- #3818 (open, declined for merge): proposed session-pooling around safe_get / safe_post to address #3816. The session refactor is reasonable in isolation, but the lsof in #3816 showed ~72% of leaked FDs as a_inode [eventpoll] selectors, not HTTP request sockets, pointing at async-client transports rather than safe_get callers (whose response bodies were already consumed). See the PR comment for the full reasoning.
- #3855: extended _close_base_llm() to also close ChatOllama._async_client (the actual gap the lsof pointed to). Added the IntegratedReportGenerator close that was missing from the per-research finally block. Idempotency via _ldr_closed sentinels on the inner httpx clients.
Wave 7 — async close inside a running loop (May 2026)
- #4047: _close_base_llm's async branch had a documented "skip if a loop is running; loop owner closes" path. No loop-owner cleanup code existed anywhere in the project, so when the close was called inside an active asyncio loop the inner httpx.AsyncClient (and its epoll_create FD) was silently abandoned. Reproduced in production: a v1.6.10 single-host Ollama container reached 1024 FDs with the /proc histogram showing 929 anon_inode:[eventpoll] (91%), the same FD class as #3816 but in a code path #3855's fix didn't cover. The fix runs the async close in a brief daemon thread that owns its own loop, so asyncio.run(aclose()) works regardless of the caller's loop state. A bounded 5-second join keeps the cleanup from blocking shutdown when the Ollama server is unresponsive; on timeout _ldr_closed is left unset so a later call retries, and a WARNING surfaces so the situation is observable instead of silent.
- Healthcheck pidfd leak (same PR). Dockerfile's HEALTHCHECK CMD python -c "... urllib.request.urlopen(...)" had no timeout= argument; Docker's 10s timeout SIGKILL'd the sh -c parent but the python child was reparented to PID 1 and hung forever, each surviving child holding a pidfd + TCP socket against the app. Same /proc dump showed 64 anon_inode:[pidfd] (6%) from this. Adding timeout=8 lets the child return/raise inside Docker's budget so it exits cleanly and gets reaped.
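The "close in a daemon thread that owns its own loop" approach from #4047 can be sketched roughly as below. This is an illustration under assumptions, not the merged code: `FakeAsyncClient` and `close_async_in_thread` are hypothetical names, and only the 5-second bound, the daemon thread, and the `_ldr_closed` sentinel behavior are taken from the description above.

```python
import asyncio
import logging
import threading

logger = logging.getLogger(__name__)


class FakeAsyncClient:
    """Hypothetical async client; delay simulates a slow server."""

    def __init__(self, delay=0.0):
        self.delay = delay
        self.closed = False

    async def aclose(self):
        await asyncio.sleep(self.delay)
        self.closed = True


def close_async_in_thread(client, timeout=5.0):
    """Return True if aclose() completed within timeout seconds."""
    outcome = {}

    def _runner():
        # A fresh thread has no running loop, so asyncio.run() is legal
        # here no matter what the calling thread is doing.
        try:
            asyncio.run(client.aclose())
            outcome["ok"] = True
        except Exception:
            logger.warning("async close failed", exc_info=True)

    worker = threading.Thread(target=_runner, name="async-close", daemon=True)
    worker.start()
    worker.join(timeout)  # bounded wait; never blocks shutdown for long
    if worker.is_alive() or not outcome.get("ok"):
        logger.warning("async close did not finish; will retry later")
        return False  # sentinel stays unset so a later call retries
    client._ldr_closed = True
    return True


async def caller_inside_loop():
    # We are inside a running loop; asyncio.run() would raise here.
    client = FakeAsyncClient()
    ok = close_async_in_thread(client)
    return ok, client.closed


inside_loop_result = asyncio.run(caller_inside_loop())

slow_client = FakeAsyncClient(delay=0.3)
timed_out_result = close_async_in_thread(slow_client, timeout=0.05)
```

The join briefly blocks the calling loop, which is the trade-off the bounded timeout is there to limit. The daemon flag means a hung close cannot keep the process alive at exit.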
Audit ledger — what the broader sweep checked
The PR included a wide audit (50+ parallel exploration agents across
seven rounds plus direct /proc inspection) to catch any other latent
FD leak. To save the next contributor from re-running the same checks,
here is the full ledger:
Checked and confirmed clean (no action needed)
- Non-Ollama LLM providers. xAI, Google Gemini, OpenRouter, IONOS, LM Studio, llama.cpp HTTP, DeepSeek, OpenAI-compatible endpoint, plus OpenAI and Anthropic themselves. All extend ChatOpenAI or ChatAnthropic, which use @lru_cache'd shared httpx clients. _close_base_llm's short-circuit on these classes is correct by design: closing them would brick every other live LLM in the process.
- HTTP session lifecycle. Six instantiation sites checked (PricingFetcher aiohttp, LDRClient SafeSession, BaseDownloader, SemanticScholarSearchEngine, MCPClient, CostCalculator). All context-managed via with or owned by a class with a paired close() and __exit__.
- subprocess / pidfd. Three call sites, all subprocess.run() (blocking). No subprocess.Popen paths anywhere in src/. No ProcessPoolExecutor. No FD leak surface beyond the healthcheck child, already addressed by the Dockerfile timeout=8 change.
- asyncio event loops. Zero raw asyncio.new_event_loop() outside safe asyncio.run() patterns. The historical leak in news_strategy.py (#3018) is still fixed.
- File handles. All 37 open() call sites are inside with. Zero bare opens. tempfile.NamedTemporaryFile / TemporaryDirectory all context-managed.
- SocketIO connect/disconnect. Non-disconnect handlers (subscribe, unsubscribe, connect) do not acquire DB sessions (an early-round agent claim that they did was refuted on re-read). The __socket_subscriptions dict is cleaned on disconnect. The PID-1 FD breakdown showed only 3 sockets out of 1024, so socket accumulation is not a contributor.
Flagged by audit, then verified NOT a real FD leak
- OllamaEmbeddings httpx. LDR imports the deprecated langchain_community.embeddings.OllamaEmbeddings, which uses requests.post() per call: no persistent httpx client, no _client / _async_client attribute. Direct introspection: [a for a in dir(e) if 'client' in a.lower()] returns []. Zero FDs per call. An audit agent confused this class with ChatOllama, which is a different class. See the future-proofing note below; this changes when langchain forces the move.
- auth_db and journal_quality engines escaping shutdown_databases(). auth_db uses QueuePool(pool_size=10, max_overflow=20) and journal_quality uses StaticPool with immutable=1. Both are bounded and do not grow at runtime. Live /proc on the affected container showed only 21 SQLite-related FDs total on PID 1, well below the ~91-FD ceiling these unmanaged engines could theoretically reach. The kernel reclaims FDs at process exit regardless of engine.dispose(), and SQLite WAL files auto-checkpoint on next open. Missing dispose at exit is hygiene, not a leak.
- LibraryRAGService in three RAG SSE endpoints. rag_routes.py:693, 1054, 1827 do construct the service outside the generator and never close it, but LibraryRAGService.close() only sets references to None; it releases no FDs. FAISS uses pickle.load() (not mmap); OllamaEmbeddings holds no FDs per the item above; the SentenceTransformer model+tokenizer mmaps are process-wide singletons. What gets delayed is ~50–200 MB of embedding-model RAM until GC. A memory-pressure question, not the eventpoll FD class this Wave addressed.
- Residual pidfd accumulation via Playwright fallback: identified in a Round-8 follow-up after the eventpoll fix landed. Live /proc on the prerelease container showed ~29 pidfds steady state, growing ~3.6/hour, all targeting Pid: -1 (children that had exited). Rate was stable during active benchmark execution, ruling out a per-task source. Eight parallel agents converged on the same chain: _check_subscription → quick_summary → FullSearchResults.batch_fetch_and_extract → AutoHTMLDownloader fallback to PlaywrightHTMLDownloader._fetch_with_playwright. Each sync_playwright().start() invokes asyncio.create_subprocess_exec() for the Node.js driver (opens a pidfd via Linux's PidfdChildWatcher); the driver then fails because Chromium is not installed in the production ldr Dockerfile stage (only ldr-test runs playwright install --with-deps chromium), and the asyncio child watcher does not promptly close the pidfd on the failed-child exit. CPython 3.14 was confirmed to not use pidfd in subprocess.py at all (subprocess.run / Popen use waitpid(WNOHANG) polling), so subprocess-based hypotheses were ruled out. Fixed by PR #3971 (default web.enable_javascript_rendering=false): the fallback short-circuits before any subprocess is spawned, so no pidfd is opened. The PR was motivated by issue #3826 (confusing tracebacks); the FD-leak finding is the second motivation, surfaced here.
Minor findings (not steady-state leaks; worth knowing)
- Daemon threads without explicit shutdown. search_cache.py cache-cleanup thread, journal_reputation_filter.py background fetcher, log_utils.py queue processor, parallel_search_engine.py global executor. All daemonized and reaped by the OS at process exit. Not steady-state leaks; no per-request growth.
- Abandoned-research thread on socket disconnect. If a client closes the tab mid-research, the socket subscription is removed but the research thread keeps running until completion; _active_research[research_id] is not cleared on disconnect. Not an FD leak; potentially compute/memory waste if the user wanted the research to stop. Out of scope for the FD-leak story.
Future-proofing note — langchain_ollama.OllamaEmbeddings migration
langchain_community.embeddings.OllamaEmbeddings is deprecated
("will be removed in langchain 1.0.0", per the import warning). Its
replacement, langchain_ollama.OllamaEmbeddings, does carry
_client and _async_client attributes — same shape as
ChatOllama. Verified by direct introspection:
langchain_ollama.OllamaEmbeddings client attrs:
['_set_clients', 'async_client_kwargs', 'client_kwargs',
'sync_client_kwargs']
Has _client? True
Has _async_client? True
So: today's LDR is FD-safe on the embeddings side by accident
(using the deprecated class). The day the deprecated class is removed
upstream and LDR migrates, the in-running-loop eventpoll FD leak
class will reappear for embeddings unless _close_base_llm is
generalized to introspect either a chat model or an embeddings
instance with the same _client / _async_client shape. The
extension is a small type-check broadening, not a redesign.
Round 9 — broader resource audit (May 2026)
Once the FD-leak classes were closed, a follow-up audit looked for other slow-growth patterns that wouldn't trip the FD counters but could still degrade a long-running container: memory and cache growth, thread / asyncio Task / lock lifecycle, DB state hygiene beyond connections. Three parallel agents per round, two rounds (Round 1 hypothesis generation, Round 2 fact-check), captured here in verified form so the next contributor doesn't re-derive the same conclusions.
Refuted (false positives from Round 1, verified in Round 2)
- @cache on get_available_providers (config/llm_config.py:158). Round 1 claimed unbounded cache growth if the function were called with differing settings_snapshot dicts. Round 2 verified: dicts are unhashable, so @cache would raise TypeError on them, not silently grow. In practice the call sites pass settings_snapshot=None (hashable, cardinality 1). Not a leak.
- Thread-local Session identity-map growth (database/thread_local_session.py). Round 1 claimed long-running research threads would accumulate ORM objects in the per-thread Session's identity map. Round 2 verified: SQLAlchemy's default expire_on_commit=True clears the identity map at every commit; the codebase commits periodically. Bounded by typical query volume, not unbounded by uptime.
- token_usage table unbounded growth. Append-only per LLM call with no TTL or retention job. Round 2 verified: feature by design. Schema has compound time-series indexes (idx_token_research_timestamp, etc.); /api/context-overflow and /metrics/api/metrics explicitly query historical windows for cost analysis. The table is a permanent audit trail by intent. Adding retention would break the metrics dashboards.
- search_calls table unbounded growth. Same shape and same verdict: compound time-series indexes confirm intentional design as a permanent search-analytics record.
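The @cache refutation can be checked with a few lines of standard-library Python. The function body below is a hypothetical placeholder; only the function name and the `settings_snapshot` parameter come from this document.

```python
from functools import cache


@cache
def get_available_providers(settings_snapshot=None):
    # Placeholder return value; the real function inspects configuration.
    return ("provider-a", "provider-b")


# Repeated calls with the same hashable argument reuse one cache entry.
get_available_providers(settings_snapshot=None)
get_available_providers(settings_snapshot=None)
info_after_none = get_available_providers.cache_info()

# A dict argument is unhashable, so the cache lookup itself fails loudly
# instead of quietly adding a new entry per distinct dict.
try:
    get_available_providers(settings_snapshot={"llm.provider": "ollama"})
    dict_error = None
except TypeError as exc:
    dict_error = exc

size_after_dict = get_available_providers.cache_info().currsize
```

One caveat worth keeping in mind: functools keys on the call shape as written, so mixing positional and keyword forms of the same argument would create separate entries. The cardinality stays bounded either way.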
Fixed in this PR — three per-user lock dicts
- Three per-user lock dicts: _user_init_locks and _user_locks are module-level dicts in database/library_init.py and database/backup/backup_service.py respectively; _user_critical_locks is an instance attribute on the QueueProcessorV2 singleton in web/queue/processor_v2.py. Each stored one threading.Lock per username with no removal hook. Bounded ceiling (~296 bytes/entry × 3 dicts at 1000 users = ~900 KB), so not urgent, but easy to fix cleanly. The two module-level dicts now expose pop_user_init_lock / pop_user_lock functions; the queue processor exposes the equivalent as an instance method queue_processor.pop_user_critical_lock. A shared _pop_per_user_locks(username) helper in connection_cleanup.py calls all three with lazy imports and individual try/except (WARNING-level so dict accumulation is observable, matching the sibling scheduler-unregister error path). The helper is invoked unconditionally, outside the close_user_database try/except so it still runs when the DB close itself fails, in both the idle-connection sweeper (connection_cleanup.py:cleanup_idle_connections) and the logout / password-change paths (web/auth/routes.py). Tests in tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks cover the helper directly and through the idle-close path.
Real but small (survives verification)
- app_logs (ResearchLog) table: no automatic retention. Grows by hundreds to thousands of rows per research. Cleaned only via cascade-delete when the parent Research row is deleted manually. Unlike token_usage / search_calls, this table has no UI dashboard or time-series API consuming it; it's debug context for a specific research session, not an analytics record. For users who keep all research, logs accumulate indefinitely. See "Intentionally not done (deferred)" for the retention design when a symptom report justifies it.
Debugging FD leaks — playbook for the next one
When the next FD leak shows up (and there will be one, eventually), this section is the shortcut. It captures the actual diagnostic flow that worked across Waves 6 and 7 so a future contributor doesn't have to re-derive it from the symptom.
0. Symptoms that mean "investigate this as an FD leak"
- Tracebacks like OSError: [Errno 24] Too many open files, typically from selectors.DefaultSelector() in werkzeug or send_from_directory in Flask. These are usually the first visible failure.
- Browser-side MIME-type errors on static assets (text/html instead of text/css / application/javascript). These are downstream of FD exhaustion: Flask can't open the static file, returns an HTML 500, and the browser refuses to apply it because of X-Content-Type-Options: nosniff.
- "High FD count (N) — approaching system limit" warnings from web/auth/connection_cleanup.py (fires at FD > 800 every 5-minute cleanup tick).
- Container health turns unhealthy because the healthcheck urlopen hangs on a process that no longer has FDs to accept connections.
1. Capture diagnostic state BEFORE restarting
The single most important rule: the snapshot does not survive a container restart. Every minute spent on the live broken container is worth an hour of after-the-fact agent guessing. Save the diagnostic output to a host-side file first.
One-shot host-side snapshot (works even when the container is
FD-starved enough that docker exec can't fork)
# Run on the Docker host. No docker exec required.
P=$(docker inspect -f '{{.State.Pid}}' <container-name>)
sudo bash -c "
echo '=== Total FDs ==='
ls /proc/$P/fd | wc -l
echo '=== FD-type histogram (digits collapsed) ==='
ls -l /proc/$P/fd | awk '{print \$NF}' \
| sed -E 's/\[[0-9]+\]/[N]/g; s/[0-9]{4,}/NUM/g' \
| sort | uniq -c | sort -rn | head -30
echo '=== Counts by category ==='
printf 'socket: %s\n' \$(find /proc/$P/fd -lname 'socket:*' | wc -l)
printf 'pipe: %s\n' \$(find /proc/$P/fd -lname 'pipe:*' | wc -l)
printf 'eventpoll: %s\n' \$(find /proc/$P/fd -lname '*eventpoll*' | wc -l)
printf 'pidfd: %s\n' \$(find /proc/$P/fd -lname '*pidfd*' | wc -l)
printf 'WAL files: %s\n' \$(find /proc/$P/fd -lname '*-wal' | wc -l)
printf 'SHM files: %s\n' \$(find /proc/$P/fd -lname '*-shm' | wc -l)
printf '.db files: %s\n' \$(find /proc/$P/fd -lname '*.db' | wc -l)
" | tee /tmp/ldr-fd-snapshot.txt
Why host-side: reading the container's PID 1 FDs from inside the
container requires the same UID that started PID 1. The Dockerfile
entrypoint runs as root then setprivs to ldruser, so the
docker exec shell (ldruser) cannot readlink PID 1's FDs even though
it can count them. Host root via sudo sidesteps the UID check.
Inside-container alternative (if the host is locked down)
docker exec --user 0 <container-name> sh -c '...same body...'
--user 0 runs the exec'd shell as root inside the container,
sidestepping the same UID restriction.
2. The lookup table — FD type → likely source
| Dominant FD type | Likely source | Diagnostic deep-dive |
|---|---|---|
| `anon_inode:[eventpoll]` | asyncio event loop or `httpx.AsyncClient` selector. Each leaked async client = +1. | Grep `asyncio.create_subprocess`, `httpx.AsyncClient`, `_async_client`, `ainvoke`. See Wave 6, Wave 7. |
| `anon_inode:[pidfd]` | `asyncio.create_subprocess_*` or `multiprocessing.Process` (uses `pidfd_open` on Linux). | Read `/proc/PID/fdinfo/N` for each pidfd; the `Pid:` line shows the target (-1 = child already exited). |
| `socket:*` (lots) | HTTP keep-alive, SSE streams, SocketIO connections. | Cross-reference with `/proc/PID/net/tcp` states; check Round 7 R7A8 patterns. |
| `pipe:*` (lots) | `subprocess.run`/`Popen` with `stdout=PIPE`, multiprocessing IPC, loguru queue. | Check `subprocess.run` sites and APScheduler executor type. |
| `REG *-wal` / `*-shm` | SQLCipher in WAL mode. Each pooled connection holds ~3 FDs. | See ADR-0004. If growing without bound, the periodic `engine.dispose()` is silently failing. |
| `REG /data/*.db` (lots) | Plain SQLite connections from an engine without bounded pool. | Audit `create_engine` sites (R7A6 caught two unmanaged ones). |
| `REG /home/...mmap...` | Memory-mapped model weights or FAISS indexes; usually process-wide singletons (not leaks). | Check whether the count grows per request. If yes → real leak. |
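When the snapshot has to be taken from Python rather than from a shell (for example inside a diagnostic endpoint), the same classification can be sketched as below. This is a minimal illustration, not code from the repository: `classify_fd_target` and `snapshot_fd_types` are hypothetical names, and the match rules simply mirror the `find -lname` patterns of the snapshot script.

```python
import os
from collections import Counter

# Substring rules mirror the '*eventpoll*' and '*pidfd*' patterns of the shell script.
_SUBSTRINGS = [("eventpoll", "eventpoll"), ("pidfd", "pidfd")]
# Prefix rules mirror 'socket:*' and 'pipe:*'.
_PREFIXES = [("socket:", "socket"), ("pipe:", "pipe")]
# Suffix rules mirror '*-wal', '*-shm' and '*.db'.
_SUFFIXES = [("-wal", "wal"), ("-shm", "shm"), (".db", "db")]


def classify_fd_target(target: str) -> str:
    """Map a readlink() result from /proc/<pid>/fd to a coarse FD class."""
    for needle, label in _SUBSTRINGS:
        if needle in target:
            return label
    for prefix, label in _PREFIXES:
        if target.startswith(prefix):
            return label
    for suffix, label in _SUFFIXES:
        if target.endswith(suffix):
            return label
    return "other"


def snapshot_fd_types(fd_dir: str) -> Counter:
    """Count FD classes for every entry under fd_dir (e.g. /proc/1234/fd)."""
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir and readlink, or the UID check failed.
            counts["unreadable"] += 1
            continue
        counts[classify_fd_target(target)] += 1
    return counts
```

A large `unreadable` count is the Python-side symptom of the UID restriction described in section 1: the entries can be listed but not resolved.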
3. Pinpointing the source for a specific FD type
Eventpoll
anon_inode:[eventpoll] always comes from EpollSelector — created
by every asyncio loop and every httpx.AsyncClient. Grep:
grep -rn 'asyncio.create_subprocess\|httpx.AsyncClient\|_async_client' src/
Then check whether each site explicitly closes the client. The Wave 7
fix to _close_base_llm is the reference pattern for "close async
httpx even when called inside a running loop."
Pidfd
Pidfds expose their target PID via fdinfo:
# Run inside the container (or via docker exec --user 0):
for fd in $(ls /proc/1/fd 2>/dev/null); do
link=$(readlink /proc/1/fd/$fd 2>/dev/null)
case "$link" in
*pidfd*)
tpid=$(awk '/^Pid:/ {print $2}' /proc/1/fdinfo/$fd 2>/dev/null)
if [ "$tpid" -gt 0 ] 2>/dev/null; then
cmd=$(tr '\0' ' ' < /proc/$tpid/cmdline 2>/dev/null | cut -c1-80)
echo "fd=$fd alive pid=$tpid : $cmd"
else
echo "fd=$fd ORPHAN (child exited; pidfd not closed)"
fi
;;
esac
done
A high "ORPHAN" count = something called asyncio.create_subprocess_*
or multiprocessing.Process, the child exited, but the pidfd in the
parent was never closed. Common in Round-8: Playwright's Node.js
driver subprocess failing because Chromium isn't installed in the
production image.
Note: CPython 3.14's subprocess.py does not use pidfd at all
(waitpid(WNOHANG) polling instead). So pidfds in a 3.14 process
necessarily come from asyncio or multiprocessing, not from
subprocess.run / Popen.
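The `Pid:` parsing that the shell loop does with `awk` can also be expressed in Python, which is handy when the check has to run inside a test or a diagnostic helper. A minimal sketch follows; `pidfd_target_pid` and `is_orphan_pidfd` are hypothetical names, and the input is assumed to be the text of `/proc/<pid>/fdinfo/<fd>` for a pidfd.

```python
from typing import Optional


def pidfd_target_pid(fdinfo_text: str) -> Optional[int]:
    """Return the PID from the 'Pid:' line of a pidfd's fdinfo, or None if absent."""
    for line in fdinfo_text.splitlines():
        if line.startswith("Pid:"):
            try:
                return int(line.split(":", 1)[1].strip())
            except ValueError:
                return None
    return None


def is_orphan_pidfd(fdinfo_text: str) -> bool:
    """True when the target has exited (Pid: -1) but the pidfd is still open."""
    pid = pidfd_target_pid(fdinfo_text)
    # A missing Pid: line means the FD is not a pidfd at all, so it is not an orphan.
    return pid is not None and pid <= 0
```

Unlike the shell loop, this sketch does not report an FD without a `Pid:` line as an orphan; it assumes the caller has already filtered for pidfds.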
Syscall-level pinpointing with bpftrace (mysterious cases)
When the source isn't obvious from the FD type, bpftrace can record
the Python stack of every relevant syscall on the live process. This
would have caught the Playwright leak in seconds instead of two rounds
of agent exploration. Requires kernel headers and bpftrace installed
on the host (NOT the container — bpftrace runs in host kernel space
and can target a host PID by number):
# Find host-side PID of container's PID 1
P=$(docker inspect -f '{{.State.Pid}}' <container>)
# Trace every pidfd_open syscall, grouped by user-stack:
sudo bpftrace -e "tracepoint:syscalls:sys_enter_pidfd_open
/pid == $P/ { @[ustack(perf)] = count(); }"
# Same idea for epoll_create / epoll_create1 (eventpoll FDs):
sudo bpftrace -e "tracepoint:syscalls:sys_enter_epoll_create1
/pid == $P/ { @[ustack(perf)] = count(); }"
Let it run for a minute, then Ctrl-C; you get a histogram of every
unique stack that triggered the syscall, ranked by frequency. The hot
stacks are your culprits. Works for any syscall — useful future
candidates: socket, inotify_init1, timerfd_create,
memfd_create.
WAL/SHM
engine.dispose() is expected to release these. If the count climbs
across the periodic 30-minute dispose cycles, the dispose is silently
failing. The observability commit (f86c3f7af) elevates dispose
failures to WARNING — check the logs for Error disposing engine for <user>.
4. Existing instrumentation already in the codebase
- `_count_open_fds()` at `src/local_deep_research/web/auth/connection_cleanup.py:50`: fast `/proc/self/fd`-based counter with macOS fallback. Reusable.
- `Resource monitor: open_fds=…` debug log line at `connection_cleanup.py:184`, fires every 5-minute cleanup tick.
- `High FD count (N)` WARNING at `connection_cleanup.py:190` when FDs exceed 800. The single most useful production signal.
- `fd_monitor.py` (PR #3036): cross-platform helper used by diagnostic endpoints.
- `tests/manual_smoke/test_fd_smoke.py` (PR #3930, the `test/manual-fd-smoke-suite` branch): opt-in pytest harness that runs the close cycle N times and asserts the FD count stays flat. Gated by `RUN_MANUAL_SMOKE=1`; not part of the default CI run because it needs a live Ollama and produces noise. Extend this suite when you ship a leak fix.
- In-CI FD-growth canaries in `tests/utilities/test_close_base_llm.py`. These run on every PR:
  - `TestCloseBaseLLMRealHttpxAsync::test_no_fd_growth_across_repeated_close_cycles`: guards the eventpoll FD class against Wave-6-shaped regressions.
  - `TestCloseBaseLLMRealHttpxAsync::test_no_fd_growth_when_closed_inside_running_loop`: guards the Wave-7-shaped in-running-loop skip regression.
  - `TestAsyncioSubprocessFDBaseline::test_no_fd_growth_across_asyncio_subprocess_cycles`: guards the pidfd FD class against the child-watcher leak shape.
  - `TestAsyncioSubprocessFDBaseline::test_no_fd_growth_when_subprocess_fails_to_exec`: pins the exact Wave-7-pidfd shape (failed exec, child watcher must still clean up). Catches platform-level regressions in Python's asyncio child watcher.
All four use `_open_fd_count()` (also in that file), which reads `/proc/self/fd` on Linux with an `RLIMIT_NOFILE` fallback on macOS. Slack is +2 FDs across 5–10 iterations. A real per-cycle leak would blow past that.
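The canary shape (run a cycle N times, allow a small slack, fail on growth) can be sketched as follows. This is an illustration rather than the code in `test_close_base_llm.py`: `open_fd_count` and `assert_no_fd_growth` are hypothetical names, and the `/dev/fd` fallback is an assumption that differs from the `RLIMIT_NOFILE` fallback the real helper uses.

```python
import os
from typing import Callable


def open_fd_count() -> int:
    """Count this process's open FDs via /proc/self/fd, falling back to /dev/fd."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            # The listing includes the FD opened for the listing itself; that
            # constant offset cancels out when two counts are compared.
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    raise RuntimeError("no FD directory available on this platform")


def assert_no_fd_growth(cycle: Callable[[], None], iterations: int = 10, slack: int = 2) -> None:
    """Run cycle repeatedly and fail if the FD count grows by more than slack."""
    cycle()  # warm-up, so lazy one-time allocations land before the baseline
    baseline = open_fd_count()
    for _ in range(iterations):
        cycle()
    growth = open_fd_count() - baseline
    assert growth <= slack, f"FD count grew by {growth} over {iterations} cycles"
```

The warm-up call matters: the first invocation of many close paths imports modules or creates singletons that legitimately hold FDs, and counting those would make the slack meaningless.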
4a. Development-time detection (catch leaks at test time)
Production /proc inspection catches leaks after they ship. The cheapest catch is to make Python itself complain at test time. Three Python features cooperate to surface unclosed resources during a normal test run — none of them were on by default during Waves 6 and 7, which is part of why those leaks made it to production.
PYTHONASYNCIODEBUG=1 plus -W default::ResourceWarning. When
asyncio debug mode is on, unclosed transports/coroutines emit a
ResourceWarning at GC time. The -W filter makes Python actually
display them. Together they would have caught the Wave 7 in-running-loop
skip: every leaked httpx.AsyncClient produces a visible warning the
first time the GC sweeps after the test fixture exits. From
the asyncio dev docs:
> When a transport is no longer needed, call its `close()` method to release resources. ... If a transport or an event loop is not closed explicitly, a `ResourceWarning` warning will be emitted in its destructor.
To enable in `pyproject.toml` `[tool.pytest.ini_options]` (the `env` key is not built into pytest; it needs a plugin such as pytest-env):
filterwarnings = [
"default::ResourceWarning",
]
env = [
"PYTHONASYNCIODEBUG=1",
]
Or in CI for a one-off check:
PYTHONASYNCIODEBUG=1 python -W default::ResourceWarning -m pytest tests/
For a CI gate that fails on any leak (more aggressive — use only on a targeted subset of tests, not the whole suite, because third-party libraries also emit ResourceWarning):
filterwarnings = [
"error::ResourceWarning",
]
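To see what these filters surface, the mechanism can be reproduced in a few lines of standard-library Python. The sketch below records every `ResourceWarning` raised while an operation runs; `resource_warnings_from`, `leaky_operation` and `clean_operation` are hypothetical names used only for illustration.

```python
import gc
import os
import warnings
from typing import Callable, List


def resource_warnings_from(operation: Callable[[], None]) -> List[warnings.WarningMessage]:
    """Run operation and return the ResourceWarnings emitted during and right after it."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" overrides the default filter, which hides ResourceWarning.
        warnings.simplefilter("always", ResourceWarning)
        operation()
        gc.collect()  # force destructors of unreachable objects to run now
    return [w for w in caught if issubclass(w.category, ResourceWarning)]


def leaky_operation() -> None:
    open(os.devnull)  # never closed: the file object's destructor emits the warning


def clean_operation() -> None:
    with open(os.devnull):
        pass  # closed by the context manager, so no warning
```

Calling `resource_warnings_from(leaky_operation)` returns at least one entry whose message names the unclosed file, while the clean variant produces none for that file. The same pattern applies to an unclosed transport or client, as long as its destructor runs inside the `catch_warnings` block.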
python -X dev. Enables Python's dev mode, which turns on a
bundle of safety checks including ResourceWarning display, asyncio
debug mode, and warnings as default. Cheap one-flag alternative for
local development; not recommended in production (overhead).
python -X dev -m pytest tests/
psutil for portable FD counting in tests. Our in-codebase
_count_open_fds uses /proc/self/fd (Linux-fast path, macOS
fallback). psutil is the cross-platform alternative many other
projects use:
- `psutil.Process().num_fds()`: UNIX only (Linux, macOS, BSD; not Windows); same number as our helper.
- `psutil.Process().open_files()`: list of named files; gives the paths for `REG`-type FDs (e.g., `/data/*.db-wal`).
- `psutil.Process().connections(kind='all')`: sockets visible to the process, with state and remote address. psutil 6.0 deprecates this name in favor of `net_connections()`.
These are useful in unit tests when you want to assert "no new file
of pattern X is open after the close path runs," and they work on the
macOS dev environments without /proc.
**For tracking which Python object holds a leaked FD: `tracemalloc` and `objgraph`.** Not FD tools per se, but when a leak is reproducible, take a `tracemalloc` snapshot before and after the suspect operation and diff them; the new allocation is usually the wrapper holding the FD. `objgraph.show_backrefs([leaked_obj])` then renders the reference chain keeping it alive. `tracemalloc` ships with the standard library; `objgraph` is a small pure-Python package from PyPI that needs Graphviz to render its graphs.
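The before/after diff can be sketched with the standard library alone. `allocation_growth` is a hypothetical helper name; the `tracemalloc` calls themselves (`start`, `take_snapshot`, `compare_to`) are the documented API.

```python
import tracemalloc
from typing import Callable, List


def allocation_growth(operation: Callable[[], None], limit: int = 5) -> List[tracemalloc.StatisticDiff]:
    """Return the source lines whose allocations changed most while operation ran."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        operation()
        after = tracemalloc.take_snapshot()
    finally:
        if started_here:
            tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]
```

Each returned entry carries a `traceback` pointing at the allocating line and a `size_diff` in bytes, so the wrapper that survived the operation usually shows up in the first few rows. Objects that are freed before the second snapshot do not appear, which is what separates a leak from transient work.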
5. Why we don't have an automated FD-growth test in CI
Several reasons, weighed during Wave 6 and Wave 7:
- Per-request FD growth is hard to assert. Many legitimate request paths transiently open and close FDs; a noisy delta is the norm. Distinguishing "leak" from "in-flight" requires a stable quiescent state, which a CI test doesn't naturally provide.
- The CI environment spawns its own subprocesses. pytest, coverage, gunicorn workers (for some test variants), gh-runner cleanups — all add their own FDs that pollute the count.
- PID-namespace differences between CI and prod. Counts you observe in a CI container's /proc are not directly comparable to a production container's /proc; the subprocess sources differ.
- The actual leaks have been "slow drip" patterns that need hours of uptime to surface. Wave 6's eventpoll leak took multiple hours of `ainvoke` calls to reach the 1024 cap. CI can't run for hours per PR.
What works instead:
- Per-leak unit-level regression tests. Each fix in Waves 1-7 landed with a targeted test that exercises the specific close path (e.g. `tests/utilities/test_close_base_llm.py::test_no_fd_growth_when_closed_inside_running_loop`). These are fast, deterministic, and run on every PR.
- Opt-in manual smoke suite (`RUN_MANUAL_SMOKE=1`) for the end-to-end "run-the-cycle-N-times-and-count" pattern, used during investigation but not on every CI run.
- Production `/proc` inspection when a leak is suspected, using the playbook above. Faster than CI for the long-drip patterns.
If you want to add a long-run CI job, the right shape would be a nightly workflow (not per-PR) that:
- Builds the production Docker image.
- Starts it with a synthetic user account and ~5 news subscriptions.
- Lets it idle for 20-30 minutes.
- Runs the host-side snapshot script above.
- Asserts `total FDs < N` and `eventpoll < M` and `pidfd < K`, where the thresholds are tuned for the steady-state ceilings the codebase intentionally permits (auth_db pool, etc.).
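The final assertion step could be sketched as a small parser over the snapshot file plus a ceiling check. Everything here is an assumption for illustration: `parse_snapshot` and `check_ceilings` are hypothetical names, the input is assumed to follow the `label: count` lines that the snapshot script prints, and the ceilings passed in are placeholders to be tuned, not measured baselines.

```python
from typing import Dict, List


def parse_snapshot(text: str) -> Dict[str, int]:
    """Parse 'label: count' lines (as written by the snapshot script) into a dict."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counts[label.strip()] = int(value)
    return counts


def check_ceilings(counts: Dict[str, int], ceilings: Dict[str, int]) -> List[str]:
    """Return one message per violated ceiling; an empty list means the run passed."""
    problems = []
    for label, ceiling in ceilings.items():
        if label not in counts:
            # A missing row means the snapshot is broken; do not let it pass silently.
            problems.append(f"{label}: missing from snapshot")
        elif counts[label] >= ceiling:
            problems.append(f"{label}: {counts[label]} >= ceiling {ceiling}")
    return problems
```

Treating a missing label as a failure is a deliberate choice: a nightly job that silently passes because the snapshot script broke is worse than no job.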
That would have caught Waves 6 and 7 in a single nightly cycle instead of through a user crash report. It doesn't exist yet for two reasons: cost (a half-hour idle job per night per platform) and the lack of a clear baseline. The Round-8 finding is the moment to consider adding one if you want to invest the maintenance time.
6. Lookup: which Wave fixed which leak class
| FD class | Wave / PR | Root mechanism |
|---|---|---|
| `eventpoll` | Wave 6 #3855 + Wave 7 #4047 | ChatOllama `_async_client` not closed (Wave 6) → also not closed when called inside a running loop (Wave 7). |
| `pidfd` from healthcheck | Wave 7 #4047 | `urlopen` no `timeout=` → child hangs → reparented to PID 1 with pidfd held. |
| `pidfd` from Playwright fallback | Round 8 / #3971 | Production image lacks Chromium binary; Playwright invocation opens pidfd then fails. |
| WAL/SHM accumulation | Wave 5 / ADR-0004 | SQLCipher+WAL leaks handles on out-of-order close; periodic `engine.dispose()` resets the pool. |
| Per-thread engine FDs | Wave 5 #3441 | Removed per-thread NullPool engines entirely; shared per-user QueuePool. |
| HTTP session sockets | Wave 1 / Wave 3 | SafeSession / BaseDownloader close-in-finally discipline. |
| `asyncio.new_event_loop` | Wave 4 #3018 | Replaced manual loop creation with `asyncio.run()` in `news_strategy.py`. |
Use this table to skip the rediscovery step the next time a specific FD type dominates a snapshot.
Intentionally not done (deferred)
These showed up during planning and were deliberately not done. If they get rediscovered as "missing work" by future contributors, please reference this section first.
- `weakref.finalize` defense-in-depth on the LLM wrappers. Designed and verified safe (no `__del__` conflicts, `__getattr__` doesn't intercept `_finalizer`, no reference cycles). Deferred until a fourth wave of "missed close" leaks justifies adding a new pattern that future contributors must understand. Current explicit-close discipline has held since #2712 / #2732 / #3018.
- LLM caching in `get_llm()`. Bounding total `ChatOllama` instances to N=distinct configs would make leak shapes architecturally impossible. Orthogonal optimization, deferred: it adds complexity around settings invalidation and multi-tenant isolation.
- Pre-commit hook flagging `get_llm()` callers without `close()`. Useful in principle, deferred because of high false-positive risk (caller-passed LLMs, lazy-init holders, factory-returned LLMs all legitimately don't close). Needs a careful design.
- Dedicated `/api/v1/health/fd` diagnostic with eventpoll-inode dedupe. PR #3033 stalled at a basic version (Windows + RLIM_INFINITY bugs); PR #3036 added `utilities/fd_monitor.py` for cross-platform FD reading. A type/inode-breakdown extension is feasible but deferred until an active leak hunt actually needs it.
- `app_logs` (`ResearchLog`) retention setting + scheduled cleanup job. Identified in Round 9; the only audit finding that wasn't refuted but also isn't impactful enough today. Trigger to do this work: a user reports the SQLCipher DB growing >100 MB and complains about query slowdown, OR a self-hosted instance keeping research logs for >1 year sees DB bloat, OR the metrics dashboard starts noting research-detail page load slowdown traced to `app_logs` joins. Implementation sketch: add `logs.research_log_retention_days` to `defaults/default_settings.json` (default `0` = disabled, preserves current behavior; e.g. `30` to keep last 30 days). Extend the existing `BackgroundJobScheduler` in `scheduler/background.py` (which already runs `cleanup_inactive_users` hourly and `_reload_config` every 30 min) with a daily `_cleanup_old_research_logs` job that deletes `ResearchLog` rows older than the retention window. Skip rows belonging to favorited / starred researches if a flag exists. ~30 LOC + a regression test that inserts old rows, triggers the job, asserts old rows are deleted and recent ones survive. Add `changelog.d/<id>.feature.md`.
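The core of the deferred `app_logs` cleanup job, reduced to its delete-by-cutoff logic, might look like the sketch below. It uses the standard-library `sqlite3` module instead of the project's SQLAlchemy/SQLCipher stack, and the `timestamp` column name and the `cleanup_old_research_logs` function are assumptions made for illustration; timestamps are assumed to be stored as UTC ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def cleanup_old_research_logs(
    conn: sqlite3.Connection,
    retention_days: int,
    now: Optional[datetime] = None,
) -> int:
    """Delete app_logs rows older than the retention window; return rows deleted."""
    if retention_days <= 0:
        # 0 means "disabled", which preserves the current keep-everything behavior.
        return 0
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("DELETE FROM app_logs WHERE timestamp < ?", (cutoff,))
    return cursor.rowcount
```

The regression test described above then reduces to three steps: insert one row older than the window and one recent row, call the function, and check that only the recent row remains and that `retention_days=0` deletes nothing.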
Glossary
- `_owns_llm`: instance flag set in `__init__` to `True` when the class fetched its own LLM via `get_llm()`, `False` when an LLM was injected by the caller. Gates whether `close()` actually closes the LLM.
- `safe_close(resource, name)`: helper in `utilities/resource_utils.py` that calls `resource.close()` inside a try/except, logging on failure. Never raises. Used in every `finally` block.
- `_ldr_closed`: sentinel attribute set on inner httpx clients by `_close_base_llm` to make the function idempotent. Checked with `is True` (not truthy) so Mock objects without a `spec` don't trip the guard.
- eventpoll FD: Linux `a_inode` file descriptor type for `epoll_create`'d kernel objects. Each asyncio event loop registers one. Leaked AsyncClients hold them via the loop's selector.
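The `safe_close` and `_ldr_closed` entries describe two small patterns that are easier to see in code. The sketch below is an illustration under assumptions, not the implementation in `utilities/resource_utils.py`: the body of `safe_close` is reconstructed from the glossary description, and `close_once` is a hypothetical stand-in for the idempotency guard inside `_close_base_llm`.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def safe_close(resource: Any, name: str) -> None:
    """Close resource if present; log instead of raising when close() fails."""
    if resource is None:
        return
    try:
        resource.close()
    except Exception:
        logger.warning("Error closing %s", name, exc_info=True)


def close_once(client: Any) -> bool:
    """Close client unless it is already marked closed; return True if close() ran."""
    # 'is True' rather than truthiness: a bare Mock returns a truthy child Mock
    # for any attribute, which would make a truthy check skip the close.
    if getattr(client, "_ldr_closed", False) is True:
        return False
    safe_close(client, "client")
    client._ldr_closed = True
    return True
```

With a `unittest.mock.Mock()` as the client, the first `close_once` call runs `close()` and the second is a no-op, whereas a truthy check on `_ldr_closed` would have skipped even the first call.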