mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
* docs(resource-cleanup): expand Wave 7 with full audit ledger

Replaces the brief "follow-up gaps" bullet list with the full ledger of what the broader audit during #4047 actually examined, split into four scannable subsections:

- Checked and confirmed clean: non-Ollama LLM providers, HTTP session lifecycle, subprocess/pidfd, asyncio loops, file handles, SocketIO connect/disconnect.
- Flagged then verified NOT a real FD leak: OllamaEmbeddings (uses the deprecated langchain_community class with no httpx client), auth_db + journal_quality engines escaping shutdown_databases (bounded pools, not growing), LibraryRAGService in three RAG SSE endpoints (RAM churn, no FDs: FAISS uses pickle.load, embeddings hold no FDs per the item above, SentenceTransformer mmaps are process-wide singletons).
- Minor findings: daemon threads without explicit shutdown, abandoned-research cleanup on socket disconnect. Both are reaped at process exit, not steady-state leaks.
- Future-proofing note: ``langchain_community.embeddings.OllamaEmbeddings`` is deprecated; the replacement ``langchain_ollama.OllamaEmbeddings`` DOES carry ``_client`` and ``_async_client`` (verified by direct introspection), so when LDR migrates the in-running-loop eventpoll leak class will reappear for embeddings unless ``_close_base_llm`` is generalized.

Direct introspection done at audit time confirms each verdict: ``[a for a in dir(e) if 'client' in a.lower()]`` returned ``[]`` for the deprecated class and a non-empty list for the new class.

This ledger saves the next contributor from re-running the same agent sweep when investigating a future FD spike. No code changes.

* docs(resource-cleanup): add Round-8 pidfd finding (fixed by #3971)

The Wave 7 ledger covered the eventpoll-FD investigation but didn't mention the residual pidfd accumulation we discovered post-merge.
A follow-up Round-8 investigation (8 parallel agents, 2 rounds + direct /proc inspection on a live prerelease container) traced ~3.6 pidfds/hour, steady-state ~29, to:

    _check_subscription
      → quick_summary
      → FullSearchResults.batch_fetch_and_extract
      → AutoHTMLDownloader fallback
      → PlaywrightHTMLDownloader._fetch_with_playwright
      → sync_playwright().start()
      → asyncio.create_subprocess_exec(node-driver)   # opens pidfd
      → driver fails (Chromium not installed in production ldr stage)
      → pidfd not closed on the failed-child exit

CPython 3.14 ruled out as a confounder: subprocess.py uses waitpid(WNOHANG) polling, never opens pidfds. Only asyncio.create_subprocess_* and multiprocessing.Process can open them on Linux + Python 3.9+ via PidfdChildWatcher.

PR #3971 (already merged) addresses this from a different angle: it makes web.enable_javascript_rendering default false, so AutoHTMLDownloader short-circuits before invoking Playwright. No subprocess spawned → no pidfd opened.

Original motivation for #3971 was the confusing tracebacks reported in #3826; the FD-leak finding is the second motivation, captured here so a future reader sees both.

The new bullet sits in Section B (flagged-then-verified-then-fixed) because the leak was real but is now resolved upstream.

* docs(resource-cleanup): add FD-leak debugging playbook + CI considerations

Add a new "Debugging FD leaks — playbook for the next one" section between the History (Waves 1-7) and "Intentionally not done" parts of the doc, capturing the diagnostic flow we developed across Waves 6 and 7 so future contributors don't re-derive it from scratch.

Includes:

- Symptoms that justify treating an issue as an FD leak (OSError 24, static-asset MIME errors, High FD count warnings, healthcheck hangs).
- Host-side and inside-container snapshot scripts that work even when the container is too FD-starved for docker exec (host-side via sudo + /proc/$P/fd) and through the entrypoint's UID drop (--user 0 to docker exec).
- Lookup table mapping each anon_inode / socket / pipe / REG flavor to its likely Python-level source and the path to deep-dive (e.g. /proc/PID/fdinfo/N's Pid: line for pidfds).
- A pinpointing recipe per FD type: eventpoll (asyncio/httpx), pidfd (asyncio.create_subprocess / multiprocessing.Process), WAL/SHM (SQLCipher engine.dispose).
- Pointer to the existing in-codebase instrumentation: _count_open_fds, the periodic Resource monitor log, fd_monitor.py, and the RUN_MANUAL_SMOKE-gated tests/manual_smoke/test_fd_smoke.py harness.
- Honest discussion of why an automated per-PR FD-growth assertion is hard (transient FDs, CI-environment subprocess noise, namespace differences, slow-drip leaks needing hours of uptime) and what a nightly long-run job would look like if the team chooses to invest in one.
- A "which Wave fixed which leak class" reference table so the next reporter can recognize a class and skip to the relevant precedent.

No code changes. Pure documentation.

* docs(resource-cleanup): add development-time detection + bpftrace recipes

Extend the FD-leak debugging playbook with two industry-standard techniques that would have caught Waves 6 and 7 earlier, drawn from upstream Python docs and the wider production-tracing literature:

1. **bpftrace syscall-level pinpointing** (in the per-FD-type section). Trace pidfd_open / epoll_create1 / etc. on the host targeting the container's host PID; produces a histogram of every user stack that triggered the syscall, ranked by frequency. The hot stacks are the culprits. Would have caught the Playwright pidfd leak in seconds.

2. **Development-time detection (new subsection 4a)**: catches leaks at test time before they ship:
   - PYTHONASYNCIODEBUG=1 + -W default::ResourceWarning. Per the asyncio dev docs, unclosed transports emit ResourceWarning at GC time; the filter actually displays them. Would have surfaced the Wave 7 in-running-loop skip in any test that exercised ainvoke + safe_close on ChatOllama.
   - python -X dev for a one-flag local dev mode bundling ResourceWarning + asyncio debug + warnings as default.
   - pyproject.toml [tool.pytest.ini_options] examples for both "display" and "error" filter modes (with a caveat that error mode needs a targeted subset, not the whole suite, because third-party libs also emit ResourceWarning).
   - psutil's num_fds / open_files / connections as the cross-platform alternative to /proc/self/fd for unit tests on macOS dev environments.
   - tracemalloc + objgraph as the next-level tool when a leak is reproducible: diff allocations before/after, then render the reference chain holding the leaked wrapper alive.

No code changes. The new tooling is recommendations only; no mandatory pytest config change in this commit. Future work could enable PYTHONASYNCIODEBUG=1 in the CI test environment if the overhead is acceptable.

Citations to docs.python.org are inline for the load-bearing ResourceWarning claim.

* test(fd-canary): pin asyncio.create_subprocess pidfd lifecycle in CI

Add ``TestAsyncioSubprocessFDBaseline`` to ``tests/utilities/test_close_base_llm.py`` with two regression tests that run on every PR:

1. ``test_no_fd_growth_across_asyncio_subprocess_cycles``: spawns ``/bin/true`` via ``asyncio.create_subprocess_exec`` 10 times and asserts total FD count delta ≤ +2. Pins the pidfd FD class against the child-watcher leak shape.

2. ``test_no_fd_growth_when_subprocess_fails_to_exec``: same shape but with a deliberately-missing binary, mirroring the *exact* Wave-7 production failure mode (Playwright's Node.js driver being spawned, kernel returning ENOENT because Chromium wasn't installed, child watcher still expected to clean up the pidfd it opened *before* the failed exec).

Why this is the right level
---------------------------

LDR's own code does NOT call ``asyncio.create_subprocess_*`` (verified in R8C1). The production leak came from a transitive dependency (Playwright). So we cannot test LDR's call sites directly: there are none.
Instead these tests pin the *platform baseline*: on this Python version, repeated asyncio subprocess cycles must not leak FDs. If a future Python upgrade, a child-watcher change, or a new direct asyncio.create_subprocess call in LDR breaks the close semantics, the next PR's CI fails on these tests, which is the canary signal we want.

Linux-only via ``sys.platform != "linux"`` skip. pidfd_open is a Linux syscall; macOS uses a different watcher and Windows uses ProactorEventLoop. Both 'pass by virtue of nothing to leak', so restricting to Linux keeps the signal sharp (a failure on Linux is actionable; a pass on macOS is uninformative).

Same +2 FD slack we use for the eventpoll canary above. A real 1-FD-per-iter leak across 10 iterations would land at delta=10, well past the threshold.

Doc reference
-------------

Updated ``docs/developing/resource-cleanup.md`` "Existing instrumentation" section to enumerate all four in-CI FD-growth canaries (two eventpoll, two pidfd) so future contributors see at a glance what's already guarded and where to extend coverage when a new leak class is found.
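
Illustrative sketch
-------------------

For readers who want the shape of the pidfd canary without opening the test file, the following is a minimal sketch of the measurement it describes. It is not the code in ``tests/utilities/test_close_base_llm.py``; the helper names ``count_open_fds``, ``spawn_and_wait`` and ``fd_delta_across_cycles`` are hypothetical, and the warm-up cycle is an assumption made here so that one-time event-loop setup is not counted as growth. Linux only, because it reads /proc/self/fd.

```python
import asyncio
import os


def count_open_fds() -> int:
    """Count this process's open file descriptors via /proc (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


async def spawn_and_wait(argv: list[str]) -> None:
    """Run one subprocess cycle; tolerate a binary that fails to exec."""
    try:
        proc = await asyncio.create_subprocess_exec(*argv)
    except FileNotFoundError:
        # Missing binary: nothing to wait on, but no FD may stay open either.
        return
    await proc.wait()


def fd_delta_across_cycles(argv: list[str], cycles: int = 10) -> int:
    """Return FD growth after `cycles` spawns, measured inside one event loop."""

    async def run() -> int:
        # Hypothetical warm-up so one-time loop setup is not counted as growth.
        await spawn_and_wait(argv)
        before = count_open_fds()
        for _ in range(cycles):
            await spawn_and_wait(argv)
        return count_open_fds() - before

    return asyncio.run(run())
```

With the +2 slack described above, a canary in this style would assert ``fd_delta_across_cycles(["/bin/true"]) <= 2`` for the happy path and the same bound for a deliberately missing binary path.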