local-deep-research

mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

LearningCircuit 6f18a711d2 docs(resource-cleanup): expand Wave 7 with full audit ledger (#4054)

* docs(resource-cleanup): expand Wave 7 with full audit ledger

Replaces the brief "follow-up gaps" bullet list with the full ledger
of what the broader audit during #4047 actually examined, split into
four scannable subsections:

- Checked and confirmed clean: non-Ollama LLM providers, HTTP session
  lifecycle, subprocess/pidfd, asyncio loops, file handles, SocketIO
  connect/disconnect.
- Flagged then verified NOT a real FD leak: OllamaEmbeddings (uses
  the deprecated langchain_community class with no httpx client),
  auth_db + journal_quality engines escaping shutdown_databases
  (bounded pools, not growing), LibraryRAGService in three RAG SSE
  endpoints (RAM churn, no FDs — FAISS uses pickle.load, embeddings
  hold no FDs per the item above, SentenceTransformer mmaps are
  process-wide singletons).
- Minor findings: daemon threads without explicit shutdown,
  abandoned-research cleanup on socket disconnect — both reaped at
  process exit, not steady-state leaks.
- Future-proofing note: ``langchain_community.embeddings.OllamaEmbeddings``
  is deprecated; the replacement ``langchain_ollama.OllamaEmbeddings``
  DOES carry ``_client`` and ``_async_client`` (verified by direct
  introspection). When LDR migrates to it, the in-running-loop
  eventpoll leak class will reappear for embeddings unless
  ``_close_base_llm`` is generalized to cover embeddings as well.

Direct introspection done at audit time confirms each verdict:
``[a for a in dir(e) if 'client' in a.lower()]`` returned ``[]`` for
the deprecated class and a non-empty list for the new class. This
ledger saves the next contributor from re-running the same agent
sweep when investigating a future FD spike.
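
The introspection check above can be repeated on any object. The sketch
below wraps the same list comprehension in a helper and runs it on two
stand-in classes. ``client_like_attrs``, ``LegacyEmbeddings`` and
``NewEmbeddings`` are hypothetical names used only for illustration; they
are not the real LangChain classes, which would need to be installed and
instantiated to reproduce the audit result.

```python
def client_like_attrs(obj):
    """Return attribute names that look like HTTP client handles."""
    return [a for a in dir(obj) if "client" in a.lower()]


class LegacyEmbeddings:
    """Hypothetical stand-in for a class that holds no client object."""

    base_url = "http://localhost:11434"


class NewEmbeddings:
    """Hypothetical stand-in for a class that owns sync and async clients."""

    def __init__(self):
        self._client = object()
        self._async_client = object()


if __name__ == "__main__":
    print(client_like_attrs(LegacyEmbeddings()))
    print(client_like_attrs(NewEmbeddings()))
```

An empty list suggests there is no client to close; a non-empty list marks
attributes worth checking for an explicit ``close`` path.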

No code changes.

* docs(resource-cleanup): add Round-8 pidfd finding (fixed by #3971)

The Wave 7 ledger covered the eventpoll-FD investigation but didn't
mention the residual pidfd accumulation we discovered post-merge. A
follow-up Round-8 investigation (8 parallel agents, 2 rounds + direct
/proc inspection on a live prerelease container) traced ~3.6
pidfds/hour, steady-state ~29, to:

  _check_subscription → quick_summary
    → FullSearchResults.batch_fetch_and_extract
    → AutoHTMLDownloader fallback
    → PlaywrightHTMLDownloader._fetch_with_playwright
    → sync_playwright().start()
    → asyncio.create_subprocess_exec(node-driver)  # opens pidfd
    → driver fails (Chromium not installed in production ldr stage)
    → pidfd not closed on the failed-child exit

CPython 3.14 ruled out as a confounder: subprocess.py uses
waitpid(WNOHANG) polling, never opens pidfds. Only
asyncio.create_subprocess_* and multiprocessing.Process can open them
on Linux + Python 3.9+ via PidfdChildWatcher.

PR #3971 (already merged) addresses this from a different angle: it
makes web.enable_javascript_rendering default false, so
AutoHTMLDownloader short-circuits before invoking Playwright. No
subprocess spawned → no pidfd opened. Original motivation for #3971
was the confusing tracebacks reported in #3826; the FD-leak finding
is the second motivation, captured here so a future reader sees both.

The new bullet sits in Section B (flagged-then-verified-then-fixed)
because the leak was real but is now resolved upstream.

* docs(resource-cleanup): add FD-leak debugging playbook + CI considerations

Add a new "Debugging FD leaks — playbook for the next one" section
between the History (Waves 1-7) and "Intentionally not done" parts of
the doc, capturing the diagnostic flow we developed across Waves 6
and 7 so future contributors don't re-derive it from scratch.

Includes:

- Symptoms that justify treating an issue as an FD leak (OSError 24,
  static-asset MIME errors, High FD count warnings, healthcheck
  hangs).
- Host-side and inside-container snapshot scripts that work even when
  the container is too FD-starved for docker exec (host-side via
  sudo + /proc/$P/fd) and through the entrypoint's UID drop
  (--user 0 to docker exec).
- Lookup table mapping each anon_inode / socket / pipe / REG flavor
  to its likely Python-level source and the path to deep-dive (e.g.
  /proc/PID/fdinfo/N's Pid: line for pidfds).
- A pinpointing recipe per FD type — eventpoll (asyncio/httpx),
  pidfd (asyncio.create_subprocess / multiprocessing.Process),
  WAL/SHM (SQLCipher engine.dispose).
- Pointer to the existing in-codebase instrumentation: _count_open_fds,
  the periodic Resource monitor log, fd_monitor.py, and the
  RUN_MANUAL_SMOKE-gated tests/manual_smoke/test_fd_smoke.py harness.
- Honest discussion of why an automated per-PR FD-growth assertion is
  hard (transient FDs, CI-environment subprocess noise, namespace
  differences, slow-drip leaks needing hours of uptime) and what a
  nightly long-run job would look like if the team chooses to invest
  in one.
- A "which Wave fixed which leak class" reference table so the next
  reporter can recognize a class and skip to the relevant precedent.
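
As an illustration of the lookup-table idea, here is a minimal sketch that
buckets the symlink targets under /proc/PID/fd by FD flavor. The function
names are hypothetical and this is not the playbook's own script. It
assumes Linux and the usual ``anon_inode:[eventpoll]`` target format; the
``pidfd:`` prefix branch is an assumption about how newer kernels may
report pidfds.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/PID/fd symlink target to a coarse FD flavor."""
    if target.startswith("anon_inode:"):
        # e.g. "anon_inode:[eventpoll]" -> "eventpoll"
        return target[len("anon_inode:"):].strip("[]")
    if target.startswith("pidfd:"):
        # Assumption: some newer kernels expose pidfds under this prefix.
        return "pidfd"
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    return "file"  # regular files, devices, directories


def snapshot_fds(pid="self") -> Counter:
    """Count open FDs of a process by flavor (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
    return counts


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    print(snapshot_fds().most_common())
```

Taking two snapshots some minutes apart and subtracting the counters shows
which flavor is growing, which is the input to the per-FD-type recipes.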

No code changes. Pure documentation.

* docs(resource-cleanup): add development-time detection + bpftrace recipes

Extend the FD-leak debugging playbook with two industry-standard
techniques that would have caught Waves 6 and 7 earlier, drawn from
upstream Python docs and the wider production-tracing literature:

1. **bpftrace syscall-level pinpointing** (in the per-FD-type
   section). Trace pidfd_open / epoll_create1 / etc. on the host
   targeting the container's host PID; produces a histogram of every
   user stack that triggered the syscall, ranked by frequency. The
   hot stacks are the culprits. Would have caught the Playwright
   pidfd leak in seconds.

2. **Development-time detection (new subsection 4a)** — catches
   leaks at test time before they ship:
   - PYTHONASYNCIODEBUG=1 + -W default::ResourceWarning. Per the
     asyncio dev docs, unclosed transports emit ResourceWarning at GC
     time; the filter actually displays them. Would have surfaced
     the Wave 7 in-running-loop skip in any test that exercised
     ainvoke + safe_close on ChatOllama.
   - python -X dev for a one-flag local dev mode bundling
     ResourceWarning + asyncio debug + warnings as default.
   - pyproject.toml [tool.pytest.ini_options] examples for both
     "display" and "error" filter modes (with a caveat that error
     mode needs a targeted subset, not the whole suite, because
     third-party libs also emit ResourceWarning).
   - psutil's num_fds / open_files / connections as the
     cross-platform alternative to /proc/self/fd for unit tests on
     macOS dev environments.
   - tracemalloc + objgraph as the next-level tool when a leak is
     reproducible — diff allocations before/after, then render the
     reference chain holding the leaked wrapper alive.
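
To illustrate the ResourceWarning mechanism, here is a minimal sketch
using only the standard library. It leaks a plain file object rather than
an asyncio transport, so it shows the warning machinery and not the Wave 7
case itself. ``collect_resource_warnings`` and ``leak_a_file`` are
hypothetical helper names.

```python
import gc
import os
import tempfile
import warnings


def collect_resource_warnings(action):
    """Run action() and return the ResourceWarnings emitted during it."""
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default; enable it explicitly.
        warnings.simplefilter("always", ResourceWarning)
        action()
        gc.collect()  # make sure unreachable objects are finalized
    return [w for w in caught if issubclass(w.category, ResourceWarning)]


def leak_a_file():
    """Open a temp file and drop the last reference without closing it."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        handle = open(path)
        del handle  # finalizer runs here and warns about the unclosed file
    finally:
        os.remove(path)


if __name__ == "__main__":
    for w in collect_resource_warnings(leak_a_file):
        print(w.category.__name__, w.message)
```

In a test suite the same effect comes from the warning filter
configuration rather than a helper; the helper form is only convenient
for targeted assertions on a single code path.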

No code changes. The new tooling is recommendations only; no
mandatory pytest config change in this commit. Future work could
enable PYTHONASYNCIODEBUG=1 in the CI test environment if the
overhead is acceptable.

Citations to docs.python.org are inline for the load-bearing
ResourceWarning claim.

* test(fd-canary): pin asyncio.create_subprocess pidfd lifecycle in CI

Add ``TestAsyncioSubprocessFDBaseline`` to
``tests/utilities/test_close_base_llm.py`` with two regression tests
that run on every PR:

1. ``test_no_fd_growth_across_asyncio_subprocess_cycles`` — spawns
   ``/bin/true`` via ``asyncio.create_subprocess_exec`` 10 times and
   asserts total FD count delta ≤ +2. Pins the pidfd FD class against
   the child-watcher leak shape.

2. ``test_no_fd_growth_when_subprocess_fails_to_exec`` — same shape
   but with a deliberately-missing binary, mirroring the *exact*
   Wave-7 production failure mode (Playwright's Node.js driver being
   spawned, kernel returning ENOENT because Chromium wasn't
   installed, child watcher still expected to clean up the pidfd it
   opened *before* the failed exec).

Why this is the right level
---------------------------
LDR's own code does NOT call ``asyncio.create_subprocess_*`` (verified
in R8C1). The production leak came from a transitive dependency
(Playwright). So we cannot test LDR's call sites directly — there are
none. Instead these tests pin the *platform baseline*: on this Python
version, repeated asyncio subprocess cycles must not leak FDs. If a
future Python upgrade, a child-watcher change, or a new direct
asyncio.create_subprocess call in LDR breaks the close semantics, the
next PR's CI fails on these tests — which is the canary signal we
want.

Linux-only via ``sys.platform != "linux"`` skip. pidfd_open is a
Linux syscall; macOS uses a different watcher and Windows uses
ProactorEventLoop. Both 'pass by virtue of nothing to leak', so
restricting to Linux keeps the signal sharp (a failure on Linux is
actionable; a pass on macOS is uninformative).

Same +2 FD slack we use for the eventpoll canary above. A real
1-FD-per-iter leak across 10 iterations would land at delta=10,
well past the threshold.
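
The shape of such a canary can be sketched as follows. This is an
illustration under stated assumptions, not the code in
``tests/utilities/test_close_base_llm.py``: the helper names are
hypothetical, the success case spawns the current Python interpreter
instead of ``/bin/true`` for portability, and the missing-binary path is
made up.

```python
import asyncio
import os
import sys


def count_open_fds() -> int:
    """Count this process's open FDs (Linux /proc, else /dev/fd)."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


async def _spawn_cycles(argv, cycles):
    """Spawn argv repeatedly, waiting for each child to exit."""
    for _ in range(cycles):
        try:
            proc = await asyncio.create_subprocess_exec(*argv)
        except FileNotFoundError:
            continue  # missing binary: nothing to wait on
        await proc.wait()


def fd_delta_after_cycles(argv, cycles=10) -> int:
    """Return FD count growth across a full event-loop lifetime."""
    before = count_open_fds()
    asyncio.run(_spawn_cycles(argv, cycles))
    return count_open_fds() - before


if __name__ == "__main__":
    ok_argv = [sys.executable, "-c", "pass"]
    missing_argv = ["/nonexistent/ldr-sketch-binary"]  # hypothetical path
    assert fd_delta_after_cycles(ok_argv) <= 2
    assert fd_delta_after_cycles(missing_argv) <= 2
```

Measuring before and after ``asyncio.run`` means the loop's own selector
and wakeup FDs are already closed by the second count, so any remaining
growth is attributable to the subprocess cycles.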

Doc reference
-------------
Updated ``docs/developing/resource-cleanup.md`` "Existing
instrumentation" section to enumerate all four in-CI FD-growth
canaries (two eventpoll, two pidfd) so future contributors see at a
glance what's already guarded and where to extend coverage when a
new leak class is found.

2026-05-16 20:01:04 +02:00