Files
local-deep-research/docs/developing
LearningCircuit e79a9fb76a docs(resource-cleanup) + fix: Round 9 audit results + per-user lock-dict cleanup (#4077)
* docs(resource-cleanup): Round 9 audit results + conditional deferred fixes

Capture the Round 9 broader-resource-audit results so:
- Future contributors don't re-audit the same paths
- If a relevant production symptom ever appears, the doc points
  directly at a pre-thought-through conditional fix

Round 9 ran two passes of three parallel agents looking for resource
leaks BEYOND FDs (memory/cache growth, thread/lock lifecycle, DB state
hygiene). Round 1 produced six HIGH-confidence findings; Round 2
verification refuted four of them and downgraded one.

Added to the audit ledger (next to the existing Wave 7 entry):

- Refuted findings with WHY they were refuted:
  - @cache on get_available_providers (called with None — hashable,
    cardinality 1; dicts would raise TypeError, not cache silently)
  - ThreadLocalSession identity-map growth (expire_on_commit=True
    default clears the map on every commit)
  - token_usage table unbounded growth (design-intentional permanent
    audit table; time-series compound indexes; /api/context-overflow
    queries historical windows)
  - search_calls table unbounded growth (same shape and verdict)

- Three per-user lock dicts (_user_init_locks, _user_locks,
  _user_critical_locks): technically correct that they never clean
  up on user delete, but ~296 bytes per user × 3 dicts = ~900 KB
  ceiling at 1000 users. Practically negligible.

- app_logs (ResearchLog) table — the one finding that survived
  verification as a real but small concern. No auto-retention; only
  cleaned by cascade-delete when parent Research row is manually
  removed. For users keeping all research, logs accumulate.

Added to "Intentionally not done (deferred)":

- app_logs retention setting + scheduled cleanup job. Includes the
  trigger conditions that would justify the work and the
  implementation sketch (settings key, daily APScheduler job,
  regression test, news fragment).

- Per-user lock dict cleanup on user delete. Cosmetic; included with
  trigger conditions and one-line-per-file sketch so it's actionable
  if multi-user deployments ever see it.

No code changes. Documentation only.

* fix(resource-cleanup): pop per-user lock-dict entries on user close

Three module-level per-user lock dicts had no removal hook, so each
accumulated one ``threading.Lock`` entry per username over the
process lifetime:

- ``_user_init_locks`` in ``database/library_init.py`` (serializes
  collection-init check-then-insert)
- ``_user_locks`` in ``database/backup/backup_service.py`` (per-user
  backup serialization)
- ``_user_critical_locks`` on ``QueueProcessorV2`` (per-user
  count-then-start critical section)

The ceiling was ~296 bytes/entry × 3 dicts ≈ ~900 bytes per user
across all three — bounded by total user count, microscopic relative
to the eventpoll FD leak that motivated the original investigation,
but real for long-lived multi-user instances with user-account
churn. Identified in Round 9 of the broader resource-leak audit
(see docs/developing/resource-cleanup.md).

The fix:

- Each module now exposes a ``pop_user_*_lock(username)`` function
  (or method for the QueueProcessor instance dict) that pops the
  entry under the existing per-dict lock.
- ``connection_cleanup._pop_per_user_locks(username)`` is a shared
  helper that lazy-imports and calls all three with individual
  try/except blocks so a failure in one doesn't skip the others.
- The helper is invoked from both user-close paths:
  - ``cleanup_idle_connections`` (the 5-minute sweeper) after
    ``db_manager.close_user_database``
  - ``web/auth/routes.py`` after the logout and password-change
    ``close_user_database`` calls

Pattern mirrors the existing per-user cleanup in those code paths
(scheduler.unregister_user, session_password_store.clear_*).

Tests in ``tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks``:

- Unit test: populate all three dicts, call the helper, assert all
  three entries are gone.
- Idempotency: pop on a never-registered user must not raise.
- Integration: ``cleanup_idle_connections`` actually invokes the
  helper for each user it closes (verified via the library-init dict
  for "alice").

Doc updated: the entry that R9A2 identified as "technically correct,
practically negligible" is moved from the audit ledger's findings
list into a "Fixed in this PR" subsection; the matching
deferred-fix entry in "Intentionally not done" is removed.

Adds a towncrier bugfix fragment.

* review: address self-review findings on PR #4077

Four fixes from a multi-round agent review of this PR:

1. Move ``_pop_per_user_locks`` outside the ``close_user_database``
   try/except in ``connection_cleanup.py``. Previously the pop was
   inside the same try block, so a DB-close failure — the path that
   ``test_close_failure_does_not_abort_loop`` already exercises —
   skipped the pop and leaked the lock-dict entry. New test
   ``test_pop_runs_even_when_close_user_database_fails`` pins this.

2. Bump three ``logger.debug`` to ``logger.warning`` in
   ``_pop_per_user_locks``. Matches the sibling scheduler-unregister
   error handler in the same module; debug-level silently masked
   lock-dict accumulation across cycles.

3. Doc accuracy fix in ``docs/developing/resource-cleanup.md``.
   The entry called all three dicts "module-level" but
   ``_user_critical_locks`` is an instance attribute on
   ``QueueProcessorV2``. Rewrote to distinguish module-level dicts
   from the instance attribute and to note the pop now runs outside
   the close try/except.

4. Integration test pre-populates all three lock-dict entries and
   asserts all three are absent post-cleanup, not just
   ``_user_init_locks``. Switched the test username from "alice"
   (used by other tests in the module) to a dedicated sentinel.

Tests: 23/23 in ``tests/web/auth/test_connection_cleanup.py``.

Race-condition concerns flagged in Round 1 (TOCTOU between pop and
``_get_user_*_lock``) were verified in Round 2 to be either guarded
by Python's reference-counted lock semantics (``with lock:`` keeps
the original lock object alive after dict pop) or bounded to a +1
race window across multiple browser sessions — not ship-blocking.
The narrow ``library_init`` race surfaces as a propagated
``IntegrityError`` on collection insert, not silent corruption.
2026-05-17 15:40:11 +02:00
..