mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-16 12:02:34 +03:00
* docs(resource-cleanup): Round 9 audit results + conditional deferred fixes
Capture the Round 9 broader-resource-audit results so:
- Future contributors don't re-audit the same paths
- If a relevant production symptom ever appears, the doc points
directly at a pre-thought-through conditional fix
Round 9 ran two passes of three parallel agents looking for resource
leaks BEYOND FDs (memory/cache growth, thread/lock lifecycle, DB state
hygiene). Round 1 produced six HIGH-confidence findings; Round 2
verification refuted four of them and downgraded one.
Added to the audit ledger (next to the existing Wave 7 entry):
- Refuted findings with WHY they were refuted:
- @cache on get_available_providers (called with None — hashable,
cardinality 1; dicts would raise TypeError, not cache silently)
- ThreadLocalSession identity-map growth (expire_on_commit=True
default clears the map on every commit)
- token_usage table unbounded growth (design-intentional permanent
audit table; time-series compound indexes; /api/context-overflow
queries historical windows)
- search_calls table unbounded growth (same shape and verdict)
- Three per-user lock dicts (_user_init_locks, _user_locks,
_user_critical_locks): technically correct that they never clean
up on user delete, but ~296 bytes per user × 3 dicts = ~900 KB
ceiling at 1000 users. Practically negligible.
- app_logs (ResearchLog) table — the one finding that survived
verification as a real but small concern. No auto-retention; only
cleaned by cascade-delete when parent Research row is manually
removed. For users keeping all research, logs accumulate.
Added to "Intentionally not done (deferred)":
- app_logs retention setting + scheduled cleanup job. Includes the
trigger conditions that would justify the work and the
implementation sketch (settings key, daily APScheduler job,
regression test, news fragment).
- Per-user lock dict cleanup on user delete. Cosmetic; included with
trigger conditions and one-line-per-file sketch so it's actionable
if multi-user deployments ever see it.
No code changes. Documentation only.
* fix(resource-cleanup): pop per-user lock-dict entries on user close
Three module-level per-user lock dicts had no removal hook, so each
accumulated one ``threading.Lock`` entry per username over the
process lifetime:
- ``_user_init_locks`` in ``database/library_init.py`` (serializes
collection-init check-then-insert)
- ``_user_locks`` in ``database/backup/backup_service.py`` (per-user
backup serialization)
- ``_user_critical_locks`` on ``QueueProcessorV2`` (per-user
count-then-start critical section)
The ceiling was ~296 bytes/entry × 3 dicts ≈ ~900 bytes per user
across all three — bounded by total user count, microscopic relative
to the eventpoll FD leak that motivated the original investigation,
but real for long-lived multi-user instances with user-account
churn. Identified in Round 9 of the broader resource-leak audit
(see docs/developing/resource-cleanup.md).
The fix:
- Each module now exposes a ``pop_user_*_lock(username)`` function
(or method for the QueueProcessor instance dict) that pops the
entry under the existing per-dict lock.
- ``connection_cleanup._pop_per_user_locks(username)`` is a shared
helper that lazy-imports and calls all three with individual
try/except blocks so a failure in one doesn't skip the others.
- The helper is invoked from both user-close paths:
- ``cleanup_idle_connections`` (the 5-minute sweeper) after
``db_manager.close_user_database``
- ``web/auth/routes.py`` after the logout and password-change
``close_user_database`` calls
Pattern mirrors the existing per-user cleanup in those code paths
(scheduler.unregister_user, session_password_store.clear_*).
Tests in ``tests/web/auth/test_connection_cleanup.py::TestPopPerUserLocks``:
- Unit test: populate all three dicts, call the helper, assert all
three entries are gone.
- Idempotency: pop on a never-registered user must not raise.
- Integration: ``cleanup_idle_connections`` actually invokes the
helper for each user it closes (verified via the library-init dict
for "alice").
Doc updated: the entry that R9A2 identified as "technically correct,
practically negligible" is moved from the audit ledger's findings
list into a "Fixed in this PR" subsection; the matching
deferred-fix entry in "Intentionally not done" is removed.
Adds a towncrier bugfix fragment.
* review: address self-review findings on PR #4077
Four fixes from a multi-round agent review of this PR:
1. Move ``_pop_per_user_locks`` outside the ``close_user_database``
try/except in ``connection_cleanup.py``. Previously the pop was
inside the same try block, so a DB-close failure — the path that
``test_close_failure_does_not_abort_loop`` already exercises —
skipped the pop and leaked the lock-dict entry. New test
``test_pop_runs_even_when_close_user_database_fails`` pins this.
2. Bump three ``logger.debug`` to ``logger.warning`` in
``_pop_per_user_locks``. Matches the sibling scheduler-unregister
error handler in the same module; debug-level silently masked
lock-dict accumulation across cycles.
3. Doc accuracy fix in ``docs/developing/resource-cleanup.md``.
The entry called all three dicts "module-level" but
``_user_critical_locks`` is an instance attribute on
``QueueProcessorV2``. Rewrote to distinguish module-level dicts
from the instance attribute and to note the pop now runs outside
the close try/except.
4. Integration test pre-populates all three lock-dict entries and
asserts all three are absent post-cleanup, not just
``_user_init_locks``. Switched the test username from "alice"
(used by other tests in the module) to a dedicated sentinel.
Tests: 23/23 in ``tests/web/auth/test_connection_cleanup.py``.
Race-condition concerns flagged in Round 1 (TOCTOU between pop and
``_get_user_*_lock``) were verified in Round 2 to be either guarded
by Python's reference-counted lock semantics (``with lock:`` keeps
the original lock object alive after dict pop) or bounded to a +1
race window across multiple browser sessions — not ship-blocking.
The narrow ``library_init`` race surfaces as a propagated
``IntegrityError`` on collection insert, not silent corruption.