local-deep-research

mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

LearningCircuit e1bc52904e ci+test: retry transient network installs, fix patch.object race (#4302)

Three small CI-resilience fixes pulled out of yesterday's hairy release
recovery (run #2524). All three target the kind of failure that *should*
have been a one-line retry but instead burned a full release cycle.

1. ``pre-commit.yml`` — wrap ``pre-commit run --all-files`` in
   ``nick-fields/retry@v4.0.0`` (2 attempts). The motivating failure
   was ``Building wheel for shellcheck_py`` hitting HTTP 502 fetching
   the shellcheck binary during a hook environment install. Pre-commit
   downloads hook envs from many external sources (PyPI, GitHub
   releases, npm registry), and a single 5xx anywhere fails the
   whole job. The second attempt benefits from the partial cache that
   the first attempt populated and almost always succeeds.

   Implementation note: this had to replicate ``pre-commit/action@v3.0.1``
   inline (set-PY-for-cache-key → cache ``~/.cache/pre-commit`` →
   ``pip install pre-commit`` → ``pre-commit run``) because the upstream
   action exposes no retry knob and ``nick-fields/retry`` can only wrap
   shell commands, not ``uses:`` action invocations. Cache key matches
   what ``pre-commit/action`` would compute, so existing cache entries
   keep working. Two attempts only — a hook env that fails to install
   twice in a row is not a transient outage.

2. ``playwright-webkit-tests.yml`` — wrap the two ``npx playwright
   install chromium webkit --with-deps`` invocations (one per safari
   job) in the same retry wrapper. Each invocation downloads ~150 MB of
   browser binaries from the Playwright CDN; a single CDN hiccup
   mid-download has failed the whole release pipeline before. ``npm ci``
   is split out into a separate step so its (different) cache is unaffected.

3. ``test_library_rag_service_coverage.py`` — fix the thread race in
   ``TestMergeAndPersistLocked.test_concurrent_writers_both_chunks_survive``
   that produced the spurious "Read-modify-write race regressed"
   failure in run #2524. Each worker thread was applying
   ``patch.object(mod.FAISS, "load_local", ...)`` inside its own
   ``with`` block. ``patch.object`` rewrites a module attribute and
   is not thread-safe: when the first worker's block exits, it
   restores whatever it captured as "the original" at entry — which
   can be the second worker's lambda OR the real ``load_local``,
   depending on which thread entered patch first. If the real
   ``load_local`` becomes active mid-test, the second worker's
   reload step raises on the empty ``shared.faiss`` (only touched,
   not a real index), the production code's ``except`` branch falls
   back to the caller's stale MagicMock, the worker's ``save_local``
   is a no-op (setup MagicMock has no side effect), and the worker's
   chunk never reaches ``disk_state``. The assertion then blames the
   production lock, even though the lock itself is working correctly.
   Moving the patch out to wrap both
   threads removes the race entirely — the patched value is the only
   value visible while either worker runs. Verified locally over 10
   consecutive runs (was flaky enough to fail once in ~3 runs before).
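
The restore-order hazard in item 3 can be reproduced deterministically without threads. The sketch below is illustrative only and is not the test from this commit: ``make_fake_faiss``, ``interleaved_patches`` and ``hoisted_patch`` are hypothetical names, and a ``SimpleNamespace`` stands in for ``mod.FAISS``. It replays one possible interleaving by starting and stopping two patchers by hand, then shows the hoisted form where a single patch wraps both workers.

```python
import threading
from types import SimpleNamespace
from unittest.mock import patch


def make_fake_faiss():
    # Hypothetical stand-in for mod.FAISS; only load_local matters here.
    return SimpleNamespace(load_local=lambda path: "real")


def interleaved_patches():
    """Replay one bad interleaving: A enters, B enters, A exits, B exits."""
    faiss = make_fake_faiss()
    patch_a = patch.object(faiss, "load_local", lambda path: "fake-A")
    patch_b = patch.object(faiss, "load_local", lambda path: "fake-B")

    patch_a.start()  # A captures the real load_local as "the original"
    patch_b.start()  # B captures A's lambda as "the original"
    patch_a.stop()   # A restores the real load_local while B is still "inside"
    seen_by_b = faiss.load_local("shared.faiss")

    patch_b.stop()   # B restores A's lambda, so a fake outlives both blocks
    leaked = faiss.load_local("shared.faiss")
    return seen_by_b, leaked


def hoisted_patch():
    """One patch around both workers: the attribute changes only twice."""
    faiss = make_fake_faiss()
    results = []

    def worker():
        results.append(faiss.load_local("shared.faiss"))

    with patch.object(faiss, "load_local", lambda path: "fake"):
        threads = [threading.Thread(target=worker) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results, faiss.load_local("shared.faiss")


if __name__ == "__main__":
    print(interleaved_patches())
    print(hoisted_patch())
```

In the interleaved case the second "worker" sees the real ``load_local`` while it believes its patch is active, which matches the failure path described above. In the hoisted case both workers see the patched value and the original is restored once, after both have joined.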

2026-05-24 18:16:20 +02:00