mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
Three small CI-resilience fixes pulled out of yesterday's hairy release recovery (run #2524). All three target the kind of failure that *should* have been a one-line retry but instead burned a full release cycle.

1. ``pre-commit.yml``: wrap ``pre-commit run --all-files`` in ``nick-fields/retry@v4.0.0`` (2 attempts). The motivating failure was ``Building wheel for shellcheck_py`` hitting HTTP 502 fetching the shellcheck binary during a hook environment install. Pre-commit downloads hook envs from many external sources (PyPI, GitHub releases, npm registry), and a single 5xx anywhere fails the whole job. The second attempt benefits from the partial cache that the first attempt populated and almost always succeeds.

Implementation note: this had to replicate ``pre-commit/action@v3.0.1`` inline (set-PY-for-cache-key → cache ``~/.cache/pre-commit`` → ``pip install pre-commit`` → ``pre-commit run``) because the upstream action exposes no retry knob and ``nick-fields/retry`` can only wrap shell commands, not ``uses:`` action invocations. Cache key matches what ``pre-commit/action`` would compute, so existing cache entries keep working. Two attempts only: a hook env that fails to install twice in a row is not a transient outage.

2. ``playwright-webkit-tests.yml``: wrap the two ``npx playwright install chromium webkit --with-deps`` invocations (one per safari job) in the same retry wrapper. Each invocation downloads ~150 MB of browser binaries from the playwright CDN; a single CDN hiccup mid-download has failed the whole release pipeline before. Split out ``npm ci`` to a separate step so its (different) cache is unaffected.

3. ``test_library_rag_service_coverage.py``: fix the thread race in ``TestMergeAndPersistLocked.test_concurrent_writers_both_chunks_survive`` that produced the spurious "Read-modify-write race regressed" failure in run #2524. Each worker thread was applying ``patch.object(mod.FAISS, "load_local", ...)`` inside its own ``with`` block.
``patch.object`` rewrites a module attribute and is not thread-safe: when the first worker's block exits, it restores whatever it captured as "the original" at entry, which can be the second worker's lambda OR the real ``load_local``, depending on which thread entered patch first. If the real ``load_local`` becomes active mid-test, the failure cascades: the second worker's reload step raises on the empty ``shared.faiss`` (only touched, not a real index), the production code's ``except`` branch falls back to the caller's stale MagicMock, the worker's ``save_local`` is a no-op (setup MagicMock has no side effect), and the worker's chunk never reaches ``disk_state``. The assertion then blames the production lock, even though the lock itself is working correctly.

Moving the patch out to wrap both threads removes the race entirely: the patched value is the only value visible while either worker runs. Verified locally over 10 consecutive runs (previously flaky enough to fail once in ~3 runs).
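
For illustration only, a minimal sketch of the "patch once, outside the threads" shape described in item 3. This is not the actual test code: ``FakeIndexStore``, ``worker`` and ``run_workers_with_shared_patch`` are hypothetical names standing in for ``mod.FAISS`` and the real worker bodies, and the stub return value is made up.

```python
import threading
from unittest.mock import patch


class FakeIndexStore:
    """Hypothetical stand-in for the class being patched in the real test."""

    @staticmethod
    def load_local(path):
        # Mimics the real loader failing on a placeholder (touched-only) file.
        raise RuntimeError("real loader called on a placeholder file")


def worker(results, name):
    # Workers only *use* the loader; they never apply a patch themselves.
    results[name] = FakeIndexStore.load_local("shared.faiss")


def run_workers_with_shared_patch():
    results = {}
    # Patch exactly once, from the main thread, around the whole thread
    # lifetime, so there is a single save/restore of the attribute.
    with patch.object(FakeIndexStore, "load_local", lambda path: "stub-index"):
        threads = [
            threading.Thread(target=worker, args=(results, f"w{i}"))
            for i in range(2)
        ]
        for t in threads:
            t.start()
        for t in threads:
            # Join inside the block so the patch outlives both workers.
            t.join()
    return results
```

Because the only enter/exit pair runs in the main thread, neither worker can restore the real loader while the other is still running, which is the interleaving the per-thread ``with`` blocks allowed.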