mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

LearningCircuit 1f0b0a4a95 ci(release): build-once-promote refactor for Docker pipeline (#3977)

* ci(release): build-once-promote refactor for Docker pipeline

Today the release pipeline builds the Docker image twice — once in
prerelease-docker.yml for "testing" and again in docker-publish.yml for
the actual release. The image you tested is not the image you ship: base
layer patches, transitive deps, and apt/pip resolution can diverge
between the two builds.

This refactor makes prerelease-docker.yml the canonical build and turns
docker-publish.yml into a thin retag step. `docker buildx imagetools
create` is a registry-side metadata operation that takes seconds and
preserves the manifest digest, so the released image is bit-identical
to the one tested. Cosign signatures, SBOM attestations, and SLSA
provenance are stored at sha256-<digest>.{sig,att} keyed by digest, so
signing once in prerelease covers the release tags transitively.
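
To make the digest-keyed naming concrete, here is a minimal Python sketch (not taken from the workflows) of how tag names of the `sha256-<digest>.{sig,att}` form quoted above are derived from an image digest. The helper name `cosign_artifact_tags` and the placeholder digest are illustrative assumptions.

```python
import re

# A registry digest is "sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def cosign_artifact_tags(digest, suffixes=("sig", "att")):
    """Return the digest-keyed tag names for the given image digest."""
    if not _DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    algorithm, hex_part = digest.split(":", 1)
    # ":" is not valid inside a tag name, so it becomes "-".
    return [f"{algorithm}-{hex_part}.{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    placeholder = "sha256:" + "ab" * 32  # placeholder digest, not a real image
    for tag in cosign_artifact_tags(placeholder):
        print(tag)
```

Because the retag keeps the digest, the derived names are the same whether the image is reached through a prerelease tag or a release tag, which is why signing once is enough.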

Pipeline shape changes:

- prerelease-docker.yml is now a reusable workflow (workflow_call) called
  from release.yml. It builds, scans (Trivy), signs (cosign), attests
  the SBOM (cosign attest --type spdxjson, replacing the deprecated
  cosign attach sbom), and emits SLSA provenance. The manifest_digest is
  exposed as a workflow output. The `prerelease` environment gates the
  first build job for human approval.
- docker-publish.yml shrinks from ~457 to ~250 lines. It receives source_tag
  and expected_digest in the dispatch payload, verifies the source digest
  before retag, retags via imagetools create, verifies the digest is
  preserved (defense against re-encoding), re-runs Trivy against the
  digest (catches CVE-DB updates between prerelease and promote),
  verifies the cosign signature transitivity, and runs the existing
  prerelease cleanup loop.
- release.yml adds prerelease-docker to create-release.needs and
  trigger-workflows.needs, so the GitHub Release and the publish dispatch
  only happen after the canonical Docker build completes. The dispatch
  payload now carries source_tag and expected_digest. A new
  cleanup-on-rejection job removes orphan prerelease tags and cosign
  artifacts when the release is rejected (without it, every rejection
  would leave dangling sha256-<digest>.{sig,att} on Docker Hub).
- README cosign verify example updated to the keyless invocation users
  actually need (identity regex pointing at prerelease-docker.yml,
  --certificate-oidc-issuer, --certificate-github-workflow-repository),
  plus the SBOM verify-attestation command.

Notable design decisions (verified across multiple subagent review
rounds):

- SLSA provenance entryPoint stays as release.yml (the top-level caller).
  Per the SLSA GHA buildtype v1 spec and the canonical
  slsa-github-generator behavior, reusable workflows are explicitly NOT
  entryPoints — pointing at prerelease-docker.yml would break verifier
  policies that allowlist trigger workflows.
- Cosign cert identity for verification matches Fulcio's SAN URI, which
  is built from job_workflow_ref — the CALLEE for reusable workflows. So
  the identity regex matches prerelease-docker.yml even though the build
  is invoked from release.yml. Hardened with escaped dots, refs/(heads|tags)/
  constraint, and --certificate-github-workflow-repository to defend
  against the reusable-workflow-identity-reuse class of attacks.
- cleanup-on-rejection uses an allowlist if (failure || cancelled), not
  a denylist (!= 'success'), to avoid firing on `skipped` (e.g. when
  release_exists short-circuits the run). It also fails loudly on 401/403
  from the Docker Hub API so a missing Delete scope on the PAT can't
  silently let orphans accumulate.
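
The identity-regex hardening described above can be illustrated with a short Python sketch. The exact regex in the README is not reproduced here, so both patterns below are assumptions written for comparison; the example URIs follow the SAN layout for GitHub Actions certificates (`https://github.com/<owner>/<repo>/<workflow path>@<ref>`). Python's `re` stands in for cosign's regex engine.

```python
import re

REPO = "LearningCircuit/local-deep-research"
WORKFLOW = ".github/workflows/prerelease-docker.yml"

# Hardened pattern (illustrative): literal dots via re.escape, anchored at
# both ends, and the ref restricted to branches or tags.
HARDENED = re.compile(
    "^https://github\\.com/"
    + re.escape(REPO) + "/" + re.escape(WORKFLOW)
    + "@refs/(heads|tags)/.+$"
)

# Naive pattern for comparison: unescaped dots and no constraint on the ref.
NAIVE = re.compile("^https://github.com/" + REPO + "/" + WORKFLOW + "@.*")


def identity_ok(san_uri, pattern=HARDENED):
    """Return True if the certificate SAN URI matches the pattern."""
    return pattern.match(san_uri) is not None


GOOD = f"https://github.com/{REPO}/{WORKFLOW}@refs/tags/v1.6.9"
LOOKALIKE = f"https://githubXcom/{REPO}/{WORKFLOW}@refs/tags/v1.6.9"
OTHER_REF = f"https://github.com/{REPO}/{WORKFLOW}@refs/pull/42/merge"

if __name__ == "__main__":
    for uri in (GOOD, LOOKALIKE, OTHER_REF):
        print(identity_ok(uri, NAIVE), identity_ok(uri, HARDENED), uri)
```

The naive pattern accepts all three URIs, while the hardened one accepts only the first. The `--certificate-github-workflow-repository` check is a separate certificate-extension match and is not modelled here.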

Supersedes #3969 (split-environment): the env split is preserved by the
new structure — prerelease env on the called workflow's first job,
release env on create-release/trigger-workflows.

Pre-merge checklist for the maintainer:
- Create the `prerelease` environment in GitHub Settings with the same
  required reviewers as `release`. Without it, the called workflow's
  approval gate auto-creates the env with no protection rules and
  silently approves the build.
- Verify DOCKER_USERNAME / DOCKER_PASSWORD remain repo-level secrets
  (they currently are). Environment-scoped secrets do not propagate
  across reusable workflow calls except via the called job's own
  environment.

* ci(release): fixes from multi-round subagent review

Round 1 surfaced 14 candidate findings; Round 2 verified 7 as real bugs and
refuted 4 as false positives. This commit applies the verified fixes.

CONFIRMED bugs fixed:

1. **Approval gate was per-job, not workflow-wide.** The previous
   `environment: prerelease` sat on `build-amd64` only, so `build-arm64`
   and `security-scan` still ran pre-approval (GitHub environments are
   job-scoped per docs + community/discussions/174381). Replaced with a
   sentinel `approval-gate` job that all three build jobs `needs:`. A single
   approval click still gates everything, but now actually blocks all
   parallel jobs.

2. **`cleanup-on-rejection` if-condition missed the prerelease-rejection
   path.** When prerelease-docker.result was `failure`, both create-release
   and trigger-workflows became `skipped` (their `if:` requires success),
   and the cleanup `if:` only fired on `failure`/`cancelled` of dependents.
   Added explicit `prerelease-docker.result == 'failure'` clause so the
   most common rejection path actually triggers cleanup.

3. **Trivy re-scan ran AFTER retag.** A failing scan would leave release
   tags `:1.6.9`, `:1.6`, `:latest` publicly published with no rollback.
   Reordered: scan source digest BEFORE retag. Content is bit-identical
   (same digest), so scanning the prerelease tag tests what would be
   promoted — but failure now leaves no public broken tags. Also moved
   cosign verify before retag for the same reason.

4. **Trivy only scanned linux/amd64 by default** against a manifest list
   digest (per Trivy docs + aquasecurity/trivy#7847). Replaced single scan
   with two explicit per-platform invocations
   (`--platform linux/amd64`, then `linux/arm64`) so arm64 layers are also
   gated by the freshness check.

5. **Trivy DB freshness wasn't guaranteed.** apt-installed Trivy may use a
   stale embedded DB. Added explicit `trivy image --download-db-only`
   before the scans so the CVE-DB freshness window the re-scan exists for
   is actually exercised.

6. **`cosign attest` re-runs accumulated attestation layers** (verified via
   cosign 2.x `mutate.go` `dedupeAndReplace`). Added `--replace` to both
   attest calls (SLSA provenance + SBOM). Sigstore spec allows multi-sig
   so `cosign sign` is left as-is.

7. **SLSA provenance values inherited from old code were misleading.**
   - `builder.id`: changed from `https://github.com/actions/runner` (the
     agent binary) to the workflow ref the build is actually defined in
     (per SLSA v0.2 spec — builder.id should be a verifiable trust root).
   - `completeness.{parameters,environment,materials}`: flipped from
     `true` to `false`. The predicate captures no workflow_call inputs,
     no environment, and the build does network I/O — claiming
     completeness was a public signed false statement.
   - `buildInvocationId`: now includes `${run_id}-${run_attempt}` so
     re-runs are distinguishable.

REFUTED (kept as-is, with confidence):

- `imagetools create` does NOT change the digest in this case. Buildx's
  Combine() in util/imagetools/create.go has an explicit short-circuit
  for single-source manifest-list inputs that returns the bytes
  byte-for-byte (no annotations + same registry required, both true here).
- Concurrent rejection digest collision is not a real concern — Docker
  builds in this pipeline are not bit-deterministic (apt, network, file
  timestamps, default provenance attestations all vary).
- The `prerelease-v1.6.9-*` cleanup pattern does NOT collide with
  `prerelease-v1.6.91-*` (trailing dash in the prefix disambiguates).
- Reusable-workflow approval prompts appear inline on the caller run
  page for single-level calls — not a UX regression.
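
The prefix argument in the third refuted item is easy to check with a small Python sketch. The tag suffixes below are made-up placeholders, since the real suffix format is not shown here, and `matching_prerelease_tags` is a hypothetical helper name.

```python
def matching_prerelease_tags(all_tags, version, trailing_dash=True):
    """Return the tags that a prefix match for this version would select."""
    prefix = f"prerelease-v{version}"
    if trailing_dash:
        # The dash terminates the version, so 1.6.9 cannot match 1.6.91.
        prefix += "-"
    return [tag for tag in all_tags if tag.startswith(prefix)]


TAGS = [
    "prerelease-v1.6.9-aaaaaaa",   # placeholder suffix
    "prerelease-v1.6.91-bbbbbbb",  # placeholder suffix
    "prerelease-v1.6.9-ccccccc",   # placeholder suffix
]

if __name__ == "__main__":
    print(matching_prerelease_tags(TAGS, "1.6.9"))
    print(matching_prerelease_tags(TAGS, "1.6.9", trailing_dash=False))
```

With the trailing dash only the two `1.6.9` tags are selected; without it the `1.6.91` tag would be swept up as well.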

* ci(release): revert most Round 2 review additions

Keep the build-once-promote refactor's structural shape but back out the
defensive additions from commit 68606b299:

- approval-gate sentinel job → revert to `environment: prerelease` on
  build-amd64 only
- SLSA builder.id, completeness flags, buildInvocationId → revert to
  inherited values from the previous docker-publish.yml
- `cosign attest --replace` → drop, accept default append behavior
- Pre-promote Trivy + multi-platform scans + db refresh + pre-promote
  cosign verify → revert to single post-promote scan and post-promote
  cosign verify
- cleanup-on-rejection if-condition → drop the
  `prerelease-docker.result == 'failure'` allowlist clause

Rationale: keep the change set minimal vs main. The defensive additions
were correct in isolation but expand scope of this PR.

* fix(ci): drop invalid --trivyignores flag from raw trivy CLI invocation

The Round 2 promote step used `--trivyignores .trivyignore`, which is the
INPUT name of the aquasecurity/trivy-action wrapper, not a flag of the
raw Trivy binary. The CLI accepts only `--ignorefile` (singular) and
auto-loads `.trivyignore` from cwd by default.

As written, every release run would hard-fail with `unknown flag:
--trivyignores` from cobra/pflag before any scanning occurred. Removing
the flag is sufficient because Trivy auto-loads the ignorefile from the
checkout root.

prerelease-docker.yml is unaffected: it uses the action wrapper with
`trivyignores: '.trivyignore'` as input, which IS correct usage for the
action layer (it translates to --ignorefile internally via
TRIVY_IGNOREFILE).

Sources:
- https://trivy.dev/latest/docs/references/configuration/cli/trivy_image/
- https://github.com/aquasecurity/trivy-action/blob/master/action.yaml

* ci(release): apply remaining bugs from multi-round review

After Round 4 verification confirmed several deferred findings, applying
the bug fixes the user explicitly requested:

1. Re-introduce the `approval-gate` sentinel job in prerelease-docker.yml.
   GitHub Actions environments are job-scoped, so without a gate sentinel
   `build-arm64` and `security-scan` would run pre-approval — pushing
   the `-arm64` per-arch tag and consuming Trivy minutes regardless of
   whether the maintainer approved or rejected the gate. Single approval
   click still gates everything via `needs: [approval-gate]`.

2. Fix the SLSA `builder.id` to use `${{ github.workflow_ref }}` instead
   of the inherited `https://github.com/actions/runner` agent identity.
   `workflow_ref` resolves to the canonical
   `<owner>/<repo>/.github/workflows/<file>.yml@<callee-ref>` format that
   matches slsa-github-generator's output and that verifier policies can
   pin against.

3. Flip SLSA `completeness.{parameters,environment,materials}` from
   `true` to `false`. The predicate captures no workflow_call inputs, no
   environment, and the build does network I/O — claiming completeness
   was a public signed false statement.

4. Add `${{ github.run_attempt }}` to the SLSA `buildInvocationId` so
   "Re-run failed jobs" attempts are distinguishable.

5. Expand `cleanup-on-rejection` `if:` to include
   `prerelease-docker.result == 'failure'` and `'cancelled'`. Without
   these clauses, the most common rejection path (env approval rejected
   for prerelease) leaves dependents `skipped`, which the existing
   allowlist doesn't match — orphan tags persist on Docker Hub forever.

6. Drop unused `packages: write` from both the called workflow and the
   caller's reusable-workflow block. Docker Hub auth uses
   DOCKER_PASSWORD, not GITHUB_TOKEN; `packages: write` only matters for
   ghcr.io which the project doesn't use.

7. Update `docs/CI_CD_INFRASTRUCTURE.md` Build & Deploy table to reflect
   the build-once-promote split.

8. Update `docs/RELEASE_GUIDE.md` "Automatic Publishing" section to
   describe both approval gates (`prerelease` and `release`).
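
Item 5 above (the `cleanup-on-rejection` condition) can be modelled in a few lines of Python. This is a sketch of the decision logic only, not the actual `if:` expression, which is not reproduced here. The scenario dictionaries are assumptions about what each path reports, including the guess that every watched job is `skipped` when `release_exists` short-circuits the run.

```python
CLEANUP_RESULTS = frozenset({"failure", "cancelled"})
WATCHED_JOBS = ("prerelease-docker", "create-release", "trigger-workflows")


def should_cleanup(results):
    """Allowlist: clean up only if a watched job failed or was cancelled."""
    return any(results.get(job) in CLEANUP_RESULTS for job in WATCHED_JOBS)


def denylist_would_cleanup(results):
    """Denylist shown for comparison: fires on anything that is not success."""
    return any(results.get(job) != "success" for job in WATCHED_JOBS)


SCENARIOS = {
    "all green": dict.fromkeys(WATCHED_JOBS, "success"),
    "release_exists short-circuit": dict.fromkeys(WATCHED_JOBS, "skipped"),
    "prerelease build failed": {
        "prerelease-docker": "failure",
        "create-release": "skipped",
        "trigger-workflows": "skipped",
    },
    "release approval rejected": {
        "prerelease-docker": "success",
        "create-release": "failure",
        "trigger-workflows": "skipped",
    },
}

if __name__ == "__main__":
    for name, results in SCENARIOS.items():
        print(name, should_cleanup(results), denylist_would_cleanup(results))
```

The allowlist stays quiet on the short-circuit path, where a denylist would fire, and it still triggers when only `prerelease-docker` fails and its dependents are `skipped`.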

* ci(release): R5/R6 review fixes — cosign pin, multi-arch SBOM, orphan SBOM

Round 5 (10 agents) and Round 6 (5 agents debunking) verified these
findings, all of which are now applied:

1. **Pin cosign to v2.6.0**. R6A2 verified that `sigstore/cosign-installer@v4.1.2`
   ships cosign v3.0.6 by default. cosign v3 enables `--new-bundle-format`
   ON BY DEFAULT, which changes the on-wire signature/attestation format.
   Signing and verifying inside the pipeline still work (both run v3),
   but downstream verifiers running the README cosign-verify recipe on v2
   would fail. Pinning all three cosign-installer steps to v2.6.0 keeps
   the legacy tag-based sigstore format until we deliberately migrate
   the entire ecosystem.

2. **Multi-arch SBOM via per-arch attestations**. R6A3 verified the claim
   (anchore/syft#1708, actions/attest-sbom#60): syft against a manifest
   list digest only scans the host platform's layers. The previous SBOM
   attestation against the manifest digest claimed to describe both
   amd64 + arm64 but actually only enumerated amd64. ARM64 consumers
   were verifying a misleading SBOM. Fix: iterate over manifest entries
   from `imagetools inspect --raw`, run `syft --platform <plat>` against
   each per-arch digest, and `cosign attest --replace --type spdxjson`
   each per-arch SBOM against the per-arch digest. ALSO keep a
   manifest-list-level SBOM (host arch only) so end-users running
   `cosign verify-attestation user/img:latest` don't get an empty result.

3. **Re-add `--replace` to cosign attest** (both SLSA and SPDX). R5A7's
   deeper analysis enumerated specific failure modes beyond cosmetic
   clutter: Kyverno `count: 1` policies, registry layer count caps,
   audit ambiguity (verify returns success on first matching layer),
   Rekor entry bloat. R3A5 already confirmed `--replace` is per-
   predicate-type, so SLSA and SPDX attestations don't disturb each
   other.

4. **Container-image SBOM no longer orphaned**. R6A4 verified: the
   Syft-produced container SBOMs were uploaded as artifact `sbom` from
   prerelease-docker.yml but never downloaded by `create-release` — they
   were invisible on the GitHub Release page. Fix: download the `sbom`
   artifact, rename to `sbom-container-*` to disambiguate from the
   filesystem `sbom-spdx.json`, and attach to `gh release create`.

5. **Narrow `secrets: inherit` to explicit secrets**. R5A3 flagged that
   `secrets: inherit` propagates ALL repo secrets (PAT_TOKEN,
   OPENROUTER_API_KEY, SERPER_API_KEY, GITHUB_TOKEN) into a workflow
   that only needs Docker Hub creds. Replaced with explicit
   `DOCKER_USERNAME` + `DOCKER_PASSWORD` mapping; the called workflow
   now declares these as required `workflow_call.secrets`.

6. **Drop unused `DEPS_HASH` build-arg**. R5A2 confirmed it was declared
   in the Dockerfile but never referenced in any RUN/COPY, so it never
   busted the Docker layer cache. Cache invalidation already happens
   correctly via `COPY pdm.lock` (file content hash). Removed the ARG
   declaration from Dockerfile and the three `build-args:` passes from
   prerelease-docker.yml.

R6 also REFUTED two earlier claims:
- R5A8's concurrency claim: reusable workflows DO share the caller's
  workflow run and concurrency group (R3A8 was correct). Don't add a
  `concurrency:` block to prerelease-docker.yml; it would create a
  separate group and re-introduce the race R5A8 imagined.
- R5A10's harden-runner CVE claim: v2.19.1 (used here) is well after
  the fix versions for both CVE-2026-32946 (v2.16.0) and CVE-2026-25598
  (v2.14.2). No bump needed.

* ci(release): R7 fixes — cosign v2.6.3, drop misleading manifest-level SBOM

Round 7 (5 agents) verified the R5/R6 fixes and surfaced two real bugs:

1. **cosign-installer pinned cosign v2.6.0**, which has two known security
   advisories: GHSA-whqx-f9j3-ch6m (fixed in v2.6.2) and GHSA-w6c6-c85g-mmv6
   (fixed in v2.6.3). Bumped pin to v2.6.3 in all three workflow files so
   the install step picks up the fixes. Same minor (v2.6.x), so no flag
   drift — `--replace`, `--type`, `--bundle`, `--certificate-*` all behave
   identically.

2. **The manifest-level SBOM attestation was misleading**. The previous
   step ran `syft <repo>@<manifest-list-digest>` on an amd64 runner,
   which (per anchore/syft#1708) only enumerates amd64 layers. The SBOM
   was then attested at the manifest-list digest where it was discoverable
   by ALL platform consumers — so an arm64 user verifying `:latest` would
   receive a signed SBOM that lies about the layers they actually pulled.
   The per-arch loop already produces accurate per-platform SBOMs; the
   manifest-level fallback only re-introduced the lie for UX convenience.

   Dropped the manifest-level attest call entirely. Per-arch SBOMs are the
   only honest representation. Updated the README's `cosign
   verify-attestation` recipe to resolve to the per-platform digest first
   (using `jq` over `imagetools inspect --raw`), so end-users on either
   architecture get the SBOM that actually describes what they pulled.
   Removed `sbom.spdx.json` from the workflow artifact + release-staging
   logic since it no longer exists.

3. **Empty-loop assertion**: added a defensive count check before the
   per-arch SBOM loop. If a future buildx output change ever produced
   zero per-arch entries (e.g., all entries marked architecture: unknown),
   the previous code would silently skip the loop and pass CI green with
   no SBOMs. Now it fails loud with the raw manifest dumped for debugging.

Note on round-7 reviewer's other concerns:
- "Pipe-to-while subshell scope": confirmed safe. set -euo pipefail
  inherited; failures in syft/cosign attest abort the subshell, and
  pipefail propagates to the outer step.
- "imagetools inspect --raw stability": OCI image-index spec is stable
  for ~7 years. The jq filter handles the BuildKit attestation pseudo-
  entries via `architecture != "unknown"`.
- "harden-runner v2.19.1 CVEs": false alarm. v2.19.1 is well above the
  fix versions (v2.16.0, v2.14.2). No bump needed.
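
The per-arch selection and the empty-loop assertion described above can be sketched in Python. The workflow itself uses `jq` over `imagetools inspect --raw`; this version only mirrors the filter (`architecture != "unknown"`) and the fail-loud check. The function name and the sample index with placeholder digests are assumptions.

```python
import json


def per_arch_entries(raw_index):
    """Return (platform, digest) pairs for the real per-arch images in an index."""
    index = json.loads(raw_index)
    entries = []
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        arch = platform.get("architecture", "unknown")
        if arch == "unknown":
            # Skip BuildKit attestation pseudo-entries.
            continue
        entries.append((f"{platform.get('os', 'unknown')}/{arch}", manifest["digest"]))
    if not entries:
        # Fail loudly and dump the raw manifest instead of passing with no SBOMs.
        raise RuntimeError("no per-arch manifests found in index:\n" + raw_index)
    return entries


# Sample image index with placeholder digests (not real images).
SAMPLE_INDEX = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:" + "a" * 64, "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:" + "b" * 64, "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:" + "c" * 64, "platform": {"os": "unknown", "architecture": "unknown"}},
    ],
})

if __name__ == "__main__":
    for platform, digest in per_arch_entries(SAMPLE_INDEX):
        print(platform, digest)
```

Each returned pair is what a per-arch step would hand to the SBOM generator and to the attest call, so every SBOM is attached to the digest it actually describes.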

* ci(release): R8 fixes from 8th review round

Round 8 (5 agents covering Dockerfile, npm/Vite, runtime image, edge
cases, and post-fix smoke check) surfaced 7 real bugs the previous 7
rounds missed. All fixed here, plus a comment per user request.

1. **docker-publish.yml checkout pinned to released tag**. The promote
   step reads `.trivyignore` from cwd; a `repository_dispatch`-triggered
   checkout defaults to the default branch's tip, which can drift between
   prerelease scan and promote scan if `.trivyignore` is edited on main
   while the release awaits approval. Added `ref: ${{
   github.event.client_payload.tag }}` to checkout.

2. **docker-publish.yml concurrency block added**. release.yml has its
   own concurrency, but docker-publish.yml is a separate workflow run.
   Two near-simultaneous publish-docker dispatches for the same release
   tag (e.g., a manual re-trigger after a transient Docker Hub 5xx) could
   interleave and have their cleanup-loop prefix-match deletions race
   each other. Group: `publish-docker-${{ github.event.client_payload.tag
   }}`, cancel-in-progress: false.

3. **publish.yml's frontend builder bumped from Node 20 → 24** to match
   `package.json`'s `engines: { node: ">=24.0.0" }`. Mismatched Node
   versions across the PyPI build (Node 20) and the Docker image (Node
   24, installed via NodeSource) could resolve transitive deps differently
   and ship frontend assets that fail at runtime. Pinned to specific
   `node:24-alpine` SHA.

4. **HEALTHCHECK no longer leaks Python processes**. The old
   `urllib.request.urlopen(...)` had no Python-level timeout, so a
   hung-but-alive backend would freeze the probe until Docker's outer
   timeout SIGKILL'd it — leaving a Python process per probe interval
   leaking PIDs/FDs over time. Added `timeout=5` and an explicit `r.status
   == 200` check so non-200 2xx responses (e.g., from misconfigured
   proxies) don't pass.

5. **Removed broken `VOLUME /scripts/`**. /scripts is image content (the
   ollama entrypoint baked in by the layer below the VOLUME directive),
   not user state. A VOLUME on an image-populated path causes anonymous-
   volume accumulation on every `docker run` and silently shadows the
   script if a user ever bind-mounts it.

6. **Added `VOLUME /data`** so users who don't bind-mount don't silently
   lose research data + encrypted DBs on `docker rm`. The entrypoint
   creates the persistent state at /data/{logs,cache,encrypted_databases},
   but without VOLUME the directory is part of the writable image layer.

7. **Stale comment in release.yml** (the SBOM download step) updated —
   no longer mentions the manifest-level SBOM that was dropped in
   commit 33d69b4e4.

Plus one comment update per user request:
8. **`apt-get upgrade -y` rationale comment** added at the
   build-once-promote section of the Dockerfile (top stage), and
   cross-referenced from the other two `apt-get upgrade` sites
   (ldr-test stage and runtime stage). Documents that the trade-off of
   bit-for-bit reproducibility for always-fresh CVE patches is
   intentional, and explains how build-once-promote mitigates the
   reproducibility loss.
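
Item 4 above (the HEALTHCHECK fix) amounts to a probe with a Python-level timeout and a strict status check. A minimal sketch, assuming a hypothetical health URL; the real endpoint and the exact one-liner in the Dockerfile are not shown here. The `opener` parameter exists only so the function can be exercised without a running server.

```python
import urllib.request

HEALTH_URL = "http://localhost:5000/api/v1/health"  # hypothetical endpoint


def probe(url=HEALTH_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return 0 if the endpoint answers HTTP 200 within the timeout, else 1."""
    try:
        with opener(url, timeout=timeout) as response:
            # Only 200 counts as healthy; other 2xx codes such as 204 do not.
            return 0 if response.status == 200 else 1
    except Exception:
        # Timeouts, refused connections, and HTTP error responses all land here.
        return 1
```

A HEALTHCHECK command would exit with this return value, for example via `sys.exit(probe())`, so a hung backend ends the probe after five seconds instead of waiting for Docker's outer timeout.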

* ci(release): clean up per-arch cosign attestation orphans on rejection

Round 9 found that the per-arch SBOM attestations introduced in commit
11e702f7d (the multi-arch SBOM fix) live at
`sha256-<per-arch-digest>.{sig,att,sbom}` keyed by the PER-ARCH manifest
digests, not the manifest-list digest. The cleanup-on-rejection job only
knew the manifest-list digest, so on rejection paths the per-arch
attestation artifacts were left orphaned on Docker Hub forever — and
unreachable through any tag, since the per-arch leaf tags were also
deleted.

Fix: before deleting the manifest tag, inspect it via `imagetools inspect
--raw` to discover the per-arch digests, then queue per-arch
`{sig,att,sbom}` deletions alongside the manifest-level cleanup. If the
manifest tag doesn't exist (e.g., build failed before manifest creation),
log a clear warning and proceed — the per-arch artifacts wouldn't have
been created in that case anyway.

* ci(release): drop prerelease env gate — use single release approval

The `prerelease` environment approval was a holdover from when prerelease
docker was a SEPARATE test build alongside the release build (two
distinct artifacts, two distinct decisions). In the build-once-promote
model the "prerelease" image IS the release image (just under a
different tag), so gating the BUILD with a human approval is redundant —
the only meaningful decision is whether the tested image becomes the
official release.

Changes:
- Remove the `approval-gate` sentinel job in prerelease-docker.yml.
- Drop `needs: [approval-gate]` from build-amd64, build-arm64, and
  security-scan. They now run automatically once release.yml's security
  + CI gates pass.
- Update workflow comments in release.yml and prerelease-docker.yml to
  reflect the single-gate flow.
- Update RELEASE_GUIDE.md "Approval and Publishing" section: now
  describes ONE `release` env approval, not two.
- Update CI_CD_INFRASTRUCTURE.md row for prerelease-docker.yml.

The cleanup-on-rejection job is unchanged — its triggers still fire
correctly on prerelease-docker `failure`/`cancelled` (build/sign/attest
errors) and on create-release / trigger-workflows `failure`/`cancelled`
(release env rejection). One fewer rejection path to consider, but the
mechanism is the same.

Operational benefits:
- One fewer approval click per release
- One fewer GitHub Environment to create as a pre-merge setup step
  (no more "create the `prerelease` env in Settings before merging")
- Build completes during/after security gates, so the prerelease tag is
  ready by the time the maintainer is ready to test

* ci(docker-publish): group GITHUB_OUTPUT writes (shellcheck SC2129)

CI's actionlint hook (which runs shellcheck on workflow run blocks)
flagged the 'Determine release tags' step for issuing five sequential
`echo ... >> "$GITHUB_OUTPUT"` redirects. Grouped them into a single
braced block + one redirect, per SC2129's recommendation.
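
The workflow step itself is shell, but the same idea (compute all outputs first, then append them in one write) can be sketched in Python. The output names `version`, `major_minor`, and `latest` are hypothetical; the step's five actual output names are not listed here. The `key=value` line format is the one GitHub Actions reads from the `GITHUB_OUTPUT` file.

```python
import os


def release_tags(git_tag):
    """Derive the three Docker tags for a release from a git tag like 'v1.6.9'."""
    version = git_tag[1:] if git_tag.startswith("v") else git_tag
    major, minor, _rest = version.split(".", 2)
    return {
        "version": version,                 # e.g. 1.6.9
        "major_minor": f"{major}.{minor}",  # e.g. 1.6
        "latest": "latest",
    }


def format_outputs(outputs):
    """Render key=value lines, one per output."""
    return "".join(f"{key}={value}\n" for key, value in outputs.items())


def write_outputs(outputs, path):
    """Append every output with one open() and one write()."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_outputs(outputs))


if __name__ == "__main__":
    rendered = release_tags("v1.6.9")
    target = os.environ.get("GITHUB_OUTPUT")
    if target:
        write_outputs(rendered, target)
    else:
        print(format_outputs(rendered), end="")
```

This sketch expects a three-part version; a tag such as `v1.6` would raise a `ValueError` at the unpacking step.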

* docs(release): correct approval flow after env-scoped secrets merge

After merging main, prerelease-docker.yml's four jobs declare
`environment: release` (PRs #3978/#3983) because DOCKER_USERNAME and
DOCKER_PASSWORD are env-scoped. That means the first `release` env
approval now gates the canonical build, not just the publish step —
the "automatic build then test then approve" flow described in earlier
docs no longer matches reality.

- RELEASE_GUIDE.md: rewrite the approval section to describe two
  release-env approvals (release.yml + docker-publish.yml) and the
  narrow Docker-only test window between them.
- CI_CD_INFRASTRUCTURE.md: update the prerelease-docker.yml row.
- release.yml: rewrite the `prerelease-docker:` job comment to reflect
  that this step is gated, not automatic, and explain why.

* ci(release): atomic publish ordering — GitHub Release runs last (#4044)

* ci(release): make GitHub Release publishing atomic with Docker + PyPI

Before this change, `create-release` published the public GitHub Release
BEFORE `docker-publish.yml` retagged and BEFORE `publish.yml` shipped to
PyPI. If either downstream failed, the public Release pointed at
non-existent artifacts.

This change closes that window:

- Convert `docker-publish.yml` from `repository_dispatch` to
  `workflow_call`. Its result is now visible to release.yml as
  `needs.publish-docker.result`, which lets:
  * `create-release` block on Docker promote success
  * `cleanup-on-rejection` safely scope cosign artifact deletion to
    cases where retag failed (after a successful retag, release tags
    share the prerelease manifest digest, so cosign artifacts must
    stay — deleting them would invalidate release-tag verification)
- Keep `publish.yml` on `repository_dispatch`. PyPI Trusted Publishing
  matches the OIDC `workflow_ref` claim against the CALLER when invoked
  via `workflow_call`, so a reusable publish.yml would fail with
  `invalid-publisher`. Tracked in pypa/gh-action-pypi-publish#166 and
  pypi/warehouse#11096.
- Restructure release.yml job graph:
    prerelease-docker → publish-docker (reusable) → trigger-pypi
      → monitor-pypi → create-release (LAST)
- Rewrite `cleanup-on-rejection` with a partial-retag rollback preamble.
  `imagetools create -t :VERSION -t :MAJOR_MINOR -t :latest` is a single
  process with multiple registry calls, so a mid-step failure can leave
  some release tags landed. The cleanup script now checks each release
  tag against Docker Hub and rolls back any that exist BEFORE deleting
  cosign signature/attestation artifacts.
- Slim `monitor-publish` → `monitor-pypi` (only watches publish.yml now;
  Docker is tracked natively via the inline job result).
- Drop the workflow-level `concurrency:` block from docker-publish.yml.
  As a reusable workflow it shares release.yml's run, and release.yml's
  caller-level concurrency on `github.ref` already serialises releases
  for the same tag.
- Update `docs/CI_CD_INFRASTRUCTURE.md` workflow-table rows and
  `docs/RELEASE_GUIDE.md` approval-flow section to describe the new
  ordering, plus a "Recovery from PyPI failure" section documenting the
  one remaining atomicity hole (PyPI fails after Docker success — Docker
  release tags exist, no PyPI, no GH Release; manual re-dispatch needed).
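
The ordering rule for `cleanup-on-rejection` (roll back any release tags that landed, then delete cosign artifacts, and skip both once the retag has fully succeeded) can be sketched as a pure planning function. This is an assumption-laden model: the real script talks to Docker Hub, and its existence check and deletion calls are not shown here. All names below are hypothetical.

```python
def plan_cleanup(release_tags, existing_tags, cosign_tags, retag_succeeded):
    """Return an ordered list of (action, tag) steps for a rejected release."""
    if retag_succeeded:
        # Release tags now share the manifest digest, so the cosign
        # artifacts must stay or release-tag verification would break.
        return []
    # Step 1: roll back release tags left behind by a partial retag.
    steps = [("delete-release-tag", tag) for tag in release_tags if tag in existing_tags]
    # Step 2: only afterwards remove the digest-keyed cosign artifacts.
    steps.extend(("delete-cosign-artifact", tag) for tag in cosign_tags)
    return steps


if __name__ == "__main__":
    plan = plan_cleanup(
        release_tags=["1.6.9", "1.6", "latest"],
        existing_tags={"1.6.9"},  # pretend only the first tag landed
        cosign_tags=["sha256-<digest>.sig", "sha256-<digest>.att"],
        retag_succeeded=False,
    )
    for step in plan:
        print(step)
```

Deleting the tags first means a release tag never resolves to an image whose signature has already been removed.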

Plan + 5-agent Round 1 review notes saved separately.

* fix(release): plug blockers found in multi-round PR review

Four fixes against the atomicity refactor — two blockers that would
break the next release, two hardening items found while verifying them.

B1 (BLOCKING): docker-publish.yml checked out at `ref: inputs.tag`
(e.g. v1.6.11), but the v* git tag is created by `create-release`
which runs LAST in the job graph — after `publish-docker`. So on every
push-to-main triggered release (the documented primary path) the
checkout would fail with `fatal: couldn't find remote ref v1.6.11`.
Switch to `ref: github.sha`: same triggering commit the build and
prerelease-docker jobs used, exists at the moment publish-docker
runs for every event type, and still satisfies the original goal
of pinning .trivyignore to the scanned commit.

B2 (BLOCKING): cleanup-on-rejection referenced env-scoped
DOCKER_USERNAME / DOCKER_PASSWORD but had no `environment: release`,
so those secrets resolved to empty strings and the Docker Hub login
exited 1 — leaving the orphan tags + cosign artifacts the cleanup
was meant to remove. Add `environment: release`. The `release` env
approval was already granted upstream in the run, so no new prompt.

H1: monitor-pypi's `Wait for PyPI publish workflow to complete` step
piped `gh run list | jq ...` without `set -euo pipefail`, so a
transient gh failure (network, auth, rate limit) was swallowed by
jq returning empty input — burning the full 40-minute budget on
silent error rather than failing fast. Add `set -euo pipefail`.

H2: cleanup-on-rejection's step 2 did not delete the floating
`:prerelease` tag. If a release was rejected after prerelease-docker
re-pointed `:prerelease`, step 4 deleted the cosign signature for
that manifest while `:prerelease` still pointed at it — yielding a
window where pulling `:prerelease` returns an image the README
cosign-verify recipe cannot verify. Include `prerelease` in step 2's
delete loop; the next successful prerelease-docker re-creates it.

* chore(release): follow-up cleanups from PR review

Bundle of low-risk follow-ups from the multi-round review of this PR.
All same-scope as the atomicity refactor — staleness this PR introduced
in docs/comments, hardening adjacent to the changed code paths.

L1 (hardening): Drop `id-token: write` from `publish-docker` (caller)
and `docker-publish.yml` `promote` (callee). cosign VERIFY is a
read-only check against public Rekor/Fulcio; no GitHub OIDC token is
minted, so the permission is unused. Signing (which DOES need the
write) is exclusively in prerelease-docker.yml.

L7 (stale comments): prerelease-docker.yml's header comments still
referenced `trigger-workflows` — a job this PR split into
`publish-docker` + `trigger-pypi`. Replaced both occurrences.

L4 (doc): RELEASE_GUIDE.md "Emergency Procedures" claimed a manual
GitHub release "still triggers PyPI/Docker" — false under the new
design (publish.yml is repository_dispatch-only and docker-publish.yml
is workflow_call-only, neither listens on `release:` events). Replaced
with the actual recovery hierarchy.

L5 (doc): RELEASE_GUIDE.md and CI_CD_INFRASTRUCTURE.md pipeline chains
omitted the `provenance` job between `build` and `prerelease-docker`.

L6 (doc): RELEASE_GUIDE.md described monitor-pypi's timeout as a flat
"40 min" — the inner poll loop is 40 min but the surrounding
`timeout-minutes:` is 90 min, so the user-facing failure surface differs.

L4-bonus (doc): Manual-trigger section also claimed workflow_dispatch
takes "version and prerelease flag" inputs — release.yml's
`workflow_dispatch:` has no inputs defined. Replaced with the actual
behavior (reads __version__.py at HEAD; use tag-push for older versions).

M5 (doc): Both PAT_TOKEN comments overstated required scopes — claimed
`workflow` scope was needed (it isn't; it only governs editing
.github/workflows/ via the API) and didn't make explicit that
`public_repo` is rejected by `repository_dispatch`. Rewritten.

M8 (correctness): docker-publish.yml's cosign verify step targeted the
mutable `:VERSION` tag instead of `@${EXPECTED_DIGEST}`. The preceding
verify-promoted-tags step already confirms the tag resolves to the
expected digest, but using the tag here leaves a tag-resolution TOCTOU
window between the two steps. Trivy's re-scan already uses
`@${EXPECTED_DIGEST}`; switching cosign to the same reference is
consistent and race-free.

L2 (style): While editing the cosign step, routed `github.repository`
through an `env:` var (`REPO`) instead of direct `${{ }}` template
interpolation into shell args, matching the convention in the rest of
this workflow.

* chore(ci): bump harden-runner pin in docker-publish.yml to match other workflows

Last remaining v2.19.1 reference — every other workflow in this PR was
bumped to v2.19.3 when main moved forward. Auto-merge missed this one
because the surrounding hunk was in a conflict region.

* chore(release): fixes from multi-round subagent review of the full PR

Bundle of low-risk fixes confirmed by 30 subagents across 3 rounds.
None are blockers; all are worth fixing in-scope.

1. SLSA provenance builder.id: was github.workflow_ref, which inside a
   workflow_call callee resolves to the CALLER (release.yml), not the
   intended callee (prerelease-docker.yml). The Fulcio cert is still
   right (built from the job_workflow_ref OIDC claim), so cosign verify
   and slsa-verifier are unaffected, but raw-JSON consumers reading
   builder.id would see release.yml. Compose the value from
   github.repository + hardcoded path + github.ref instead — the `job`
   context has no workflow_ref property (actionlint confirms), and for
   a local-path workflow_call the callee's ref equals github.ref.

2. Dockerfile: set ENV LDR_DATA_DIR=/data so the VOLUME /data directive
   is actually load-bearing. Without it, paths.py falls back to
   platformdirs (~/.local/share/local-deep-research) which is inside the
   ephemeral container layer — bare docker run -v vol:/data users would
   silently lose data on docker rm.

3. trigger-pypi: forward prerelease=false in client_payload. publish.yml
   gates Test PyPI vs prod PyPI on client_payload.prerelease == true; if
   absent, the expression evaluates to '' and falls through to prod. Set
   false explicitly to remove the silent-fallback landmine.

4. Stale/misleading cosign comments in release.yml:
   - line 322: said "v2.6.0" while value is "v2.6.3" — corrected and
     noted GHSA-w6c6-c85g-mmv6 patch coverage
   - line 332: attributed --bundle to v3.0.2+ but it's been in v2.4.0+

5. release-gate.yml Node 20 → 24 (mirror publish.yml + Dockerfile).
   package.json declares engines.node >=24.0.0. The pip-install-check
   wheel is discarded so this was not a release-blocker, but the gate
   now validates the actual ship runtime.

6. README cosign-verify recipe:
   - Guard empty PLATFORM_DIGEST with a clear message for single-arch
     or pre-build-once-promote releases
   - Add docker buildx to prerequisites list
   - Spell out the legacy-verification substitution explicitly
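
Item 2 in the list above can be pictured with a short Python sketch. This is
an assumption about the lookup order described there, not the actual contents
of paths.py: the function name `resolve_data_dir` is hypothetical, and the
home-based path stands in for the platformdirs fallback.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the data directory: explicit LDR_DATA_DIR first, fallback second."""
    env = os.environ if env is None else env
    configured = env.get("LDR_DATA_DIR")
    if configured:
        # With ENV LDR_DATA_DIR=/data in the Dockerfile, this branch is taken
        # and data lands on the mounted volume.
        return Path(configured)
    # Stand-in for the platformdirs fallback named in item 2 (assumption).
    return Path.home() / ".local" / "share" / "local-deep-research"
```

With the variable unset, the returned path sits under the home directory,
which is the location item 2 describes as living in the ephemeral container
layer.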

* fix(ci): pin Trivy in promote step via SHA-pinned action wrapper

AI reviewer flagged docker-publish.yml's promote step as installing Trivy
via `sudo apt-get install -y trivy` with no version pin, reintroducing a
supply-chain risk to the release path. The prerelease scan in
prerelease-docker.yml uses the SHA-pinned aquasecurity/trivy-action
@ed142fd... wrapper with `version: 'v0.69.2'`, but the promote step
switched to the bare CLI and lost that protection.

Replace the apt-get install + raw `trivy image` invocation with the same
pinned action wrapper. Same scan semantics (CRITICAL,HIGH, ignore-unfixed,
.trivyignore, exit-code 1), same binary version (v0.69.2), same action
SHA — keeps the two scans consistent and removes the unpinned apt path.

* fix(ci): pin Trivy in release.yml build job — same fix as docker-publish.yml

R4 review caught that the AI-reviewer-flagged unpinned Trivy install also
exists in release.yml's `build` job, and is STRICTLY WORSE there because
that job carries `id-token: write` (for cosign keyless signing of SBOMs).

The attack chain that was open:
1. Aqua apt-repo compromise OR MITM of the unpinned GPG-key fetch
2. Malicious `trivy fs` binary installed
3. Binary exfiltrates ACTIONS_ID_TOKEN_REQUEST_URL/TOKEN env vars,
   minting an OIDC token under repo:LearningCircuit/local-deep-research
4. Binary tampers with sbom-spdx.json / sbom-cyclonedx.json contents
5. Next step `Sign release artifacts with Sigstore` cosign-signs the
   tampered SBOM with a legitimate Sigstore cert → fraudulent SBOM
   attached to the GitHub release with valid signature

Replace with the SHA-pinned aquasecurity/trivy-action@ed142fd0... (same
pin as docker-publish.yml and prerelease-docker.yml) using scan-type=fs
for the filesystem scan, with `version: 'v0.69.2'` to pin the binary
itself. Two separate action invocations (one per output format) because
the action takes a single format per run.

Also removes the unpinned `gpg --dearmor` of an unverified-fingerprint
public key, which the prior comment misleadingly called "secure".

* fix(ci): use TRIVY_USERNAME/PASSWORD env vars for trivy-action auth

The trivy-action README prescribes TRIVY_USERNAME/TRIVY_PASSWORD env
vars as the supported Docker Hub auth path. Even though
docker/login-action already wrote ~/.docker/config.json earlier in the
job (and Trivy reads it as a fallback), there's documented fragility
with docker.io credential helpers (aquasecurity/trivy#432,
aquasecurity/trivy#8385) that surfaces specifically on registry-pull
scans like this one (unlike the prerelease scan, which uses a
locally-loaded image).

The fallback would probably work today since
localdeepresearch/local-deep-research is public (anonymous pull would
succeed even without auth), but rate-limiting on anonymous Docker Hub
pulls is aggressive and the documented credential-helper quirks are
real. Adding the env vars uses the action's prescribed auth path, with
the same DOCKER_USERNAME/DOCKER_PASSWORD secrets already passed in via
workflow_call. Zero-cost defense-in-depth.

2026-05-22 21:52:46 +02:00

12 KiB


CI/CD and Infrastructure Documentation

This document describes the continuous integration, security scanning, and development infrastructure used by the Local Deep Research project.

Overview

The project uses many GitHub Actions workflows and 20+ pre-commit hooks to ensure code quality, security, and reliability.

At-a-glance health: see docs/ci/workflow-status.md — an auto-generated dashboard with live badges for every workflow, surfacing disabled, manual-only, and stale (silently-failing) ones at the top. Regenerate with pdm run python scripts/generate_workflow_status.py.

┌─────────────────────────────────────────────────────────────────┐
│                        Developer Workflow                        │
├─────────────────────────────────────────────────────────────────┤
│  Local Development          │  Pull Request        │  Main/Dev  │
│  ─────────────────          │  ────────────        │  ────────  │
│  • Pre-commit hooks         │  • All tests         │  • Deploy  │
│  • Unit tests               │  • Security scans    │  • Publish │
│  • Linting                  │  • Code review       │  • Release │
└─────────────────────────────────────────────────────────────────┘

Pre-Commit Hooks

Pre-commit hooks run locally before each commit. Install with:

pre-commit install
pre-commit install-hooks

Standard Hooks

Hook	Purpose
`check-yaml`	Validate YAML syntax
`end-of-file-fixer`	Ensure files end with newline
`trailing-whitespace`	Remove trailing whitespace
`check-added-large-files`	Block files >1MB
`check-case-conflict`	Prevent case-sensitivity issues
`forbid-new-submodules`	Prevent git submodules

Security Hooks

Hook	Purpose
`gitleaks`	Detect secrets, API keys, passwords in code
`check-sensitive-logging`	Prevent logging of passwords, tokens, keys
`check-safe-requests`	Enforce SSRF-safe HTTP functions (`safe_get`, `safe_post`)
`check-url-security`	Validate URL handling in JavaScript (XSS prevention)
`file-whitelist-check`	Only allow approved file types
`check-image-pinning`	Require SHA256 digests for Docker images
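
As an illustration of what a check like `check-image-pinning` might look for, the sketch below flags `FROM` lines whose image lacks a `sha256` digest. It is not the project's hook; the function name `unpinned_base_images` and the exact rules (skipping `scratch` and earlier build stages) are assumptions.

```python
import re

# Matches a trailing digest such as `python:3.12-slim@sha256:<64 hex chars>`.
_PINNED = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_base_images(dockerfile_text: str) -> list:
    """Return image references from FROM lines that lack a sha256 digest."""
    offenders = []
    stage_names = set()
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        # References to an earlier build stage and `scratch` need no digest.
        is_stage = image.lower() in stage_names
        if not is_stage and image != "scratch" and not _PINNED.search(image):
            offenders.append(image)
        # Remember `AS <name>` aliases so later stages can refer to them.
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2].lower())
    return offenders
```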

Code Quality Hooks

Hook	Purpose
`ruff`	Python linter (with auto-fix)
`ruff-format`	Python formatter (Black-compatible)
`eslint`	JavaScript linter
`shellcheck`	Shell script linter
`actionlint`	GitHub Actions workflow validator
`custom-code-checks`	Loguru usage, UTC datetime, raw SQL detection

Project-Specific Hooks

Hook	Purpose
`check-env-vars`	Environment variables must use `SettingsManager`
`check-deprecated-db-connection`	Enforce per-user database connections
`check-ldr-db-usage`	Prevent shared `ldr.db` usage
`check-research-id-type`	`research_id` must be string/UUID, not int
`check-datetime-timezone`	All DateTime columns (models and migrations) must use `UtcDateTime` from `sqlalchemy_utc`
`check-session-context-manager`	Require context managers for DB sessions
`check-pathlib-usage`	Use `pathlib.Path` instead of `os.path`
`check-no-external-resources`	No external CDN/resource references
`check-css-class-prefix`	CSS classes must have `ldr-` prefix

GitHub Actions Workflows

Test Workflows

Workflow	Trigger	Purpose
`docker-tests.yml`	PR, push	Consolidated Docker tests: pytest + coverage, UI tests (51 Puppeteer tests), LLM tests, infrastructure tests (single Docker build shared across all jobs). Includes tests previously in critical-ui-tests, extended-ui-tests, metrics-analytics-tests, library-ui-tests, mobile-ui-tests, and news-tests workflows.
`e2e-research-test.yml`	PR, push	End-to-end research flow
`fuzz.yml`	Schedule	Fuzzing tests

Security Scanning

Workflow	Trigger	Purpose
`codeql.yml`	PR, push, schedule	GitHub CodeQL analysis
`semgrep.yml`	PR, push	Semgrep static analysis
`osv-scanner.yml`	PR, push, schedule	OSV vulnerability scanning (Python + npm)
`gitleaks.yml`	PR, push	Secret detection
`security-tests.yml`	PR, push	Security-focused test suite
`devskim.yml`	PR, push	Microsoft DevSkim analysis
`checkov.yml`	PR, push	Infrastructure-as-code scanning
`container-security.yml`	PR, push	Container vulnerability scanning
`hadolint.yml`	PR, push	Dockerfile linting
`owasp-zap-scan.yml`	Schedule	OWASP ZAP dynamic scanning
`retirejs.yml`	PR, push	JavaScript vulnerability scanning
`zizmor-security.yml`	PR, push	GitHub Actions workflow static analysis (zizmor)
`ossf-scorecard.yml`	Schedule	OpenSSF Scorecard
`security-headers-validation.yml`	PR, push	HTTP security headers
`security-file-write-check.yml`	PR, push	File write security
`npm-audit.yml`	PR, push	npm audit for JS dependencies

Dependency Management

Workflow	Trigger	Purpose
`dependency-review.yml`	PR	Review dependency changes
`update-dependencies.yml`	Schedule	Auto-update Python deps
`update-npm-dependencies.yml`	Schedule	Auto-update npm deps
`update-precommit-hooks.yml`	Schedule	Update pre-commit hooks
`validate-image-pinning.yml`	PR, push	Verify Docker image pins

UI/Accessibility

Workflow	Trigger	Purpose
`responsive-ui-tests-enhanced.yml`	PR, push	Responsive design tests

Build & Deploy

Workflow	Trigger	Purpose
`prerelease-docker.yml`	`workflow_call` from release.yml	Canonical multi-arch Docker build, cosign sign, SBOM/SLSA attestations. Jobs declare `environment: release` so the first `release` env approval gates the build (env-scoped Docker Hub secrets).
`docker-publish.yml`	`workflow_call` from release.yml	Retag prerelease manifest as `:1.6.9` / `:1.6` / `:latest` (gated by `release` env). No rebuild — registry-side metadata only. Inlined as a reusable workflow so its result is visible to downstream jobs in release.yml (lets create-release block on Docker success, lets cleanup-on-rejection safely scope cosign artifact deletion).
`docker-multiarch-test.yml`	PR, push	Multi-architecture build test
`publish.yml`	`repository_dispatch` from release.yml	Publish to PyPI. Stays on `repository_dispatch` (not `workflow_call`) because PyPI Trusted Publishing rejects OIDC claims from reusable workflows — `pypa/gh-action-pypi-publish#166`, `pypi/warehouse#11096`.
`release.yml`	Push to `main`, tag `v*.*.*`, manual	Orchestrate release: gates → build → provenance → prerelease-docker → publish-docker → trigger-pypi → monitor-pypi → create-release (last)
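
The retag step in `docker-publish.yml` promotes one manifest under three tags (`:1.6.9`, `:1.6`, `:latest` in the table above). A small sketch of how those tags could be derived from a version string follows; the function name `promoted_tags` is hypothetical, and it assumes a plain `MAJOR.MINOR.PATCH` version without a leading `v`.

```python
def promoted_tags(version: str) -> list:
    """Derive the release tags for a MAJOR.MINOR.PATCH version string."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, _patch = parts
    # Exact version, floating minor tag, and the moving `latest` tag.
    return [version, f"{major}.{minor}", "latest"]
```

All three tags point at the same manifest digest, which is why promotion needs no rebuild.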

Code Quality

Workflow	Trigger	Purpose
`pre-commit.yml`	PR, push	Run pre-commit hooks in CI
`mypy-type-check.yml`	PR, push	Python type checking
`ai-code-reviewer.yml`	PR	AI-assisted code review
`claude-code-review.yml`	PR	Claude-based code review

Repository Management

Workflow	Trigger	Purpose
`sync-main-to-dev.yml`	Push to main	Sync main branch to dev
`label-fixed-in-dev.yml`	Push to dev	Auto-label fixed issues
`danger-zone-alert.yml`	PR	Alert on sensitive file changes
`check-env-vars.yml`	PR, push	Environment variable validation
`file-whitelist-check.yml`	PR, push	File type validation
`version_check.yml`	PR, push	Version consistency check

Dependabot Configuration

Dependabot automatically creates PRs for dependency updates:

Ecosystem	Directories	Schedule
Python (pip)	`/`	Weekly (Monday 04:00)
npm	`/`, `/tests/*`	Weekly/Daily
GitHub Actions	`/`	Weekly
Docker	`/`	Daily

Coverage Reporting

Coverage reports are generated by the docker-tests.yml workflow (pytest-tests job):

HTML Report: Deployed to GitHub Pages at https://learningcircuit.github.io/local-deep-research/coverage/
PR Comments: Each PR receives a comment with coverage percentage
Badge: Coverage badge updated via GitHub Gist

Configuration in pyproject.toml:

[tool.coverage.run]
source = ["src"]
omit = ["*/tests/*", "*/migrations/*"]

[tool.coverage.report]
exclude_lines = ["pragma: no cover", "if TYPE_CHECKING:"]

Security Architecture

Supply Chain Security

Dependency Pinning: All GitHub Actions are pinned to full commit SHAs
Docker Image Pinning: All base images use SHA256 digests
Lock Files: pdm.lock and package-lock.json committed
Vulnerability Scanning: OSV-Scanner, npm audit, RetireJS

Runtime Security

SSRF Protection: safe_get(), safe_post(), SafeSession wrappers
XSS Prevention: DOMPurify for HTML sanitization
SQL Injection Prevention: SQLAlchemy ORM (no raw SQL)
Secret Management: Environment variables via SettingsManager

Container Security

Non-root User: Containers run as ldruser:1000
Minimal Base Image: Python slim images
Health Checks: Docker health check endpoints
Read-only Where Possible: Minimal write permissions

Running Tests Locally

Quick Test (Unit Tests Only)

pdm run pytest tests/test_settings_manager.py tests/test_utils.py -v

Full Test Suite

pdm run pytest tests/ --ignore=tests/ui_tests --ignore=tests/fuzz -v

With Coverage

pdm run pytest tests/ --cov=src --cov-report=html -v
open coverage/htmlcov/index.html  # macOS; use xdg-open on Linux

UI Tests (Requires Server)

# Terminal 1: Start server
pdm run ldr-web

# Terminal 2: Run UI tests
cd tests/ui_tests && npm test

Docker Testing

Build and run tests in Docker:

# Build test image
docker build --target ldr-test -t ldr-test .

# Run tests
docker run --rm -v "$PWD":/app -w /app ldr-test \
  pytest tests/ --ignore=tests/ui_tests -v

Environment Variables for CI

Variable	Purpose
`CI=true`	Indicates CI environment
`LDR_TESTING_WITH_MOCKS=true`	Enable test mocks
`LDR_DISABLE_RATE_LIMITING=true`	Disable HTTP rate limits in tests (canonical name). The legacy `DISABLE_RATE_LIMITING=true` is still honored but emits a deprecation warning. Distinct from `LDR_RATE_LIMITING_ENABLED`, which controls the adaptive search-engine rate limiter — different subsystem.
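
The canonical-versus-legacy behavior described for `LDR_DISABLE_RATE_LIMITING` can be sketched as below. This is an illustration under assumptions, not the project's implementation: the function name `rate_limiting_disabled`, the set of accepted truthy values, and the use of Python's `warnings` module for the deprecation notice are all hypothetical.

```python
import os
import warnings
from typing import Mapping, Optional

# Assumed set of truthy spellings; the project may accept others.
_TRUTHY = {"1", "true", "yes"}


def rate_limiting_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the canonical variable first, then the deprecated legacy name."""
    env = os.environ if env is None else env
    canonical = env.get("LDR_DISABLE_RATE_LIMITING")
    if canonical is not None:
        return canonical.strip().lower() in _TRUTHY
    legacy = env.get("DISABLE_RATE_LIMITING")
    if legacy is not None:
        # The legacy name still works but nudges callers to migrate.
        warnings.warn(
            "DISABLE_RATE_LIMITING is deprecated; use LDR_DISABLE_RATE_LIMITING",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy.strip().lower() in _TRUTHY
    return False
```

Checking the canonical name first means a stale legacy value cannot override an explicit new-style setting.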

Adding New Workflows

When adding a new workflow:

1. Pin every action to a full commit SHA
2. Add permissions: {} at top level (minimal permissions)
3. Add job-level permissions as needed
4. Include a step-security/harden-runner step
5. Add the workflow to this documentation

Example template:

name: New Workflow

on:
  pull_request:
    branches: [main]

permissions: {}

jobs:
  example:
    runs-on: ubuntu-latest
    permissions:
      contents: read

    steps:
      - name: Harden the runner
        uses: step-security/harden-runner@... # pinned
        with:
          egress-policy: audit

      - uses: actions/checkout@... # pinned
        with:
          persist-credentials: false
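
A quick way to sanity-check a new workflow against the pinning rule is to scan its `uses:` lines for anything not pinned to a full 40-character commit SHA. The sketch below is a hypothetical helper (`unpinned_actions`), not a script from this repository; it uses a simple line-based regex rather than a YAML parser and treats local `./` references as exempt.

```python
import re

# Captures the reference after `uses:`, stopping at whitespace or a comment.
_USES = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
# A full-length commit SHA is 40 lowercase hex characters.
_FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """Return `uses:` references that are not pinned to a full commit SHA."""
    offenders = []
    for line in workflow_text.splitlines():
        match = _USES.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions and reusable workflows carry no ref to pin.
        if ref.startswith("./"):
            continue
        if not _FULL_SHA.search(ref):
            offenders.append(ref)
    return offenders
```

Run against the template above, the elided `@...` placeholders would be reported until they are replaced with real commit SHAs.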
