* ci(release): build-once-promote refactor for Docker pipeline
Today the release pipeline builds the Docker image twice — once in
prerelease-docker.yml for "testing" and again in docker-publish.yml for
the actual release. The image you tested is not the image you ship: base
layer patches, transitive deps, and apt/pip resolution can diverge
between the two builds.
This refactor makes prerelease-docker.yml the canonical build and turns
docker-publish.yml into a thin retag step. `docker buildx imagetools
create` is a registry-side metadata operation that takes seconds and
preserves the manifest digest, so the released image is bit-identical
to the one tested. Cosign signatures, SBOM attestations, and SLSA
provenance are stored at sha256-<digest>.{sig,att} keyed by digest, so
signing once in prerelease covers the release tags transitively.
Pipeline shape changes:
- prerelease-docker.yml is now a reusable workflow (workflow_call) called
from release.yml. It builds, scans (Trivy), signs (cosign), attests
the SBOM (cosign attest --type spdxjson, replacing the deprecated
cosign attach sbom), and emits SLSA provenance. The manifest_digest is
exposed as a workflow output. The `prerelease` environment gates the
first build job for human approval.
- docker-publish.yml shrinks from ~457 to ~250 lines. It receives source_tag
and expected_digest in the dispatch payload, verifies the source digest
before retag, retags via imagetools create, verifies the digest is
preserved (defense against re-encoding), re-runs Trivy against the
digest (catches CVE-DB updates between prerelease and promote),
verifies that the cosign signature carries over to the release tags, and runs the existing
prerelease cleanup loop.
- release.yml adds prerelease-docker to create-release.needs and
trigger-workflows.needs, so the GitHub Release and the publish dispatch
only happen after the canonical Docker build completes. The dispatch
payload now carries source_tag and expected_digest. A new
cleanup-on-rejection job removes orphan prerelease tags and cosign
artifacts when the release is rejected (without it, every rejection
would leave dangling sha256-<digest>.{sig,att} on Docker Hub).
- README cosign verify example updated to the keyless invocation users
actually need (identity regex pointing at prerelease-docker.yml,
--certificate-oidc-issuer, --certificate-github-workflow-repository),
plus the SBOM verify-attestation command.
Notable design decisions (verified across multiple subagent review
rounds):
- SLSA provenance entryPoint stays as release.yml (the top-level caller).
Per the SLSA GHA buildtype v1 spec and the canonical
slsa-github-generator behavior, reusable workflows are explicitly NOT
entryPoints — pointing at prerelease-docker.yml would break verifier
policies that allowlist trigger workflows.
- Cosign cert identity for verification matches Fulcio's SAN URI, which
is built from job_workflow_ref — the CALLEE for reusable workflows. So
the identity regex matches prerelease-docker.yml even though the build
is invoked from release.yml. Hardened with escaped dots, refs/(heads|tags)/
constraint, and --certificate-github-workflow-repository to defend
against the reusable-workflow-identity-reuse class of attacks.
- cleanup-on-rejection uses an allowlist if (failure || cancelled), not
a denylist (!= 'success'), to avoid firing on `skipped` (e.g. when
release_exists short-circuits the run). It also fails loudly on 401/403
from the Docker Hub API so a missing Delete scope on the PAT can't
silently let orphans accumulate.
Supersedes #3969 (split-environment): the env split is preserved by the
new structure — prerelease env on the called workflow's first job,
release env on create-release/trigger-workflows.
Pre-merge checklist for the maintainer:
- Create the `prerelease` environment in GitHub Settings with the same
required reviewers as `release`. Without it, the called workflow's
approval gate auto-creates the env with no protection rules and
silently approves the build.
- Verify DOCKER_USERNAME / DOCKER_PASSWORD remain repo-level secrets
(they currently are). Environment-scoped secrets do not propagate
across reusable workflow calls except via the called job's own
environment.
* ci(release): fixes from multi-round subagent review
Round 1 surfaced 14 candidate findings; Round 2 verified 7 as real bugs and
refuted 4 as false positives. This commit applies the verified fixes.
CONFIRMED bugs fixed:
1. **Approval gate was per-job, not workflow-wide.** The previous
`environment: prerelease` was set on `build-amd64` only, which let `build-arm64`
and `security-scan` run pre-approval (GitHub environments are
job-scoped per docs + community/discussions/174381). Replaced with a
sentinel `approval-gate` job that all three build jobs `needs:`. Single
approval click still gates everything, but now actually blocks all
parallel jobs.
2. **`cleanup-on-rejection` if-condition missed the prerelease-rejection
path.** When prerelease-docker.result was `failure`, both create-release
and trigger-workflows became `skipped` (their `if:` requires success),
and the cleanup `if:` only fired on `failure`/`cancelled` of dependents.
Added explicit `prerelease-docker.result == 'failure'` clause so the
most common rejection path actually triggers cleanup.
3. **Trivy re-scan ran AFTER retag.** A failing scan would leave release
tags `:1.6.9`, `:1.6`, `:latest` publicly published with no rollback.
Reordered: scan source digest BEFORE retag. Content is bit-identical
(same digest), so scanning the prerelease tag tests what would be
promoted — but failure now leaves no public broken tags. Also moved
cosign verify before retag for the same reason.
4. **Trivy only scanned linux/amd64 by default** against a manifest list
digest (per Trivy docs + aquasecurity/trivy#7847). Replaced single scan
with two explicit per-platform invocations
(`--platform linux/amd64`, then `linux/arm64`) so arm64 layers are also
gated by the freshness check.
5. **Trivy DB freshness wasn't guaranteed.** apt-installed Trivy may use a
stale embedded DB. Added explicit `trivy image --download-db-only`
before the scans so the CVE-DB freshness window the re-scan exists for
is actually exercised.
6. **`cosign attest` re-runs accumulated attestation layers** (verified via
cosign 2.x `mutate.go` `dedupeAndReplace`). Added `--replace` to both
attest calls (SLSA provenance + SBOM). Sigstore spec allows multi-sig
so `cosign sign` is left as-is.
7. **SLSA provenance values inherited from old code were misleading.**
- `builder.id`: changed from `https://github.com/actions/runner` (the
agent binary) to the workflow ref the build is actually defined in
(per SLSA v0.2 spec — builder.id should be a verifiable trust root).
- `completeness.{parameters,environment,materials}`: flipped from
`true` to `false`. The predicate captures no workflow_call inputs,
no environment, and the build does network I/O — claiming
completeness was a public signed false statement.
- `buildInvocationId`: now includes `${run_id}-${run_attempt}` so
re-runs are distinguishable.
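The provenance changes in item 7 can be sketched as a small builder for the affected predicate fields. This is a minimal illustration, not the workflow's actual step: the function name `build_provenance_fields`, its parameters, and the sample values are hypothetical, and the field layout assumes the SLSA v0.2 predicate shape.

```python
def build_provenance_fields(builder_id: str, run_id: str, run_attempt: str) -> dict:
    """Return only the SLSA v0.2 predicate fields touched by this fix.

    builder_id is passed in by the caller (hypothetical parameter); the
    workflow decides which workflow ref to use.
    """
    return {
        "builder": {"id": builder_id},
        "metadata": {
            # run_id alone is shared by every re-run attempt, so the attempt
            # number is appended to keep re-runs distinguishable.
            "buildInvocationId": f"{run_id}-{run_attempt}",
            # The predicate records no inputs or environment, and the build
            # does network I/O, so none of these can honestly be True.
            "completeness": {
                "parameters": False,
                "environment": False,
                "materials": False,
            },
        },
    }
```

For example, two attempts of the same run (sample IDs) yield `123-1` and `123-2`, so a verifier reading the raw JSON can tell them apart.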
REFUTED (kept as-is, with confidence):
- `imagetools create` does NOT change the digest in this case. Buildx's
Combine() in util/imagetools/create.go has an explicit short-circuit
for single-source manifest-list inputs that returns the bytes
byte-for-byte (no annotations + same registry required, both true here).
- Concurrent rejection digest collision is not a real concern — Docker
builds in this pipeline are not bit-deterministic (apt, network, file
timestamps, default provenance attestations all vary).
- The `prerelease-v1.6.9-*` cleanup pattern does NOT collide with
`prerelease-v1.6.91-*` (trailing dash in the prefix disambiguates).
- Reusable-workflow approval prompts appear inline on the caller run
page for single-level calls — not a UX regression.
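The prefix-collision point above comes down to plain prefix matching with the trailing dash included. A minimal sketch, assuming the cleanup loop filters tags by prefix; the function name and the tag suffixes in the usage note are hypothetical.

```python
from typing import List


def prerelease_tags_to_delete(all_tags: List[str], version: str) -> List[str]:
    """Select the prerelease tags that belong to exactly one version."""
    # The trailing dash is what keeps 1.6.9 from matching 1.6.91.
    prefix = f"prerelease-v{version}-"
    return [tag for tag in all_tags if tag.startswith(prefix)]
```

With tags `prerelease-v1.6.9-abc` and `prerelease-v1.6.91-def` (made-up suffixes), cleaning up version `1.6.9` selects only the first one.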
* ci(release): revert most Round 2 review additions
Keep the build-once-promote refactor's structural shape but back out the
defensive additions from commit 68606b299:
- approval-gate sentinel job → revert to `environment: prerelease` on
build-amd64 only
- SLSA builder.id, completeness flags, buildInvocationId → revert to
inherited values from the previous docker-publish.yml
- `cosign attest --replace` → drop, accept default append behavior
- Pre-promote Trivy + multi-platform scans + db refresh + pre-promote
cosign verify → revert to single post-promote scan and post-promote
cosign verify
- cleanup-on-rejection if-condition → drop the
`prerelease-docker.result == 'failure'` allowlist clause
Rationale: keep the change set minimal vs main. The defensive additions
were correct in isolation but expand the scope of this PR.
* fix(ci): drop invalid --trivyignores flag from raw trivy CLI invocation
The Round 2 promote step used `--trivyignores .trivyignore`, which is the
INPUT name of the aquasecurity/trivy-action wrapper, not a flag of the
raw Trivy binary. The CLI accepts only `--ignorefile` (singular) and
auto-loads `.trivyignore` from cwd by default.
As written, every release run would hard-fail with `unknown flag:
--trivyignores` from cobra/pflag before any scanning occurred. Removing
the flag is sufficient — Trivy auto-loads the ignorefile from the
checkout root.
prerelease-docker.yml is unaffected: it uses the action wrapper with
`trivyignores: '.trivyignore'` as input, which IS correct usage for the
action layer (it translates to --ignorefile internally via
TRIVY_IGNOREFILE).
Sources:
- https://trivy.dev/latest/docs/references/configuration/cli/trivy_image/
- https://github.com/aquasecurity/trivy-action/blob/master/action.yaml
* ci(release): apply remaining bugs from multi-round review
After Round 4 verification confirmed several deferred findings, this commit
applies the bug fixes the user explicitly requested:
1. Re-introduce the `approval-gate` sentinel job in prerelease-docker.yml.
GitHub Actions environments are job-scoped, so without a gate sentinel
`build-arm64` and `security-scan` would run pre-approval — pushing
the `-arm64` per-arch tag and consuming Trivy minutes regardless of
whether the maintainer approved or rejected the gate. Single approval
click still gates everything via `needs: [approval-gate]`.
2. Fix the SLSA `builder.id` to use `${{ github.workflow_ref }}` instead
of the inherited `https://github.com/actions/runner` agent identity.
`workflow_ref` resolves to the canonical
`<owner>/<repo>/.github/workflows/<file>.yml@<callee-ref>` format that
matches slsa-github-generator's output and that verifier policies can
pin against.
3. Flip SLSA `completeness.{parameters,environment,materials}` from
`true` to `false`. The predicate captures no workflow_call inputs, no
environment, and the build does network I/O — claiming completeness
was a public signed false statement.
4. Add `${{ github.run_attempt }}` to the SLSA `buildInvocationId` so
"Re-run failed jobs" attempts are distinguishable.
5. Expand `cleanup-on-rejection` `if:` to include
`prerelease-docker.result == 'failure'` and `'cancelled'`. Without
these clauses, the most common rejection path (env approval rejected
for prerelease) leaves dependents `skipped`, which the existing
allowlist doesn't match — orphan tags persist on Docker Hub forever.
6. Drop unused `packages: write` from both the called workflow and the
caller's reusable-workflow block. Docker Hub auth uses
DOCKER_PASSWORD, not GITHUB_TOKEN; `packages: write` only matters for
ghcr.io which the project doesn't use.
7. Update `docs/CI_CD_INFRASTRUCTURE.md` Build & Deploy table to reflect
the build-once-promote split.
8. Update `docs/RELEASE_GUIDE.md` "Automatic Publishing" section to
describe both approval gates (`prerelease` and `release`).
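The `cleanup-on-rejection` condition from item 5 can be modelled in a few lines. This is a sketch of the decision logic only, not the workflow expression itself; the function names are hypothetical, and the result strings are the standard GitHub Actions job results.

```python
REJECTED = {"failure", "cancelled"}


def should_cleanup(prerelease_docker: str, create_release: str,
                   trigger_workflows: str) -> bool:
    """Allowlist: fire only when some watched job failed or was cancelled."""
    watched = (prerelease_docker, create_release, trigger_workflows)
    return any(result in REJECTED for result in watched)


def should_cleanup_denylist(prerelease_docker: str, create_release: str,
                            trigger_workflows: str) -> bool:
    """Denylist variant, shown for contrast: also fires on `skipped`."""
    watched = (prerelease_docker, create_release, trigger_workflows)
    return any(result != "success" for result in watched)
```

When the prerelease approval is rejected, the dependents are `skipped`, so only the `prerelease_docker` clause makes the allowlist match. When the whole run short-circuits and every job is `skipped`, the allowlist stays quiet while the denylist would fire.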
* ci(release): R5/R6 review fixes — cosign pin, multi-arch SBOM, orphan SBOM
Round 5 (10 agents) and Round 6 (5 agents debunking) verified these
findings, all of which are now applied:
1. **Pin cosign to v2.6.0**. R6A2 verified that `sigstore/cosign-installer@v4.1.2`
ships cosign v3.0.6 by default. cosign v3 enables `--new-bundle-format`
ON BY DEFAULT, which changes the on-wire signature/attestation format.
Sign and verify still agree in-pipeline (both run on v3),
but downstream verifiers running the README cosign-verify recipe on v2
would fail. Pinning all three cosign-installer steps to v2.6.0 keeps
the legacy tag-based sigstore format until we deliberately migrate
the entire ecosystem.
2. **Multi-arch SBOM via per-arch attestations**. R6A3 verified the claim
(anchore/syft#1708, actions/attest-sbom#60): syft against a manifest
list digest only scans the host platform's layers. The previous SBOM
attestation against the manifest digest claimed to describe both
amd64 + arm64 but actually only enumerated amd64. ARM64 consumers
were verifying a misleading SBOM. Fix: iterate over manifest entries
from `imagetools inspect --raw`, run `syft --platform <plat>` against
each per-arch digest, and `cosign attest --replace --type spdxjson`
each per-arch SBOM against the per-arch digest. ALSO keep a
manifest-list-level SBOM (host arch only) so end-users running
`cosign verify-attestation user/img:latest` don't get an empty result.
3. **Re-add `--replace` to cosign attest** (both SLSA and SPDX). R5A7's
deeper analysis enumerated specific failure modes beyond cosmetic
clutter: Kyverno `count: 1` policies, registry layer count caps,
audit ambiguity (verify returns success on first matching layer),
Rekor entry bloat. R3A5 already confirmed `--replace` is per-
predicate-type, so SLSA and SPDX attestations don't disturb each
other.
4. **Container-image SBOM no longer orphaned**. R6A4 verified: the
Syft-produced container SBOMs were uploaded as artifact `sbom` from
prerelease-docker.yml but never downloaded by `create-release` — they
were invisible on the GitHub Release page. Fix: download the `sbom`
artifact, rename to `sbom-container-*` to disambiguate from the
filesystem `sbom-spdx.json`, and attach to `gh release create`.
5. **Narrow `secrets: inherit` to explicit secrets**. R5A3 flagged that
`secrets: inherit` propagates ALL repo secrets (PAT_TOKEN,
OPENROUTER_API_KEY, SERPER_API_KEY, GITHUB_TOKEN) into a workflow
that only needs Docker Hub creds. Replaced with explicit
`DOCKER_USERNAME` + `DOCKER_PASSWORD` mapping; the called workflow
now declares these as required `workflow_call.secrets`.
6. **Drop unused `DEPS_HASH` build-arg**. R5A2 confirmed it was declared
in the Dockerfile but never referenced in any RUN/COPY, so it never
busted the Docker layer cache. Cache invalidation already happens
correctly via `COPY pdm.lock` (file content hash). Removed the ARG
declaration from Dockerfile and the three `build-args:` passes from
prerelease-docker.yml.
R6 also REFUTED two earlier claims:
- R5A8's concurrency claim: reusable workflows DO share the caller's
workflow run and concurrency group (R3A8 was correct). Don't add a
`concurrency:` block to prerelease-docker.yml — would create a
separate group and re-introduce the race R5A8 imagined.
- R5A10's harden-runner CVE claim: v2.19.1 (used here) is well after
the fix versions for both CVE-2026-32946 (v2.16.0) and CVE-2026-25598
(v2.14.2). No bump needed.
* ci(release): R7 fixes — cosign v2.6.3, drop misleading manifest-level SBOM
Round 7 (5 agents) verified the R5/R6 fixes and surfaced two real bugs:
1. **cosign-installer pinned cosign v2.6.0**, which has two known security
advisories: GHSA-whqx-f9j3-ch6m (fixed in v2.6.2) and GHSA-w6c6-c85g-mmv6
(fixed in v2.6.3). Bumped pin to v2.6.3 in all three workflow files so
the install step picks up the fixes. Same minor (v2.6.x), so no flag
drift — `--replace`, `--type`, `--bundle`, `--certificate-*` all behave
identically.
2. **The manifest-level SBOM attestation was misleading**. The previous
step ran `syft <repo>@<manifest-list-digest>` on an amd64 runner,
which (per anchore/syft#1708) only enumerates amd64 layers. The SBOM
was then attested at the manifest-list digest where it was discoverable
by ALL platform consumers — so an arm64 user verifying `:latest` would
receive a signed SBOM that lies about the layers they actually pulled.
The per-arch loop already produces accurate per-platform SBOMs; the
manifest-level fallback only re-introduced the lie for UX convenience.
Dropped the manifest-level attest call entirely. Per-arch SBOMs are the
only honest representation. Updated the README's `cosign
verify-attestation` recipe to resolve to the per-platform digest first
(using `jq` over `imagetools inspect --raw`), so end-users on either
architecture get the SBOM that actually describes what they pulled.
Removed `sbom.spdx.json` from the workflow artifact + release-staging
logic since it no longer exists.
3. **Empty-loop assertion**: added a defensive count check before the
per-arch SBOM loop. If a future buildx output change ever produced
zero per-arch entries (e.g., all entries marked architecture: unknown),
the previous code would silently skip the loop and pass CI green with
no SBOMs. Now it fails loud with the raw manifest dumped for debugging.
Note on round-7 reviewer's other concerns:
- "Pipe-to-while subshell scope": confirmed safe. set -euo pipefail
inherited; failures in syft/cosign attest abort the subshell, and
pipefail propagates to the outer step.
- "imagetools inspect --raw stability": the OCI image-index spec has been
stable for ~7 years. The jq filter handles the BuildKit attestation
pseudo-entries via `architecture != "unknown"`.
- "harden-runner v2.19.1 CVEs": false alarm. v2.19.1 is well above the
fix versions (v2.16.0, v2.14.2). No bump needed.
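The per-arch discovery and the empty-loop assertion described in this commit can be sketched as follows. The workflow itself uses `jq` in a shell step; this Python version only illustrates the same filter, and the function name and any sample digests are hypothetical.

```python
import json
from typing import List, Tuple


def per_arch_entries(raw_index: str) -> List[Tuple[str, str]]:
    """Return (platform, digest) pairs from an OCI image index.

    raw_index is the JSON printed by `imagetools inspect --raw`.
    """
    index = json.loads(raw_index)
    entries = []
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        arch = platform.get("architecture", "unknown")
        if arch == "unknown":
            # BuildKit attestation pseudo-entry, not a runnable image.
            continue
        # Platform variants (e.g. arm64 v8) are ignored in this sketch.
        entries.append((f"{platform.get('os', 'unknown')}/{arch}", manifest["digest"]))
    if not entries:
        # Fail loudly instead of silently producing zero SBOMs.
        raise RuntimeError(f"no per-arch manifests found in index: {raw_index}")
    return entries
```

Each returned pair would then feed one `syft --platform <plat>` run and one `cosign attest` call against the per-arch digest.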
* ci(release): R8 fixes from 8th review round
Round 8 (5 agents covering Dockerfile, npm/Vite, runtime image, edge
cases, and post-fix smoke check) surfaced 7 real bugs the previous 7
rounds missed. All fixed here, plus a comment per user request.
1. **docker-publish.yml checkout pinned to released tag**. The promote
step reads `.trivyignore` from cwd; a `repository_dispatch`-triggered
checkout defaults to the default branch's tip, which can drift between
prerelease scan and promote scan if `.trivyignore` is edited on main
while the release awaits approval. Added `ref: ${{
github.event.client_payload.tag }}` to checkout.
2. **docker-publish.yml concurrency block added**. release.yml has its
own concurrency, but docker-publish.yml is a separate workflow run.
Two near-simultaneous publish-docker dispatches for the same release
tag (e.g., a manual re-trigger after a transient Docker Hub 5xx) could
interleave and have their cleanup-loop prefix-match deletions race
each other. Group: `publish-docker-${{ github.event.client_payload.tag
}}`, cancel-in-progress: false.
3. **publish.yml's frontend builder bumped from Node 20 → 24** to match
`package.json`'s `engines: { node: ">=24.0.0" }`. Mismatched Node
versions across the PyPI build (Node 20) and the Docker image (Node
24, installed via NodeSource) could resolve transitive deps differently
and ship frontend assets that fail at runtime. Pinned to a specific
`node:24-alpine` SHA.
4. **HEALTHCHECK no longer leaks Python processes**. The old
`urllib.request.urlopen(...)` had no Python-level timeout, so a
hung-but-alive backend would freeze the probe until Docker's outer
timeout SIGKILL'd it — leaving a Python process per probe interval
leaking PIDs/FDs over time. Added `timeout=5` and an explicit `r.status
== 200` check so non-200 2xx responses (e.g., from misconfigured
proxies) don't pass.
5. **Removed broken `VOLUME /scripts/`**. /scripts is image content (the
ollama entrypoint baked in by the layer below the VOLUME directive),
not user state. A VOLUME on an image-populated path causes anonymous-
volume accumulation on every `docker run` and silently shadows the
script if a user ever bind-mounts it.
6. **Added `VOLUME /data`** so users who don't bind-mount don't silently
lose research data + encrypted DBs on `docker rm`. The entrypoint
creates the persistent state at /data/{logs,cache,encrypted_databases},
but without VOLUME the directory is part of the writable container layer.
7. **Stale comment in release.yml** (the SBOM download step) updated —
no longer mentions the manifest-level SBOM that was dropped in
commit 33d69b4e4.
Plus one comment update per user request:
8. **`apt-get upgrade -y` rationale comment** added at the
build-once-promote section of the Dockerfile (top stage), and
cross-referenced from the other two `apt-get upgrade` sites
(ldr-test stage and runtime stage). Documents that the trade-off of
bit-for-bit reproducibility for always-fresh CVE patches is
intentional, and explains how build-once-promote mitigates the
reproducibility loss.
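The HEALTHCHECK fix in item 4 can be sketched as a small probe. This is an illustration under assumptions, not the Dockerfile's actual one-liner: the function names `is_healthy` and `probe` are hypothetical, and the endpoint URL is left to the caller because the commit does not name it.

```python
import http.client
import urllib.request


def is_healthy(status: int) -> bool:
    """Only an exact 200 passes; other 2xx codes (e.g. 204) do not."""
    return status == 200


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the backend answers 200 before the socket timeout."""
    try:
        # timeout bounds each blocking socket operation, so a hung backend
        # raises here instead of stalling until Docker kills the probe.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status)
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

A HEALTHCHECK command would exit 0 when `probe(...)` returns True and 1 otherwise.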
* ci(release): clean up per-arch cosign attestation orphans on rejection
Round 9 found that the per-arch SBOM attestations introduced in commit
11e702f7d (the multi-arch SBOM fix) live at
`sha256-<per-arch-digest>.{sig,att,sbom}` keyed by the PER-ARCH manifest
digests, not the manifest-list digest. The cleanup-on-rejection job only
knew the manifest-list digest, so on rejection paths the per-arch
attestation artifacts were left orphaned on Docker Hub forever — and
unreachable through any tag, since the per-arch leaf tags were also
deleted.
Fix: before deleting the manifest tag, inspect it via `imagetools inspect
--raw` to discover the per-arch digests, then queue per-arch
`{sig,att,sbom}` deletions alongside the manifest-level cleanup. If the
manifest tag doesn't exist (e.g., build failed before manifest creation),
log a clear warning and proceed — the per-arch artifacts wouldn't have
been created in that case anyway.
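The deletion queue described above can be sketched like this. It is a minimal model of the bookkeeping only (the real job calls the Docker Hub API from a shell step); the function names are hypothetical, and `per_arch_digests=None` stands in for the case where the manifest tag could not be inspected.

```python
import logging
from typing import List, Optional, Tuple


def cosign_tags(digest: str, suffixes: Tuple[str, ...]) -> List[str]:
    """Map `sha256:<hex>` to cosign's `sha256-<hex>.<suffix>` tag names."""
    algo, _, hexdigest = digest.partition(":")
    return [f"{algo}-{hexdigest}.{suffix}" for suffix in suffixes]


def cleanup_queue(manifest_digest: str,
                  per_arch_digests: Optional[List[str]]) -> List[str]:
    """Build the list of cosign artifact tags to delete on rejection."""
    queue = cosign_tags(manifest_digest, ("sig", "att"))
    if per_arch_digests is None:
        # Manifest tag missing: the per-arch artifacts were never created.
        logging.warning("manifest tag not found; skipping per-arch cleanup")
        return queue
    for digest in per_arch_digests:
        queue.extend(cosign_tags(digest, ("sig", "att", "sbom")))
    return queue
```

The per-arch digests have to be collected before the manifest tag is deleted, because afterwards nothing in the registry points at them any more.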
* ci(release): drop prerelease env gate — use single release approval
The `prerelease` environment approval was a holdover from when prerelease
docker was a SEPARATE test build alongside the release build (two
distinct artifacts, two distinct decisions). In the build-once-promote
model the "prerelease" image IS the release image (just under a
different tag), so gating the BUILD with a human approval is redundant —
the only meaningful decision is whether the tested image becomes the
official release.
Changes:
- Remove the `approval-gate` sentinel job in prerelease-docker.yml.
- Drop `needs: [approval-gate]` from build-amd64, build-arm64, and
security-scan. They now run automatically once release.yml's security
+ CI gates pass.
- Update workflow comments in release.yml and prerelease-docker.yml to
reflect the single-gate flow.
- Update RELEASE_GUIDE.md "Approval and Publishing" section: now
describes ONE `release` env approval, not two.
- Update CI_CD_INFRASTRUCTURE.md row for prerelease-docker.yml.
The cleanup-on-rejection job is unchanged — its triggers still fire
correctly on prerelease-docker `failure`/`cancelled` (build/sign/attest
errors) and on create-release / trigger-workflows `failure`/`cancelled`
(release env rejection). One fewer rejection path to consider, but the
mechanism is the same.
Operational benefits:
- One fewer approval click per release
- One fewer GitHub Environment to create as a pre-merge setup step
(no more "create the `prerelease` env in Settings before merging")
- Build completes during/after security gates, so the prerelease tag is
ready by the time the maintainer is ready to test
* ci(docker-publish): group GITHUB_OUTPUT writes (shellcheck SC2129)
CI's actionlint hook (which runs shellcheck on workflow run blocks)
flagged the 'Determine release tags' step for issuing five sequential
`echo ... >> "$GITHUB_OUTPUT"` redirects. Grouped them into a single
braced block + one redirect, per SC2129's recommendation.
* docs(release): correct approval flow after env-scoped secrets merge
After merging main, prerelease-docker.yml's four jobs declare
`environment: release` (PRs #3978/#3983) because DOCKER_USERNAME and
DOCKER_PASSWORD are env-scoped. That means the first `release` env
approval now gates the canonical build, not just the publish step —
the "automatic build then test then approve" flow described in earlier
docs no longer matches reality.
- RELEASE_GUIDE.md: rewrite the approval section to describe two
release-env approvals (release.yml + docker-publish.yml) and the
narrow Docker-only test window between them.
- CI_CD_INFRASTRUCTURE.md: update the prerelease-docker.yml row.
- release.yml: rewrite the `prerelease-docker:` job comment to reflect
that this step is gated, not automatic, and explain why.
* ci(release): atomic publish ordering — GitHub Release runs last (#4044)
* ci(release): make GitHub Release publishing atomic with Docker + PyPI
Before this change, `create-release` published the public GitHub Release
BEFORE `docker-publish.yml` retagged and BEFORE `publish.yml` shipped to
PyPI. If either downstream failed, the public Release pointed at
non-existent artifacts.
This change closes that window:
- Convert `docker-publish.yml` from `repository_dispatch` to
`workflow_call`. Its result is now visible to release.yml as
`needs.publish-docker.result`, which lets:
* `create-release` block on Docker promote success
* `cleanup-on-rejection` safely scope cosign artifact deletion to
cases where retag failed (after a successful retag, release tags
share the prerelease manifest digest, so cosign artifacts must
stay — deleting them would invalidate release-tag verification)
- Keep `publish.yml` on `repository_dispatch`. PyPI Trusted Publishing
matches the OIDC `workflow_ref` claim against the CALLER when invoked
via `workflow_call`, so a reusable publish.yml would fail with
`invalid-publisher`. Tracked in pypa/gh-action-pypi-publish#166 and
pypi/warehouse#11096.
- Restructure release.yml job graph:
prerelease-docker → publish-docker (reusable) → trigger-pypi
→ monitor-pypi → create-release (LAST)
- Rewrite `cleanup-on-rejection` with a partial-retag rollback preamble.
`imagetools create -t :VERSION -t :MAJOR_MINOR -t :latest` is a single
process with multiple registry calls, so a mid-step failure can leave
some release tags landed. The cleanup script now checks each release
tag against Docker Hub and rolls back any that exist BEFORE deleting
cosign signature/attestation artifacts.
- Slim `monitor-publish` → `monitor-pypi` (only watches publish.yml now;
Docker is tracked natively via the inline job result).
- Drop the workflow-level `concurrency:` block from docker-publish.yml.
As a reusable workflow it shares release.yml's run, and release.yml's
caller-level concurrency on `github.ref` already serialises releases
for the same tag.
- Update `docs/CI_CD_INFRASTRUCTURE.md` workflow-table rows and
`docs/RELEASE_GUIDE.md` approval-flow section to describe the new
ordering, plus a "Recovery from PyPI failure" section documenting the
one remaining atomicity hole (PyPI fails after Docker success — Docker
release tags exist, no PyPI, no GH Release; manual re-dispatch needed).
Plan + 5-agent Round 1 review notes saved separately.
* fix(release): plug blockers found in multi-round PR review
Four fixes against the atomicity refactor — two blockers that would
break the next release, two hardening items found while verifying them.
B1 (BLOCKING): docker-publish.yml checked out at `ref: inputs.tag`
(e.g. v1.6.11), but the v* git tag is created by `create-release`
which runs LAST in the job graph — after `publish-docker`. So on every
push-to-main triggered release (the documented primary path) the
checkout would fail with `fatal: couldn't find remote ref v1.6.11`.
Switch to `ref: github.sha`: same triggering commit the build and
prerelease-docker jobs used, exists at the moment publish-docker
runs for every event type, and still satisfies the original goal
of pinning .trivyignore to the scanned commit.
B2 (BLOCKING): cleanup-on-rejection referenced env-scoped
DOCKER_USERNAME / DOCKER_PASSWORD but had no `environment: release`,
so those secrets resolved to empty strings and the Docker Hub login
exited 1 — leaving the orphan tags + cosign artifacts the cleanup
was meant to remove. Add `environment: release`. The `release` env
approval was already granted upstream in the run, so no new prompt.
H1: monitor-pypi's `Wait for PyPI publish workflow to complete` step
piped `gh run list | jq ...` without `set -euo pipefail`, so a
transient gh failure (network, auth, rate limit) was swallowed: jq
succeeded on the empty input, burning the full 40-minute budget on a
silent error rather than failing fast. Add `set -euo pipefail`.
H2: cleanup-on-rejection's step 2 did not delete the floating
`:prerelease` tag. If a release was rejected after prerelease-docker
re-pointed `:prerelease`, step 4 deleted the cosign signature for
that manifest while `:prerelease` still pointed at it — yielding a
window where pulling `:prerelease` returns an image the README
cosign-verify recipe cannot verify. Include `prerelease` in step 2's
delete loop; the next successful prerelease-docker re-creates it.
* chore(release): follow-up cleanups from PR review
Bundle of low-risk follow-ups from the multi-round review of this PR.
All same-scope as the atomicity refactor — staleness this PR introduced
in docs/comments, hardening adjacent to the changed code paths.
L1 (hardening): Drop `id-token: write` from `publish-docker` (caller)
and `docker-publish.yml` `promote` (callee). cosign VERIFY is a
read-only check against public Rekor/Fulcio; no GitHub OIDC token is
minted, so the permission is unused. Signing (which DOES need the
write) is exclusively in prerelease-docker.yml.
L7 (stale comments): prerelease-docker.yml's header comments still
referenced `trigger-workflows` — a job this PR split into
`publish-docker` + `trigger-pypi`. Replaced both occurrences.
L4 (doc): RELEASE_GUIDE.md "Emergency Procedures" claimed a manual
GitHub release "still triggers PyPI/Docker" — false under the new
design (publish.yml is repository_dispatch-only and docker-publish.yml
is workflow_call-only, neither listens on `release:` events). Replaced
with the actual recovery hierarchy.
L5 (doc): RELEASE_GUIDE.md and CI_CD_INFRASTRUCTURE.md pipeline chains
omitted the `provenance` job between `build` and `prerelease-docker`.
L6 (doc): RELEASE_GUIDE.md described monitor-pypi's timeout as a flat
"40 min" — the inner poll loop is 40 min but the surrounding
`timeout-minutes:` is 90 min, so the user-facing failure surface differs.
L4-bonus (doc): Manual-trigger section also claimed workflow_dispatch
takes "version and prerelease flag" inputs — release.yml's
`workflow_dispatch:` has no inputs defined. Replaced with the actual
behavior (reads __version__.py at HEAD; use tag-push for older versions).
M5 (doc): Both PAT_TOKEN comments overstated required scopes — claimed
`workflow` scope was needed (it isn't; it only governs editing
.github/workflows/ via the API) and didn't make explicit that
`public_repo` is rejected by `repository_dispatch`. Rewritten.
M8 (correctness): docker-publish.yml's cosign verify step targeted the
mutable `:VERSION` tag instead of `@${EXPECTED_DIGEST}`. The preceding
verify-promoted-tags step already confirms the tag resolves to the
expected digest, but using the tag here leaves a tag-resolution TOCTOU
window between the two steps. Trivy's re-scan already uses
`@${EXPECTED_DIGEST}`; switching cosign to the same reference is
consistent and race-free.
L2 (style): While editing the cosign step, routed `github.repository`
through an `env:` var (`REPO`) instead of direct `${{ }}` template
interpolation into shell args, matching the convention in the rest of
this workflow.
* chore(ci): bump harden-runner pin in docker-publish.yml to match other workflows
Last remaining v2.19.1 reference — every other workflow in this PR was
bumped to v2.19.3 when main moved forward. Auto-merge missed this one
because the surrounding hunk was in a conflict region.
* chore(release): fixes from multi-round subagent review of the full PR
Bundle of low-risk fixes confirmed by 30 subagents across 3 rounds.
None are blockers; all are worth fixing in-scope.
1. SLSA provenance builder.id: was github.workflow_ref, which inside a
workflow_call callee resolves to the CALLER (release.yml), not the
intended callee (prerelease-docker.yml). The Fulcio cert is still
right (built from the job_workflow_ref OIDC claim), so cosign verify
and slsa-verifier are unaffected, but raw-JSON consumers reading
builder.id would see release.yml. Compose the value from
github.repository + hardcoded path + github.ref instead — the `job`
context has no workflow_ref property (actionlint confirms), and for
a local-path workflow_call the callee's ref equals github.ref.
2. Dockerfile: set ENV LDR_DATA_DIR=/data so the VOLUME /data directive
is actually load-bearing. Without it, paths.py falls back to
platformdirs (~/.local/share/local-deep-research) which is inside the
ephemeral container layer — bare docker run -v vol:/data users would
silently lose data on docker rm.
3. trigger-pypi: forward prerelease=false in client_payload. publish.yml
gates Test PyPI vs prod PyPI on client_payload.prerelease == true; if
absent, the expression evaluates to '' and falls through to prod. Set
false explicitly to remove the silent-fallback landmine.
4. Stale/misleading cosign comments in release.yml:
- line 322: said "v2.6.0" while value is "v2.6.3" — corrected and
noted GHSA-w6c6-c85g-mmv6 patch coverage
- line 332: attributed --bundle to v3.0.2+ but it's been in v2.4.0+
5. release-gate.yml Node 20 → 24 (mirror publish.yml + Dockerfile).
package.json declares engines.node >=24.0.0. The pip-install-check
wheel is discarded so this was not a release-blocker, but the gate
now validates the actual ship runtime.
6. README cosign-verify recipe:
- Guard empty PLATFORM_DIGEST with a clear message for single-arch
or pre-build-once-promote releases
- Add docker buildx to prerequisites list
- Spell out the legacy-verification substitution explicitly
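Item 2 above hinges on how the data directory is resolved. The sketch below is an assumption about the shape of that lookup, not the actual code in paths.py: `resolve_data_dir` is a hypothetical name, the platformdirs call is replaced by the hardcoded Linux path quoted in the commit message, and treating an empty value as unset is also an assumption.

```python
import os
from pathlib import Path

# Fallback location named in the commit message (platformdirs on Linux).
FALLBACK = Path("~/.local/share/local-deep-research")


def resolve_data_dir(env=None) -> Path:
    """Return LDR_DATA_DIR if set and non-empty, else the per-user fallback.

    Hypothetical stand-in for the lookup in paths.py; the real code uses
    platformdirs, which is replaced here by a hardcoded path.
    """
    env = os.environ if env is None else env
    override = env.get("LDR_DATA_DIR")
    if override:
        return Path(override)
    return FALLBACK.expanduser()


if __name__ == "__main__":
    # With ENV LDR_DATA_DIR=/data in the Dockerfile, writes land on the volume.
    print(resolve_data_dir({"LDR_DATA_DIR": "/data"}))
    # Without it, writes land under the home directory, inside the container layer.
    print(resolve_data_dir({}))
```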
* fix(ci): pin Trivy in promote step via SHA-pinned action wrapper
AI reviewer flagged docker-publish.yml's promote step as installing Trivy
via `sudo apt-get install -y trivy` with no version pin, reintroducing a
supply-chain risk to the release path. The prerelease scan in
prerelease-docker.yml uses the SHA-pinned
aquasecurity/trivy-action@ed142fd... wrapper with `version: 'v0.69.2'`,
but the promote step
switched to the bare CLI and lost that protection.
Replace the apt-get install + raw `trivy image` invocation with the same
pinned action wrapper. Same scan semantics (CRITICAL,HIGH, ignore-unfixed,
.trivyignore, exit-code 1), same binary version (v0.69.2), same action
SHA — keeps the two scans consistent and removes the unpinned apt path.
* fix(ci): pin Trivy in release.yml build job — same fix as docker-publish.yml
R4 review caught that the AI-reviewer-flagged unpinned Trivy install also
exists in release.yml's `build` job, and is STRICTLY WORSE there because
that job carries `id-token: write` (for cosign keyless signing of SBOMs).
The attack chain that was open:
1. Aqua apt-repo compromise OR MITM of the unpinned GPG-key fetch
2. Malicious `trivy fs` binary installed
3. Binary exfiltrates ACTIONS_ID_TOKEN_REQUEST_URL/TOKEN env vars,
minting an OIDC token under repo:LearningCircuit/local-deep-research
4. Binary tampers with sbom-spdx.json / sbom-cyclonedx.json contents
5. Next step `Sign release artifacts with Sigstore` cosign-signs the
tampered SBOM with a legitimate Sigstore cert → fraudulent SBOM
attached to the GitHub release with valid signature
Replace with the SHA-pinned aquasecurity/trivy-action@ed142fd0... (same
pin as docker-publish.yml and prerelease-docker.yml) using scan-type=fs
for the filesystem scan, with `version: 'v0.69.2'` to pin the binary
itself. Two separate action invocations (one per output format) because
the action takes a single format per run.
Also removes the unpinned `gpg --dearmor` of an unverified-fingerprint
public key, which the prior comment misleadingly called "secure".
* fix(ci): use TRIVY_USERNAME/PASSWORD env vars for trivy-action auth
The trivy-action README prescribes TRIVY_USERNAME/TRIVY_PASSWORD env
vars as the supported Docker Hub auth path. Even though
docker/login-action already wrote ~/.docker/config.json earlier in the
job (and Trivy
reads it as a fallback), there's documented fragility with docker.io
credential helpers (aquasecurity/trivy#432, aquasecurity/trivy#8385)
that surfaces specifically on registry-pull scans like this one (unlike
the prerelease scan which uses a locally-loaded image).
The fallback would probably work today since
localdeepresearch/local-deep-research is public (an anonymous pull would
succeed even without auth), but rate-limiting on anonymous Docker Hub
pulls is
aggressive and the documented credential-helper quirks are real. Adding
the env vars uses the action's prescribed auth path, with the same
DOCKER_USERNAME/DOCKER_PASSWORD secrets already passed in via
workflow_call. Zero-cost defense-in-depth.
CI/CD and Infrastructure Documentation
This document describes the continuous integration, security scanning, and development infrastructure used by the Local Deep Research project.
Overview
The project uses many GitHub Actions workflows and 20+ pre-commit hooks to ensure code quality, security, and reliability.
At-a-glance health: see `docs/ci/workflow-status.md`, an auto-generated dashboard with live badges for every workflow, surfacing disabled, manual-only, and stale (silently-failing) ones at the top. Regenerate with `pdm run python scripts/generate_workflow_status.py`.
┌─────────────────────────────────────────────────────────────────┐
│ Developer Workflow │
├─────────────────────────────────────────────────────────────────┤
│ Local Development │ Pull Request │ Main/Dev │
│ ───────────────── │ ──────────── │ ──────── │
│ • Pre-commit hooks │ • All tests │ • Deploy │
│ • Unit tests │ • Security scans │ • Publish │
│ • Linting │ • Code review │ • Release │
└─────────────────────────────────────────────────────────────────┘
Pre-Commit Hooks
Pre-commit hooks run locally before each commit. Install with:
pre-commit install
pre-commit install-hooks
Standard Hooks
| Hook | Purpose |
|---|---|
| `check-yaml` | Validate YAML syntax |
| `end-of-file-fixer` | Ensure files end with newline |
| `trailing-whitespace` | Remove trailing whitespace |
| `check-added-large-files` | Block files >1MB |
| `check-case-conflict` | Prevent case-sensitivity issues |
| `forbid-new-submodules` | Prevent git submodules |
Security Hooks
| Hook | Purpose |
|---|---|
| `gitleaks` | Detect secrets, API keys, passwords in code |
| `check-sensitive-logging` | Prevent logging of passwords, tokens, keys |
| `check-safe-requests` | Enforce SSRF-safe HTTP functions (`safe_get`, `safe_post`) |
| `check-url-security` | Validate URL handling in JavaScript (XSS prevention) |
| `file-whitelist-check` | Only allow approved file types |
| `check-image-pinning` | Require SHA256 digests for Docker images |
Code Quality Hooks
| Hook | Purpose |
|---|---|
| `ruff` | Python linter (with auto-fix) |
| `ruff-format` | Python formatter (Black-compatible) |
| `eslint` | JavaScript linter |
| `shellcheck` | Shell script linter |
| `actionlint` | GitHub Actions workflow validator |
| `custom-code-checks` | Loguru usage, UTC datetime, raw SQL detection |
Project-Specific Hooks
| Hook | Purpose |
|---|---|
| `check-env-vars` | Environment variables must use `SettingsManager` |
| `check-deprecated-db-connection` | Enforce per-user database connections |
| `check-ldr-db-usage` | Prevent shared `ldr.db` usage |
| `check-research-id-type` | `research_id` must be string/UUID, not int |
| `check-datetime-timezone` | All DateTime columns (models and migrations) must use `UtcDateTime` from `sqlalchemy_utc` |
| `check-session-context-manager` | Require context managers for DB sessions |
| `check-pathlib-usage` | Use `pathlib.Path` instead of `os.path` |
| `check-no-external-resources` | No external CDN/resource references |
| `check-css-class-prefix` | CSS classes must have `ldr-` prefix |
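To give a feel for how a project-specific hook of this kind can work, here is a minimal sketch of a `pathlib`-style check built on the standard `ast` module. It is an illustration under stated assumptions, not the project's `check-pathlib-usage` implementation: `find_os_path_uses` is a hypothetical name, and the sketch only catches the literal `os.path` attribute form (not `from os import path` or aliased imports).

```python
import ast


def find_os_path_uses(source: str) -> list[int]:
    """Return the line numbers where `os.path` is referenced in `source`."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        # Match the attribute access `os.path`, wherever it appears
        # (including as the prefix of `os.path.join(...)`).
        if (
            isinstance(node, ast.Attribute)
            and node.attr == "path"
            and isinstance(node.value, ast.Name)
            and node.value.id == "os"
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "import os\n"
        "from pathlib import Path\n"
        "legacy = os.path.join('a', 'b')\n"
        "modern = Path('a') / 'b'\n"
    )
    for lineno in find_os_path_uses(sample):
        print(f"line {lineno}: use pathlib.Path instead of os.path")
```

A real hook would typically read the file names passed on the command line and exit non-zero when any hit is found, which is how pre-commit decides to block the commit.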
GitHub Actions Workflows
Test Workflows
| Workflow | Trigger | Purpose |
|---|---|---|
| `docker-tests.yml` | PR, push | Consolidated Docker tests: pytest + coverage, UI tests (51 Puppeteer tests), LLM tests, infrastructure tests (single Docker build shared across all jobs). Includes tests previously in critical-ui-tests, extended-ui-tests, metrics-analytics-tests, library-ui-tests, mobile-ui-tests, and news-tests workflows. |
| `e2e-research-test.yml` | PR, push | End-to-end research flow |
| `fuzz.yml` | Schedule | Fuzzing tests |
Security Scanning
| Workflow | Trigger | Purpose |
|---|---|---|
| `codeql.yml` | PR, push, schedule | GitHub CodeQL analysis |
| `semgrep.yml` | PR, push | Semgrep static analysis |
| `osv-scanner.yml` | PR, push, schedule | OSV vulnerability scanning (Python + npm) |
| `gitleaks.yml` | PR, push | Secret detection |
| `security-tests.yml` | PR, push | Security-focused test suite |
| `devskim.yml` | PR, push | Microsoft DevSkim analysis |
| `checkov.yml` | PR, push | Infrastructure-as-code scanning |
| `container-security.yml` | PR, push | Container vulnerability scanning |
| `hadolint.yml` | PR, push | Dockerfile linting |
| `owasp-zap-scan.yml` | Schedule | OWASP ZAP dynamic scanning |
| `retirejs.yml` | PR, push | JavaScript vulnerability scanning |
| `zizmor-security.yml` | PR, push | Additional security checks |
| `ossf-scorecard.yml` | Schedule | OpenSSF Scorecard |
| `security-headers-validation.yml` | PR, push | HTTP security headers |
| `security-file-write-check.yml` | PR, push | File write security |
| `npm-audit.yml` | PR, push | npm audit for JS dependencies |
Dependency Management
| Workflow | Trigger | Purpose |
|---|---|---|
| `dependency-review.yml` | PR | Review dependency changes |
| `update-dependencies.yml` | Schedule | Auto-update Python deps |
| `update-npm-dependencies.yml` | Schedule | Auto-update npm deps |
| `update-precommit-hooks.yml` | Schedule | Update pre-commit hooks |
| `validate-image-pinning.yml` | PR, push | Verify Docker image pins |
UI/Accessibility
| Workflow | Trigger | Purpose |
|---|---|---|
| `responsive-ui-tests-enhanced.yml` | PR, push | Responsive design tests |
Build & Deploy
| Workflow | Trigger | Purpose |
|---|---|---|
| `prerelease-docker.yml` | `workflow_call` from `release.yml` | Canonical multi-arch Docker build, cosign sign, SBOM/SLSA attestations. Jobs declare `environment: release` so the first release env approval gates the build (env-scoped Docker Hub secrets). |
| `docker-publish.yml` | `workflow_call` from `release.yml` | Retag prerelease manifest as `:1.6.9` / `:1.6` / `:latest` (gated by release env). No rebuild: registry-side metadata only. Inlined as a reusable workflow so its result is visible to downstream jobs in `release.yml` (lets create-release block on Docker success, lets cleanup-on-rejection safely scope cosign artifact deletion). |
| `docker-multiarch-test.yml` | PR, push | Multi-architecture build test |
| `publish.yml` | `repository_dispatch` from `release.yml` | Publish to PyPI. Stays on `repository_dispatch` (not `workflow_call`) because PyPI Trusted Publishing rejects OIDC claims from reusable workflows (see pypa/gh-action-pypi-publish#166, pypi/warehouse#11096). |
| `release.yml` | Push to main, tag `v*.*.*`, manual | Orchestrate release: gates → build → provenance → prerelease-docker → publish-docker → trigger-pypi → monitor-pypi → create-release (last) |
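The retag targets listed for `docker-publish.yml` follow a simple pattern: the full version, the major.minor line, and `latest`. A minimal sketch of that derivation is shown below. How the workflow itself computes the tags is not shown in this document, so `promotion_tags` is a hypothetical helper and the strict MAJOR.MINOR.PATCH input format is an assumption.

```python
def promotion_tags(version: str) -> list[str]:
    """Return the tags a release is promoted to: X.Y.Z, X.Y, and latest."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, _patch = parts
    return [version, f"{major}.{minor}", "latest"]


if __name__ == "__main__":
    # All three tags point at the same manifest digest after promotion.
    for tag in promotion_tags("1.6.9"):
        print(f"localdeepresearch/local-deep-research:{tag}")
```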
Code Quality
| Workflow | Trigger | Purpose |
|---|---|---|
| `pre-commit.yml` | PR, push | Run pre-commit hooks in CI |
| `mypy-type-check.yml` | PR, push | Python type checking |
| `ai-code-reviewer.yml` | PR | AI-assisted code review |
| `claude-code-review.yml` | PR | Claude-based code review |
Repository Management
| Workflow | Trigger | Purpose |
|---|---|---|
| `sync-main-to-dev.yml` | Push to main | Sync main branch to dev |
| `label-fixed-in-dev.yml` | Push to dev | Auto-label fixed issues |
| `danger-zone-alert.yml` | PR | Alert on sensitive file changes |
| `check-env-vars.yml` | PR, push | Environment variable validation |
| `file-whitelist-check.yml` | PR, push | File type validation |
| `version_check.yml` | PR, push | Version consistency check |
Dependabot Configuration
Dependabot automatically creates PRs for dependency updates:
| Ecosystem | Directories | Schedule |
|---|---|---|
| Python (pip) | `/` | Weekly (Monday 04:00) |
| npm | `/`, `/tests/*` | Weekly/Daily |
| GitHub Actions | `/` | Weekly |
| Docker | `/` | Daily |
Coverage Reporting
Coverage reports are generated by the docker-tests.yml workflow (pytest-tests job):
- HTML Report: Deployed to GitHub Pages at https://learningcircuit.github.io/local-deep-research/coverage/
- PR Comments: Each PR receives a comment with coverage percentage
- Badge: Coverage badge updated via GitHub Gist
Configuration in pyproject.toml:
[tool.coverage.run]
source = ["src"]
omit = ["*/tests/*", "*/migrations/*"]
[tool.coverage.report]
exclude_lines = ["pragma: no cover", "if TYPE_CHECKING:"]
Security Architecture
Supply Chain Security
- Dependency Pinning: All GitHub Actions are pinned to full commit SHAs
- Docker Image Pinning: All base images use SHA256 digests
- Lock Files: `pdm.lock` and `package-lock.json` committed
- Vulnerability Scanning: OSV-Scanner, npm audit, RetireJS
Runtime Security
- SSRF Protection: `safe_get()`, `safe_post()`, `SafeSession` wrappers
- XSS Prevention: DOMPurify for HTML sanitization
- SQL Injection: SQLAlchemy ORM (no raw SQL)
- Secret Management: Environment variables via `SettingsManager`
Container Security
- Non-root User: Containers run as `ldruser:1000`
- Minimal Base Image: Python slim images
- Health Checks: Docker health check endpoints
- Read-only Where Possible: Minimal write permissions
Running Tests Locally
Quick Test (Unit Tests Only)
pdm run pytest tests/test_settings_manager.py tests/test_utils.py -v
Full Test Suite
pdm run pytest tests/ --ignore=tests/ui_tests --ignore=tests/fuzz -v
With Coverage
pdm run pytest tests/ --cov=src --cov-report=html -v
open coverage/htmlcov/index.html
UI Tests (Requires Server)
# Terminal 1: Start server
pdm run ldr-web
# Terminal 2: Run UI tests
cd tests/ui_tests && npm test
Docker Testing
Build and run tests in Docker:
# Build test image
docker build --target ldr-test -t ldr-test .
# Run tests
docker run --rm -v "$PWD":/app -w /app ldr-test \
pytest tests/ --ignore=tests/ui_tests -v
Environment Variables for CI
| Variable | Purpose |
|---|---|
| `CI=true` | Indicates CI environment |
| `LDR_TESTING_WITH_MOCKS=true` | Enable test mocks |
| `LDR_DISABLE_RATE_LIMITING=true` | Disable HTTP rate limits in tests (canonical name). The legacy `DISABLE_RATE_LIMITING=true` is still honored but emits a deprecation warning. Distinct from `LDR_RATE_LIMITING_ENABLED`, which controls the adaptive search-engine rate limiter (a different subsystem). |
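The canonical-versus-legacy behaviour described for `LDR_DISABLE_RATE_LIMITING` can be sketched as follows. This is an illustration only: `rate_limiting_disabled` is a hypothetical name, the project routes environment access through `SettingsManager` rather than a plain mapping, and the precedence and the accepted spelling of `true` are assumptions.

```python
import os
import warnings


def rate_limiting_disabled(env=None) -> bool:
    """Return True when HTTP rate limiting should be disabled for tests."""
    env = os.environ if env is None else env

    # Canonical name wins whenever it is present (assumed precedence).
    canonical = env.get("LDR_DISABLE_RATE_LIMITING")
    if canonical is not None:
        return canonical.strip().lower() == "true"

    # Legacy name is still honored, but flagged as deprecated.
    legacy = env.get("DISABLE_RATE_LIMITING")
    if legacy is not None:
        warnings.warn(
            "DISABLE_RATE_LIMITING is deprecated; use LDR_DISABLE_RATE_LIMITING",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy.strip().lower() == "true"

    return False


if __name__ == "__main__":
    print(rate_limiting_disabled({"LDR_DISABLE_RATE_LIMITING": "true"}))
    print(rate_limiting_disabled({}))
```

Note that `LDR_RATE_LIMITING_ENABLED` is deliberately not consulted here, since it belongs to the search-engine rate limiter.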
Adding New Workflows
When adding a new workflow:
- Use pinned action versions (full commit SHAs)
- Add `permissions: {}` at top level (minimal permissions)
- Add job-level permissions as needed
- Include `step-security/harden-runner` step
- Add workflow to this documentation
Example template:
name: New Workflow
on:
  pull_request:
    branches: [main]
permissions: {}
jobs:
  example:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - name: Harden the runner
        uses: step-security/harden-runner@... # pinned
        with:
          egress-policy: audit
      - uses: actions/checkout@... # pinned
        with:
          persist-credentials: false
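The template elides the pins as `@...`. A quick way to check that a new workflow really pins every action to a full-length commit SHA (40 hex characters) is sketched below. This is a simplified illustration, not one of the project's hooks: `unpinned_actions` is a hypothetical name, the check is line-based rather than a real YAML parse, and the SHA in the demo is a placeholder.

```python
import re

# Matches "uses: <ref>" with an optional leading "- "; stops at whitespace or a comment.
_USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that are not pinned to a 40-char commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = _USES_RE.match(line)
        if match is None:
            continue
        ref = match.group(1).strip("'\"")
        # Local actions and docker:// images are outside the scope of this check.
        if ref.startswith(("./", "docker://")):
            continue
        _, _, version = ref.partition("@")
        if not _SHA_RE.fullmatch(version):
            unpinned.append(ref)
    return unpinned


if __name__ == "__main__":
    placeholder_sha = "a" * 40  # not a real commit, for illustration only
    workflow = (
        "steps:\n"
        f"  - uses: step-security/harden-runner@{placeholder_sha} # pinned\n"
        "  - uses: actions/checkout@v4\n"
    )
    print(unpinned_actions(workflow))
```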