mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-16 03:51:07 +03:00

LearningCircuit 1f0b0a4a95 ci(release): build-once-promote refactor for Docker pipeline (#3977)

* ci(release): build-once-promote refactor for Docker pipeline

Today the release pipeline builds the Docker image twice — once in
prerelease-docker.yml for "testing" and again in docker-publish.yml for
the actual release. The image you tested is not the image you ship: base
layer patches, transitive deps, and apt/pip resolution can diverge
between the two builds.

This refactor makes prerelease-docker.yml the canonical build and turns
docker-publish.yml into a thin retag step. `docker buildx imagetools
create` is a registry-side metadata operation that takes seconds and
preserves the manifest digest, so the released image is bit-identical
to the one tested. Cosign signatures, SBOM attestations, and SLSA
provenance are stored at sha256-<digest>.{sig,att} keyed by digest, so
signing once in prerelease covers the release tags transitively.
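
A minimal sketch of why one signature covers every tag that shares a digest. It relies on the `sha256-<digest>.{sig,att}` naming described above; the tag names and the all-`ab` digest are hypothetical placeholders.

```python
def cosign_artifact_tag(digest: str, suffix: str) -> str:
    """Return the registry tag under which a cosign artifact for `digest` is stored."""
    algo, _, hex_part = digest.partition(":")
    if algo != "sha256" or not hex_part:
        raise ValueError(f"expected a sha256:<hex> digest, got {digest!r}")
    # The name depends only on the digest, never on the image tag.
    return f"{algo}-{hex_part}.{suffix}"


# Hypothetical registry state after promotion: both tags resolve to one digest.
shared_digest = "sha256:" + "ab" * 32
tags = {
    "prerelease-v1.6.9-example": shared_digest,
    "1.6.9": shared_digest,
}

signature_tags = {name: cosign_artifact_tag(d, "sig") for name, d in tags.items()}
```

Because the lookup key is derived from the digest alone, a retag that preserves the digest leaves the signature and attestation lookups unchanged.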

Pipeline shape changes:

- prerelease-docker.yml is now a reusable workflow (workflow_call) called
  from release.yml. It builds, scans (Trivy), signs (cosign), attests
  the SBOM (cosign attest --type spdxjson, replacing the deprecated
  cosign attach sbom), and emits SLSA provenance. The manifest_digest is
  exposed as a workflow output. The `prerelease` environment gates the
  first build job for human approval.
- docker-publish.yml shrinks from ~457 to ~250 lines. It receives source_tag
  and expected_digest in the dispatch payload, verifies the source digest
  before retag, retags via imagetools create, verifies the digest is
  preserved (defense against re-encoding), re-runs Trivy against the
  digest (catches CVE-DB updates between prerelease and promote),
  verifies the cosign signature transitivity, and runs the existing
  prerelease cleanup loop.
- release.yml adds prerelease-docker to create-release.needs and
  trigger-workflows.needs, so the GitHub Release and the publish dispatch
  only happen after the canonical Docker build completes. The dispatch
  payload now carries source_tag and expected_digest. A new
  cleanup-on-rejection job removes orphan prerelease tags and cosign
  artifacts when the release is rejected (without it, every rejection
  would leave dangling sha256-<digest>.{sig,att} on Docker Hub).
- README cosign verify example updated to the keyless invocation users
  actually need (identity regex pointing at prerelease-docker.yml,
  --certificate-oidc-issuer, --certificate-github-workflow-repository),
  plus the SBOM verify-attestation command.

Notable design decisions (verified across multiple subagent review
rounds):

- SLSA provenance entryPoint stays as release.yml (the top-level caller).
  Per the SLSA GHA buildtype v1 spec and the canonical
  slsa-github-generator behavior, reusable workflows are explicitly NOT
  entryPoints — pointing at prerelease-docker.yml would break verifier
  policies that allowlist trigger workflows.
- Cosign cert identity for verification matches Fulcio's SAN URI, which
  is built from job_workflow_ref — the CALLEE for reusable workflows. So
  the identity regex matches prerelease-docker.yml even though the build
  is invoked from release.yml. Hardened with escaped dots, refs/(heads|tags)/
  constraint, and --certificate-github-workflow-repository to defend
  against the reusable-workflow-identity-reuse class of attacks.
- cleanup-on-rejection uses an allowlist if (failure || cancelled), not
  a denylist (!= 'success'), to avoid firing on `skipped` (e.g. when
  release_exists short-circuits the run). It also fails loudly on 401/403
  from the Docker Hub API so a missing Delete scope on the PAT can't
  silently let orphans accumulate.
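
To illustrate the identity-regex hardening, here is a rough Python sketch. The pattern is an assumption modelled on the description above (escaped dots, `refs/(heads|tags)/` constraint, callee workflow path); the README's actual regex may differ, and cosign evaluates Go regular expressions rather than Python ones.

```python
import re

# Assumed pattern, not copied from the README.
IDENTITY_RE = re.compile(
    r"https://github\.com/LearningCircuit/local-deep-research/"
    r"\.github/workflows/prerelease-docker\.yml@refs/(heads|tags)/.+"
)


def identity_allowed(san_uri: str) -> bool:
    """True if the certificate SAN URI names the callee workflow on a branch or tag ref."""
    return IDENTITY_RE.fullmatch(san_uri) is not None


BASE = "https://github.com/LearningCircuit/local-deep-research/.github/workflows/"
callee_on_main = BASE + "prerelease-docker.yml@refs/heads/main"
caller_on_main = BASE + "release.yml@refs/heads/main"
callee_on_pr = BASE + "prerelease-docker.yml@refs/pull/1/merge"
# An unescaped "." would also accept this lookalike host.
lookalike_host = callee_on_main.replace("github.com", "githubXcom")
```

Only `callee_on_main` passes: the caller workflow, a pull-request ref, and the lookalike host are all rejected.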

Supersedes #3969 (split-environment): the env split is preserved by the
new structure — prerelease env on the called workflow's first job,
release env on create-release/trigger-workflows.

Pre-merge checklist for the maintainer:
- Create the `prerelease` environment in GitHub Settings with the same
  required reviewers as `release`. Without it, the called workflow's
  approval gate auto-creates the env with no protection rules and
  silently approves the build.
- Verify DOCKER_USERNAME / DOCKER_PASSWORD remain repo-level secrets
  (they currently are). Environment-scoped secrets do not propagate
  across reusable workflow calls except via the called job's own
  environment.

* ci(release): fixes from multi-round subagent review

Round 1 surfaced 14 candidate findings; Round 2 verified 7 as real bugs and
refuted 4 as false positives. This commit applies the verified fixes.

CONFIRMED bugs fixed:

1. **Approval gate was per-job, not workflow-wide.** The previous
   `environment: prerelease` was set on `build-amd64` only, so
   `build-arm64` and `security-scan` could run before approval (GitHub
   environments are job-scoped per docs + community/discussions/174381).
   Replaced with a sentinel `approval-gate` job that all three build jobs
   `needs:`. A single approval click still gates everything, but now
   actually blocks all parallel jobs.

2. **`cleanup-on-rejection` if-condition missed the prerelease-rejection
   path.** When prerelease-docker.result was `failure`, both create-release
   and trigger-workflows became `skipped` (their `if:` requires success),
   and the cleanup `if:` only fired on `failure`/`cancelled` of dependents.
   Added explicit `prerelease-docker.result == 'failure'` clause so the
   most common rejection path actually triggers cleanup.

3. **Trivy re-scan ran AFTER retag.** A failing scan would leave release
   tags `:1.6.9`, `:1.6`, `:latest` publicly published with no rollback.
   Reordered: scan source digest BEFORE retag. Content is bit-identical
   (same digest), so scanning the prerelease tag tests what would be
   promoted — but failure now leaves no public broken tags. Also moved
   cosign verify before retag for the same reason.

4. **Trivy only scanned linux/amd64 by default** against a manifest list
   digest (per Trivy docs + aquasecurity/trivy#7847). Replaced single scan
   with two explicit per-platform invocations
   (`--platform linux/amd64`, then `linux/arm64`) so arm64 layers are also
   gated by the freshness check.

5. **Trivy DB freshness wasn't guaranteed.** apt-installed Trivy may use a
   stale embedded DB. Added explicit `trivy image --download-db-only`
   before the scans so the CVE-DB freshness window the re-scan exists for
   is actually exercised.

6. **`cosign attest` re-runs accumulated attestation layers** (verified via
   cosign 2.x `mutate.go` `dedupeAndReplace`). Added `--replace` to both
   attest calls (SLSA provenance + SBOM). Sigstore spec allows multi-sig
   so `cosign sign` is left as-is.

7. **SLSA provenance values inherited from old code were misleading.**
   - `builder.id`: changed from `https://github.com/actions/runner` (the
     agent binary) to the workflow ref the build is actually defined in
     (per SLSA v0.2 spec — builder.id should be a verifiable trust root).
   - `completeness.{parameters,environment,materials}`: flipped from
     `true` to `false`. The predicate captures no workflow_call inputs,
     no environment, and the build does network I/O — claiming
     completeness was a public signed false statement.
   - `buildInvocationId`: now includes `${run_id}-${run_attempt}` so
     re-runs are distinguishable.
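
The cleanup condition from fix 2 can be modelled as a small truth function. This is a sketch under assumptions: job results are plain strings as GitHub reports them (`success`, `failure`, `cancelled`, `skipped`), and the function names are illustrative rather than taken from release.yml.

```python
REJECTED = {"failure", "cancelled"}


def should_cleanup(prerelease_docker: str, create_release: str, trigger_workflows: str) -> bool:
    """Allowlist: fire only on explicit rejection states, never on `skipped` alone."""
    dependents_rejected = create_release in REJECTED or trigger_workflows in REJECTED
    # Extra clause from fix 2: a failed prerelease build leaves dependents `skipped`.
    return dependents_rejected or prerelease_docker == "failure"


def denylist_cleanup(create_release: str, trigger_workflows: str) -> bool:
    """The rejected alternative: also fires when jobs were merely skipped."""
    return create_release != "success" or trigger_workflows != "success"
```

With all three jobs `skipped` (for example when `release_exists` short-circuits the run), `should_cleanup` stays false while `denylist_cleanup` would fire.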

REFUTED (kept as-is, with confidence):

- `imagetools create` does NOT change the digest in this case. Buildx's
  Combine() in util/imagetools/create.go has an explicit short-circuit
  for single-source manifest-list inputs that returns the bytes
  byte-for-byte (no annotations + same registry required, both true here).
- Concurrent rejection digest collision is not a real concern — Docker
  builds in this pipeline are not bit-deterministic (apt, network, file
  timestamps, default provenance attestations all vary).
- The `prerelease-v1.6.9-*` cleanup pattern does NOT collide with
  `prerelease-v1.6.91-*` (trailing dash in the prefix disambiguates).
- Reusable-workflow approval prompts appear inline on the caller run
  page for single-level calls — not a UX regression.
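
The prefix-collision point is easy to check in isolation. A minimal sketch, assuming the cleanup selects tags by plain prefix match; the tag suffixes below are made up for illustration.

```python
def prerelease_tags_for(all_tags: list[str], version: str) -> list[str]:
    """Select the prerelease tags that belong to exactly `version`."""
    prefix = f"prerelease-v{version}-"  # the trailing dash is what disambiguates
    return [t for t in all_tags if t.startswith(prefix)]


# Hypothetical tag listing from the registry.
registry_tags = [
    "prerelease-v1.6.9-aaaa111",
    "prerelease-v1.6.91-bbbb222",
    "1.6.9",
]

selected = prerelease_tags_for(registry_tags, "1.6.9")
```

Dropping the trailing dash from the prefix would make `prerelease-v1.6.91-bbbb222` match as well.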

* ci(release): revert most Round 2 review additions

Keep the build-once-promote refactor's structural shape but back out the
defensive additions from commit 68606b299:

- approval-gate sentinel job → revert to `environment: prerelease` on
  build-amd64 only
- SLSA builder.id, completeness flags, buildInvocationId → revert to
  inherited values from the previous docker-publish.yml
- `cosign attest --replace` → drop, accept default append behavior
- Pre-promote Trivy + multi-platform scans + db refresh + pre-promote
  cosign verify → revert to single post-promote scan and post-promote
  cosign verify
- cleanup-on-rejection if-condition → drop the
  `prerelease-docker.result == 'failure'` allowlist clause

Rationale: keep the change set minimal vs main. The defensive additions
were correct in isolation but expand the scope of this PR.

* fix(ci): drop invalid --trivyignores flag from raw trivy CLI invocation

The Round 2 promote step used `--trivyignores .trivyignore`, which is the
INPUT name of the aquasecurity/trivy-action wrapper, not a flag of the
raw Trivy binary. The CLI accepts only `--ignorefile` (singular) and
auto-loads `.trivyignore` from cwd by default.

As written, every release run would hard-fail with `unknown flag:
--trivyignores` from cobra/pflag before any scanning occurred. Removing
the flag is sufficient — Trivy auto-loads the ignorefile from the
checkout root.

prerelease-docker.yml is unaffected: it uses the action wrapper with
`trivyignores: '.trivyignore'` as input, which IS correct usage for the
action layer (it translates to --ignorefile internally via
TRIVY_IGNOREFILE).

Sources:
- https://trivy.dev/latest/docs/references/configuration/cli/trivy_image/
- https://github.com/aquasecurity/trivy-action/blob/master/action.yaml

* ci(release): apply remaining bugs from multi-round review

After Round 4 verification confirmed several deferred findings, this
commit applies the bug fixes the user explicitly requested:

1. Re-introduce the `approval-gate` sentinel job in prerelease-docker.yml.
   GitHub Actions environments are job-scoped, so without a gate sentinel
   `build-arm64` and `security-scan` would run pre-approval — pushing
   the `-arm64` per-arch tag and consuming Trivy minutes regardless of
   whether the maintainer approved or rejected the gate. Single approval
   click still gates everything via `needs: [approval-gate]`.

2. Fix the SLSA `builder.id` to use `${{ github.workflow_ref }}` instead
   of the inherited `https://github.com/actions/runner` agent identity.
   `workflow_ref` resolves to the canonical
   `<owner>/<repo>/.github/workflows/<file>.yml@<callee-ref>` format that
   matches slsa-github-generator's output and that verifier policies can
   pin against.

3. Flip SLSA `completeness.{parameters,environment,materials}` from
   `true` to `false`. The predicate captures no workflow_call inputs, no
   environment, and the build does network I/O — claiming completeness
   was a public signed false statement.

4. Add `${{ github.run_attempt }}` to the SLSA `buildInvocationId` so
   "Re-run failed jobs" attempts are distinguishable.

5. Expand `cleanup-on-rejection` `if:` to include
   `prerelease-docker.result == 'failure'` and `'cancelled'`. Without
   these clauses, the most common rejection path (env approval rejected
   for prerelease) leaves dependents `skipped`, which the existing
   allowlist doesn't match — orphan tags persist on Docker Hub forever.

6. Drop unused `packages: write` from both the called workflow and the
   caller's reusable-workflow block. Docker Hub auth uses
   DOCKER_PASSWORD, not GITHUB_TOKEN; `packages: write` only matters for
   ghcr.io which the project doesn't use.

7. Update `docs/CI_CD_INFRASTRUCTURE.md` Build & Deploy table to reflect
   the build-once-promote split.

8. Update `docs/RELEASE_GUIDE.md` "Automatic Publishing" section to
   describe both approval gates (`prerelease` and `release`).
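
Items 2 to 4 all touch fields of the SLSA predicate. The sketch below shows roughly what the corrected values could look like. The nesting follows the SLSA v0.2 layout (`builder.id`, `metadata.completeness`), the `buildInvocationId` spelling follows this commit message, and the exact ID format and sample inputs are assumptions.

```python
def provenance_fields(workflow_ref: str, run_id: str, run_attempt: str) -> dict:
    """Build the provenance fields discussed above (illustrative, not the workflow's code)."""
    return {
        "builder": {"id": workflow_ref},
        "metadata": {
            # Including run_attempt keeps re-runs of the same run_id distinguishable.
            "buildInvocationId": f"{run_id}-{run_attempt}",
            # Nothing is claimed complete: inputs and environment are not
            # captured, and the build performs network I/O.
            "completeness": {
                "parameters": False,
                "environment": False,
                "materials": False,
            },
        },
    }


# Hypothetical values standing in for the GitHub context.
REF = "owner/repo/.github/workflows/example.yml@refs/heads/main"
first = provenance_fields(REF, "1001", "1")
rerun = provenance_fields(REF, "1001", "2")
```

The two attempts share a `builder.id` but carry different invocation IDs.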

* ci(release): R5/R6 review fixes — cosign pin, multi-arch SBOM, orphan SBOM

Round 5 (10 agents) and Round 6 (5 agents debunking) verified these
findings, all of which are now applied:

1. **Pin cosign to v2.6.0**. R6A2 verified that `sigstore/cosign-installer@v4.1.2`
   ships cosign v3.0.6 by default. cosign v3 enables `--new-bundle-format`
   ON BY DEFAULT, which changes the on-wire signature/attestation format.
   Signing and verifying still agree in-pipeline (both on v3),
   but downstream verifiers running the README cosign-verify recipe on v2
   would fail. Pinning all three cosign-installer steps to v2.6.0 keeps
   the legacy tag-based sigstore format until we deliberately migrate
   the entire ecosystem.

2. **Multi-arch SBOM via per-arch attestations**. R6A3 verified the claim
   (anchore/syft#1708, actions/attest-sbom#60): syft against a manifest
   list digest only scans the host platform's layers. The previous SBOM
   attestation against the manifest digest claimed to describe both
   amd64 + arm64 but actually only enumerated amd64. ARM64 consumers
   were verifying a misleading SBOM. Fix: iterate over manifest entries
   from `imagetools inspect --raw`, run `syft --platform <plat>` against
   each per-arch digest, and `cosign attest --replace --type spdxjson`
   each per-arch SBOM against the per-arch digest. ALSO keep a
   manifest-list-level SBOM (host arch only) so end-users running
   `cosign verify-attestation user/img:latest` don't get an empty result.

3. **Re-add `--replace` to cosign attest** (both SLSA and SPDX). R5A7's
   deeper analysis enumerated specific failure modes beyond cosmetic
   clutter: Kyverno `count: 1` policies, registry layer count caps,
   audit ambiguity (verify returns success on first matching layer),
   Rekor entry bloat. R3A5 already confirmed `--replace` is per-
   predicate-type, so SLSA and SPDX attestations don't disturb each
   other.

4. **Container-image SBOM no longer orphaned**. R6A4 verified: the
   Syft-produced container SBOMs were uploaded as artifact `sbom` from
   prerelease-docker.yml but never downloaded by `create-release` — they
   were invisible on the GitHub Release page. Fix: download the `sbom`
   artifact, rename to `sbom-container-*` to disambiguate from the
   filesystem `sbom-spdx.json`, and attach to `gh release create`.

5. **Narrow `secrets: inherit` to explicit secrets**. R5A3 flagged that
   `secrets: inherit` propagates ALL repo secrets (PAT_TOKEN,
   OPENROUTER_API_KEY, SERPER_API_KEY, GITHUB_TOKEN) into a workflow
   that only needs Docker Hub creds. Replaced with explicit
   `DOCKER_USERNAME` + `DOCKER_PASSWORD` mapping; the called workflow
   now declares these as required `workflow_call.secrets`.

6. **Drop unused `DEPS_HASH` build-arg**. R5A2 confirmed it was declared
   in the Dockerfile but never referenced in any RUN/COPY, so it never
   busted the Docker layer cache. Cache invalidation already happens
   correctly via `COPY pdm.lock` (file content hash). Removed the ARG
   declaration from Dockerfile and the three `build-args:` passes from
   prerelease-docker.yml.
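
For the per-arch SBOM loop in item 2, the step that matters is turning the raw image index into a platform-to-digest map. A minimal Python sketch, assuming the JSON shape of an OCI image index as printed by `imagetools inspect --raw`; the workflow itself reportedly does this with `jq`, and the sample digests are placeholders.

```python
import json


def per_arch_digests(raw_index: str) -> dict[str, str]:
    """Map "os/arch" to the per-arch manifest digest, skipping attestation entries."""
    index = json.loads(raw_index)
    result: dict[str, str] = {}
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        arch = platform.get("architecture", "unknown")
        if arch == "unknown":
            # BuildKit attestation manifests appear as pseudo-entries.
            continue
        result[f"{platform.get('os', 'unknown')}/{arch}"] = entry["digest"]
    return result


sample_index = json.dumps({
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbb", "platform": {"architecture": "arm64", "os": "linux"}},
        {"digest": "sha256:ccc", "platform": {"architecture": "unknown", "os": "unknown"}},
    ]
})
digests = per_arch_digests(sample_index)
```

Each resulting digest would then get its own SBOM and its own attestation.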

R6 also REFUTED two earlier claims:
- R5A8's concurrency claim: reusable workflows DO share the caller's
  workflow run and concurrency group (R3A8 was correct). Don't add a
  `concurrency:` block to prerelease-docker.yml; it would create a
  separate group and re-introduce the race R5A8 imagined.
- R5A10's harden-runner CVE claim: v2.19.1 (used here) is well after
  the fix versions for both CVE-2026-32946 (v2.16.0) and CVE-2026-25598
  (v2.14.2). No bump needed.

* ci(release): R7 fixes — cosign v2.6.3, drop misleading manifest-level SBOM

Round 7 (5 agents) verified the R5/R6 fixes and surfaced two real bugs:

1. **cosign-installer pinned cosign v2.6.0**, which has two known security
   advisories: GHSA-whqx-f9j3-ch6m (fixed in v2.6.2) and GHSA-w6c6-c85g-mmv6
   (fixed in v2.6.3). Bumped pin to v2.6.3 in all three workflow files so
   the install step picks up the fixes. Same minor (v2.6.x), so no flag
   drift — `--replace`, `--type`, `--bundle`, `--certificate-*` all behave
   identically.

2. **The manifest-level SBOM attestation was misleading**. The previous
   step ran `syft <repo>@<manifest-list-digest>` on an amd64 runner,
   which (per anchore/syft#1708) only enumerates amd64 layers. The SBOM
   was then attested at the manifest-list digest where it was discoverable
   by ALL platform consumers — so an arm64 user verifying `:latest` would
   receive a signed SBOM that lies about the layers they actually pulled.
   The per-arch loop already produces accurate per-platform SBOMs; the
   manifest-level fallback only re-introduced the lie for UX convenience.

   Dropped the manifest-level attest call entirely. Per-arch SBOMs are the
   only honest representation. Updated the README's `cosign
   verify-attestation` recipe to resolve to the per-platform digest first
   (using `jq` over `imagetools inspect --raw`), so end-users on either
   architecture get the SBOM that actually describes what they pulled.
   Removed `sbom.spdx.json` from the workflow artifact + release-staging
   logic since it no longer exists.

3. **Empty-loop assertion**: added a defensive count check before the
   per-arch SBOM loop. If a future buildx output change ever produced
   zero per-arch entries (e.g., all entries marked architecture: unknown),
   the previous code would silently skip the loop and pass CI green with
   no SBOMs. Now it fails loud with the raw manifest dumped for debugging.
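
The empty-loop assertion in item 3 amounts to a guard in front of the loop. A sketch of the idea in Python; the function name and message wording are illustrative, not taken from the workflow.

```python
def require_platform_entries(entries: dict, raw_manifest: str) -> dict:
    """Fail loudly instead of letting an empty loop pass silently."""
    if not entries:
        raise RuntimeError(
            "no per-arch manifest entries found; raw manifest follows:\n" + raw_manifest
        )
    return entries


def attest_all(entries: dict, raw_manifest: str) -> list[str]:
    """Return the platforms that would be attested (the attest call itself is elided)."""
    checked = require_platform_entries(entries, raw_manifest)
    return [platform for platform in checked]
```

Without the guard, `attest_all({}, raw)` would return an empty list and the step would still report success.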

Note on round-7 reviewer's other concerns:
- "Pipe-to-while subshell scope": confirmed safe. set -euo pipefail
  inherited; failures in syft/cosign attest abort the subshell, and
  pipefail propagates to the outer step.
- "imagetools inspect --raw stability": the OCI image-index spec has been
  stable for ~7 years. The jq filter handles the BuildKit attestation
  pseudo-entries via `architecture != "unknown"`.
- "harden-runner v2.19.1 CVEs": false alarm. v2.19.1 is well above the
  fix versions (v2.16.0, v2.14.2). No bump needed.

* ci(release): R8 fixes from 8th review round

Round 8 (5 agents covering Dockerfile, npm/Vite, runtime image, edge
cases, and post-fix smoke check) surfaced 7 real bugs the previous 7
rounds missed. All fixed here, plus a comment per user request.

1. **docker-publish.yml checkout pinned to released tag**. The promote
   step reads `.trivyignore` from cwd; a `repository_dispatch`-triggered
   checkout defaults to the default branch's tip, which can drift between
   prerelease scan and promote scan if `.trivyignore` is edited on main
   while the release awaits approval. Added `ref: ${{
   github.event.client_payload.tag }}` to checkout.

2. **docker-publish.yml concurrency block added**. release.yml has its
   own concurrency, but docker-publish.yml is a separate workflow run.
   Two near-simultaneous publish-docker dispatches for the same release
   tag (e.g., a manual re-trigger after a transient Docker Hub 5xx) could
   interleave and have their cleanup-loop prefix-match deletions race
   each other. Group: `publish-docker-${{ github.event.client_payload.tag
   }}`, cancel-in-progress: false.

3. **publish.yml's frontend builder bumped from Node 20 → 24** to match
   `package.json`'s `engines: { node: ">=24.0.0" }`. Mismatched Node
   versions across the PyPI build (Node 20) and the Docker image (Node
   24, installed via NodeSource) could resolve transitive deps differently
   and ship frontend assets that fail at runtime. Pinned to specific
   `node:24-alpine` SHA.

4. **HEALTHCHECK no longer leaks Python processes**. The old
   `urllib.request.urlopen(...)` had no Python-level timeout, so a
   hung-but-alive backend would freeze the probe until Docker's outer
   timeout SIGKILL'd it — leaving a Python process per probe interval
   leaking PIDs/FDs over time. Added `timeout=5` and an explicit `r.status
   == 200` check so non-200 2xx responses (e.g., from misconfigured
   proxies) don't pass.

5. **Removed broken `VOLUME /scripts/`**. /scripts is image content (the
   ollama entrypoint baked in by the layer below the VOLUME directive),
   not user state. A VOLUME on an image-populated path causes anonymous-
   volume accumulation on every `docker run` and silently shadows the
   script if a user ever bind-mounts it.

6. **Added `VOLUME /data`** so users who don't bind-mount don't silently
   lose research data + encrypted DBs on `docker rm`. The entrypoint
   creates the persistent state at /data/{logs,cache,encrypted_databases},
   but without VOLUME the directory is part of the writable image layer.

7. **Stale comment in release.yml** (the SBOM download step) updated —
   no longer mentions the manifest-level SBOM that was dropped in
   commit 33d69b4e4.

Plus one comment update per user request:
8. **`apt-get upgrade -y` rationale comment** added at the
   build-once-promote section of the Dockerfile (top stage), and
   cross-referenced from the other two `apt-get upgrade` sites
   (ldr-test stage and runtime stage). Documents that the trade-off of
   bit-for-bit reproducibility for always-fresh CVE patches is
   intentional, and explains how build-once-promote mitigates the
   reproducibility loss.
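
The HEALTHCHECK change in item 4 can be sketched as a small probe function. This is an approximation of the described behavior, not the Dockerfile's actual one-liner; the function name is hypothetical and the caller supplies the URL.

```python
import http.client
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True only for an HTTP 200 received within `timeout` seconds."""
    try:
        # The timeout bounds blocking socket operations, so a hung backend
        # cannot stall the probe indefinitely.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and HTTP error statuses all count as unhealthy.
        return False
```

A health check built on this would exit non-zero whenever `probe` returns `False`, so a 204 response or a hung backend both mark the container unhealthy.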

* ci(release): clean up per-arch cosign attestation orphans on rejection

Round 9 found that the per-arch SBOM attestations introduced in commit
11e702f7d (the multi-arch SBOM fix) live at
`sha256-<per-arch-digest>.{sig,att,sbom}` keyed by the PER-ARCH manifest
digests, not the manifest-list digest. The cleanup-on-rejection job only
knew the manifest-list digest, so on rejection paths the per-arch
attestation artifacts were left orphaned on Docker Hub forever — and
unreachable through any tag, since the per-arch leaf tags were also
deleted.

Fix: before deleting the manifest tag, inspect it via `imagetools inspect
--raw` to discover the per-arch digests, then queue per-arch
`{sig,att,sbom}` deletions alongside the manifest-level cleanup. If the
manifest tag doesn't exist (e.g., build failed before manifest creation),
log a clear warning and proceed — the per-arch artifacts wouldn't have
been created in that case anyway.
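
A sketch of how the deletion queue could be assembled. It assumes the suffix sets named in this message (`sig`/`att` for the manifest list, `sig`/`att`/`sbom` per arch) and represents a missing manifest tag as `None`; the function name and return shape are illustrative.

```python
from typing import Optional

MANIFEST_SUFFIXES = ("sig", "att")
PER_ARCH_SUFFIXES = ("sig", "att", "sbom")


def cosign_cleanup_queue(manifest_digest: str, per_arch: Optional[list[str]]) -> list[str]:
    """List the cosign artifact tags to delete for a rejected release."""
    def names(digest: str, suffixes: tuple[str, ...]) -> list[str]:
        hex_part = digest.removeprefix("sha256:")
        return [f"sha256-{hex_part}.{s}" for s in suffixes]

    queue = names(manifest_digest, MANIFEST_SUFFIXES)
    if per_arch is None:
        # The manifest tag was never created, so no per-arch artifacts exist.
        print("warning: manifest tag missing, skipping per-arch cleanup")
        return queue
    for digest in per_arch:
        queue.extend(names(digest, PER_ARCH_SUFFIXES))
    return queue
```

The per-arch digests must be discovered before the manifest tag is deleted, because afterwards nothing in the registry points at them.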

* ci(release): drop prerelease env gate — use single release approval

The `prerelease` environment approval was a holdover from when prerelease
docker was a SEPARATE test build alongside the release build (two
distinct artifacts, two distinct decisions). In the build-once-promote
model the "prerelease" image IS the release image (just under a
different tag), so gating the BUILD with a human approval is redundant —
the only meaningful decision is whether the tested image becomes the
official release.

Changes:
- Remove the `approval-gate` sentinel job in prerelease-docker.yml.
- Drop `needs: [approval-gate]` from build-amd64, build-arm64, and
  security-scan. They now run automatically once release.yml's security
  + CI gates pass.
- Update workflow comments in release.yml and prerelease-docker.yml to
  reflect the single-gate flow.
- Update RELEASE_GUIDE.md "Approval and Publishing" section: now
  describes ONE `release` env approval, not two.
- Update CI_CD_INFRASTRUCTURE.md row for prerelease-docker.yml.

The cleanup-on-rejection job is unchanged — its triggers still fire
correctly on prerelease-docker `failure`/`cancelled` (build/sign/attest
errors) and on create-release / trigger-workflows `failure`/`cancelled`
(release env rejection). One fewer rejection path to consider, but the
mechanism is the same.

Operational benefits:
- One fewer approval click per release
- One fewer GitHub Environment to create as a pre-merge setup step
  (no more "create the `prerelease` env in Settings before merging")
- Build completes during/after security gates, so the prerelease tag is
  ready by the time the maintainer is ready to test

* ci(docker-publish): group GITHUB_OUTPUT writes (shellcheck SC2129)

CI's actionlint hook (which runs shellcheck on workflow run blocks)
flagged the 'Determine release tags' step for issuing five sequential
`echo ... >> "$GITHUB_OUTPUT"` redirects. Grouped them into a single
braced block + one redirect, per SC2129's recommendation.

* docs(release): correct approval flow after env-scoped secrets merge

After merging main, prerelease-docker.yml's four jobs declare
`environment: release` (PRs #3978/#3983) because DOCKER_USERNAME and
DOCKER_PASSWORD are env-scoped. That means the first `release` env
approval now gates the canonical build, not just the publish step —
the "automatic build then test then approve" flow described in earlier
docs no longer matches reality.

- RELEASE_GUIDE.md: rewrite the approval section to describe two
  release-env approvals (release.yml + docker-publish.yml) and the
  narrow Docker-only test window between them.
- CI_CD_INFRASTRUCTURE.md: update the prerelease-docker.yml row.
- release.yml: rewrite the `prerelease-docker:` job comment to reflect
  that this step is gated, not automatic, and explain why.

* ci(release): atomic publish ordering — GitHub Release runs last (#4044)

* ci(release): make GitHub Release publishing atomic with Docker + PyPI

Before this change, `create-release` published the public GitHub Release
BEFORE `docker-publish.yml` retagged and BEFORE `publish.yml` shipped to
PyPI. If either downstream failed, the public Release pointed at
non-existent artifacts.

This change closes that window:

- Convert `docker-publish.yml` from `repository_dispatch` to
  `workflow_call`. Its result is now visible to release.yml as
  `needs.publish-docker.result`, which lets:
  * `create-release` block on Docker promote success
  * `cleanup-on-rejection` safely scope cosign artifact deletion to
    cases where retag failed (after a successful retag, release tags
    share the prerelease manifest digest, so cosign artifacts must
    stay — deleting them would invalidate release-tag verification)
- Keep `publish.yml` on `repository_dispatch`. PyPI Trusted Publishing
  matches the OIDC `workflow_ref` claim against the CALLER when invoked
  via `workflow_call`, so a reusable publish.yml would fail with
  `invalid-publisher`. Tracked in pypa/gh-action-pypi-publish#166 and
  pypi/warehouse#11096.
- Restructure release.yml job graph:
    prerelease-docker → publish-docker (reusable) → trigger-pypi
      → monitor-pypi → create-release (LAST)
- Rewrite `cleanup-on-rejection` with a partial-retag rollback preamble.
  `imagetools create -t :VERSION -t :MAJOR_MINOR -t :latest` is a single
  process with multiple registry calls, so a mid-step failure can leave
  some release tags landed. The cleanup script now checks each release
  tag against Docker Hub and rolls back any that exist BEFORE deleting
  cosign signature/attestation artifacts.
- Slim `monitor-publish` → `monitor-pypi` (only watches publish.yml now;
  Docker is tracked natively via the inline job result).
- Drop the workflow-level `concurrency:` block from docker-publish.yml.
  As a reusable workflow it shares release.yml's run, and release.yml's
  caller-level concurrency on `github.ref` already serialises releases
  for the same tag.
- Update `docs/CI_CD_INFRASTRUCTURE.md` workflow-table rows and
  `docs/RELEASE_GUIDE.md` approval-flow section to describe the new
  ordering, plus a "Recovery from PyPI failure" section documenting the
  one remaining atomicity hole (PyPI fails after Docker success — Docker
  release tags exist, no PyPI, no GH Release; manual re-dispatch needed).
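
The ordering constraint in the partial-retag rollback can be sketched as a plan builder. Assumptions: `landed` is a hypothetical callback that answers whether a release tag needs rolling back, the version is a plain `MAJOR.MINOR.PATCH` string, and the step names are illustrative. How the real script detects and reverts a landed tag is not shown here.

```python
from typing import Callable


def release_tags(version: str) -> list[str]:
    """Derive the :VERSION, :MAJOR_MINOR and :latest tags from a version string."""
    major, minor, _patch = version.split(".", 2)
    return [version, f"{major}.{minor}", "latest"]


def rollback_plan(
    version: str,
    landed: Callable[[str], bool],
    cosign_tags: list[str],
) -> list[tuple[str, str]]:
    """Order the cleanup so release tags are rolled back before cosign artifacts go."""
    steps = [("rollback-tag", tag) for tag in release_tags(version) if landed(tag)]
    steps += [("delete-cosign", tag) for tag in cosign_tags]
    return steps


# Hypothetical mid-step failure: only :1.6.9 was written before the error.
plan = rollback_plan("1.6.9", lambda tag: tag == "1.6.9", ["sha256-ab.sig"])
```

Keeping the rollback first avoids a window in which a public release tag points at an image whose signature has already been removed.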

Plan + 5-agent Round 1 review notes saved separately.

* fix(release): plug blockers found in multi-round PR review

Four fixes against the atomicity refactor — two blockers that would
break the next release, two hardening items found while verifying them.

B1 (BLOCKING): docker-publish.yml checked out at `ref: inputs.tag`
(e.g. v1.6.11), but the v* git tag is created by `create-release`
which runs LAST in the job graph — after `publish-docker`. So on every
push-to-main triggered release (the documented primary path) the
checkout would fail with `fatal: couldn't find remote ref v1.6.11`.
Switch to `ref: github.sha`: same triggering commit the build and
prerelease-docker jobs used, exists at the moment publish-docker
runs for every event type, and still satisfies the original goal
of pinning .trivyignore to the scanned commit.

B2 (BLOCKING): cleanup-on-rejection referenced env-scoped
DOCKER_USERNAME / DOCKER_PASSWORD but had no `environment: release`,
so those secrets resolved to empty strings and the Docker Hub login
exited 1 — leaving the orphan tags + cosign artifacts the cleanup
was meant to remove. Add `environment: release`. The `release` env
approval was already granted upstream in the run, so no new prompt.

H1: monitor-pypi's `Wait for PyPI publish workflow to complete` step
piped `gh run list | jq ...` without `set -euo pipefail`, so a
transient gh failure (network, auth, rate limit) was swallowed by
jq returning empty input — burning the full 40-minute budget on
silent error rather than failing fast. Add `set -euo pipefail`.

H2: cleanup-on-rejection's step 2 did not delete the floating
`:prerelease` tag. If a release was rejected after prerelease-docker
re-pointed `:prerelease`, step 4 deleted the cosign signature for
that manifest while `:prerelease` still pointed at it — yielding a
window where pulling `:prerelease` returns an image the README
cosign-verify recipe cannot verify. Include `prerelease` in step 2's
delete loop; the next successful prerelease-docker re-creates it.
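
H1 is about letting a failed status query abort the wait instead of being read as "not finished yet". A Python sketch of the same idea, with a hypothetical `fetch` callable standing in for the `gh run list | jq` pipeline:

```python
import time
from typing import Callable, Optional


def wait_for_conclusion(
    fetch: Callable[[], Optional[str]],
    attempts: int,
    delay: float = 0.0,
) -> str:
    """Poll until `fetch` returns a conclusion; any error from `fetch` propagates."""
    for _ in range(attempts):
        # No try/except here on purpose: a network, auth, or rate-limit error
        # should fail fast rather than burn the remaining polling budget.
        conclusion = fetch()
        if conclusion is not None:
            return conclusion
        time.sleep(delay)
    raise TimeoutError(f"no conclusion after {attempts} attempts")
```

An empty result (`None`) means "still running" and keeps the loop going, while an exception ends the wait immediately.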

* chore(release): follow-up cleanups from PR review

Bundle of low-risk follow-ups from the multi-round review of this PR.
All same-scope as the atomicity refactor — staleness this PR introduced
in docs/comments, hardening adjacent to the changed code paths.

L1 (hardening): Drop `id-token: write` from `publish-docker` (caller)
and `docker-publish.yml` `promote` (callee). cosign VERIFY is a
read-only check against public Rekor/Fulcio; no GitHub OIDC token is
minted, so the permission is unused. Signing (which DOES need the
write) is exclusively in prerelease-docker.yml.

L7 (stale comments): prerelease-docker.yml's header comments still
referenced `trigger-workflows` — a job this PR split into
`publish-docker` + `trigger-pypi`. Replaced both occurrences.

L4 (doc): RELEASE_GUIDE.md "Emergency Procedures" claimed a manual
GitHub release "still triggers PyPI/Docker" — false under the new
design (publish.yml is repository_dispatch-only and docker-publish.yml
is workflow_call-only, neither listens on `release:` events). Replaced
with the actual recovery hierarchy.

L5 (doc): RELEASE_GUIDE.md and CI_CD_INFRASTRUCTURE.md pipeline chains
omitted the `provenance` job between `build` and `prerelease-docker`.

L6 (doc): RELEASE_GUIDE.md described monitor-pypi's timeout as a flat
"40 min" — the inner poll loop is 40 min but the surrounding
`timeout-minutes:` is 90 min, so the user-facing failure surface differs.

L4-bonus (doc): Manual-trigger section also claimed workflow_dispatch
takes "version and prerelease flag" inputs — release.yml's
`workflow_dispatch:` has no inputs defined. Replaced with the actual
behavior (reads __version__.py at HEAD; use tag-push for older versions).

M5 (doc): Both PAT_TOKEN comments overstated required scopes — claimed
`workflow` scope was needed (it isn't; it only governs editing
.github/workflows/ via the API) and didn't make explicit that
`public_repo` is rejected by `repository_dispatch`. Rewritten.

M8 (correctness): docker-publish.yml's cosign verify step targeted the
mutable `:VERSION` tag instead of `@${EXPECTED_DIGEST}`. The preceding
verify-promoted-tags step already confirms the tag resolves to the
expected digest, but using the tag here leaves a tag-resolution TOCTOU
window between the two steps. Trivy's re-scan already uses
`@${EXPECTED_DIGEST}`; switching cosign to the same reference is
consistent and race-free.

L2 (style): While editing the cosign step, routed `github.repository`
through an `env:` var (`REPO`) instead of direct `${{ }}` template
interpolation into shell args, matching the convention in the rest of
this workflow.

* chore(ci): bump harden-runner pin in docker-publish.yml to match other workflows

Last remaining v2.19.1 reference — every other workflow in this PR was
bumped to v2.19.3 when main moved forward. Auto-merge missed this one
because the surrounding hunk was in a conflict region.

* chore(release): fixes from multi-round subagent review of the full PR

Bundle of low-risk fixes confirmed by 30 subagents across 3 rounds.
None are blockers; all are worth fixing in-scope.

1. SLSA provenance builder.id: was github.workflow_ref, which inside a
   workflow_call callee resolves to the CALLER (release.yml), not the
   intended callee (prerelease-docker.yml). The Fulcio cert is still
   right (built from the job_workflow_ref OIDC claim), so cosign verify
   and slsa-verifier are unaffected, but raw-JSON consumers reading
   builder.id would see release.yml. Compose the value from
   github.repository + hardcoded path + github.ref instead — the `job`
   context has no workflow_ref property (actionlint confirms), and for
   a local-path workflow_call the callee's ref equals github.ref.

2. Dockerfile: set ENV LDR_DATA_DIR=/data so the VOLUME /data directive
   is actually load-bearing. Without it, paths.py falls back to
   platformdirs (~/.local/share/local-deep-research) which is inside the
   ephemeral container layer — bare docker run -v vol:/data users would
   silently lose data on docker rm.

3. trigger-pypi: forward prerelease=false in client_payload. publish.yml
   gates Test PyPI vs prod PyPI on client_payload.prerelease == true; if
   absent, the expression evaluates to '' and falls through to prod. Set
   false explicitly to remove the silent-fallback landmine.

4. Stale/misleading cosign comments in release.yml:
   - line 322: said "v2.6.0" while value is "v2.6.3" — corrected and
     noted GHSA-w6c6-c85g-mmv6 patch coverage
   - line 332: attributed --bundle to v3.0.2+ but it's been in v2.4.0+

5. release-gate.yml Node 20 → 24 (mirror publish.yml + Dockerfile).
   package.json declares engines.node >=24.0.0. The pip-install-check
   wheel is discarded so this was not a release-blocker, but the gate
   now validates the actual ship runtime.

6. README cosign-verify recipe:
   - Guard empty PLATFORM_DIGEST with a clear message for single-arch
     or pre-build-once-promote releases
   - Add docker buildx to prerequisites list
   - Spell out the legacy-verification substitution explicitly
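
The silent-fallback problem from item 3 can be modelled in Python. This is a rough sketch under stated assumptions, not the expression publish.yml actually evaluates: `pypi_target` and `build_payload` are hypothetical names, and the strings "testpypi" and "pypi" are illustrative labels.

```python
from typing import Any, Mapping


def pypi_target(client_payload: Mapping[str, Any]) -> str:
    """Model of the receiver's gate: only a literal True selects Test PyPI.

    A missing key behaves like an empty value and falls through to prod,
    which is the implicit behavior item 3 wants to stop relying on.
    """
    return "testpypi" if client_payload.get("prerelease", "") is True else "pypi"


def build_payload(tag: str, prerelease: bool = False) -> dict[str, Any]:
    """Model of the sender: always include the flag, even when it is False."""
    return {"tag": tag, "prerelease": prerelease}
```

Both an absent key and an explicit `False` route to prod in this model. The change does not alter the outcome; it makes the routing decision visible in the payload.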

* fix(ci): pin Trivy in promote step via SHA-pinned action wrapper

AI reviewer flagged docker-publish.yml's promote step as installing Trivy
via `sudo apt-get install -y trivy` with no version pin, reintroducing a
supply-chain risk to the release path. The prerelease scan in
prerelease-docker.yml uses the SHA-pinned aquasecurity/trivy-action
@ed142fd... wrapper with `version: 'v0.69.2'`, but the promote step
switched to the bare CLI and lost that protection.

Replace the apt-get install + raw `trivy image` invocation with the same
pinned action wrapper. Same scan semantics (CRITICAL,HIGH, ignore-unfixed,
.trivyignore, exit-code 1), same binary version (v0.69.2), same action
SHA — keeps the two scans consistent and removes the unpinned apt path.

* fix(ci): pin Trivy in release.yml build job — same fix as docker-publish.yml

R4 review caught that the AI-reviewer-flagged unpinned Trivy install also
exists in release.yml's `build` job, and is STRICTLY WORSE there because
that job carries `id-token: write` (for cosign keyless signing of SBOMs).

The attack chain that was open:
1. Aqua apt-repo compromise OR MITM of the unpinned GPG-key fetch
2. Malicious `trivy fs` binary installed
3. Binary exfiltrates ACTIONS_ID_TOKEN_REQUEST_URL/TOKEN env vars,
   minting an OIDC token under repo:LearningCircuit/local-deep-research
4. Binary tampers with sbom-spdx.json / sbom-cyclonedx.json contents
5. Next step `Sign release artifacts with Sigstore` cosign-signs the
   tampered SBOM with a legitimate Sigstore cert → fraudulent SBOM
   attached to the GitHub release with valid signature

Replace with the SHA-pinned aquasecurity/trivy-action@ed142fd0... (same
pin as docker-publish.yml and prerelease-docker.yml) using scan-type=fs
for the filesystem scan, with `version: 'v0.69.2'` to pin the binary
itself. Two separate action invocations (one per output format) because
the action takes a single format per run.

Also removes the unpinned `gpg --dearmor` of an unverified-fingerprint
public key, which the prior comment misleadingly called "secure".

* fix(ci): use TRIVY_USERNAME/PASSWORD env vars for trivy-action auth

The trivy-action README prescribes TRIVY_USERNAME/TRIVY_PASSWORD env
vars as the supported Docker Hub auth path. Even though docker/login-
action already wrote ~/.docker/config.json earlier in the job (and Trivy
reads it as a fallback), there's documented fragility with docker.io
credential helpers (aquasecurity/trivy#432, aquasecurity/trivy#8385)
that surfaces specifically on registry-pull scans like this one (unlike
the prerelease scan which uses a locally-loaded image).

The fallback would probably work today since localdeepresearch/
local-deep-research is public — anonymous pull would succeed even
without auth — but rate-limiting on anonymous Docker Hub pulls is
aggressive and the documented credential-helper quirks are real. Adding
the env vars uses the action's prescribed auth path, with the same
DOCKER_USERNAME/DOCKER_PASSWORD secrets already passed in via
workflow_call. Zero-cost defense-in-depth.

2026-05-22 21:52:46 +02:00

12 KiB


Release Guide

🚀 Automated Release Process

Releases are fully automated end-to-end whenever a PR that bumps src/local_deep_research/__version__.py is merged to main. No separate tag push, manual workflow trigger, or release-page click is required — the workflow detects the new version, runs all gates, cuts the GitHub release, and (after one approval click) publishes to PyPI and Docker Hub.

PRs that don't touch __version__.py merge normally but skip the release pipeline (the version-check job sees the tag already exists and short-circuits everything downstream).

📋 How Releases Work

1. Automatic Release Creation

Trigger: Push to main whose __version__.py resolves to a tag that does not yet exist as a GitHub release. In practice this means "merge a PR that bumps __version__.py". Tag pushes (v*.*.*) and manual workflow_dispatch runs also trigger the pipeline and bypass the version-exists check.
Version: Read from src/local_deep_research/__version__.py by the version-check job; the tag is v<version>.
Release body: Composed by .github/workflows/release.yml from three sources:
1. AI narrative generated by OpenRouter (vars.AI_MODEL, default moonshotai/kimi-k2-thinking). The model receives the rendered hand-written notes, the auto-generated PR list, every PR's title + body (batched via one GraphQL call), and the diff between the previous release tag and this one (filtered to drop lockfiles, generated docs, SBOM, static assets, and binary patches; capped at 700k chars).
2. Hand-written notes from docs/release_notes/<version>.md — rendered from contributor-supplied changelog.d/*.md fragments by the workflow itself at release time (see Release-notes flow below). No manual pdm run towncrier build is required before merging the bump.
3. Auto-generated PR list from GitHub's generate-notes API, label-categorized.
No duplicates: If a release for v<version> already exists, the version-check job sets should_release=false and every downstream job (security gate, CI gate, build, publish) is skipped.
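
The push-to-main decision described above can be sketched as follows. This is a minimal model, not the version-check job itself. It assumes `__version__.py` contains a plain `__version__ = "..."` assignment; `tag_for` and `should_release` are hypothetical names. Tag pushes and manual runs, which bypass the check, are not modelled.

```python
import re

# Assumed file shape: __version__ = "1.6.9" (single or double quotes).
_VERSION_RE = re.compile(r"""__version__\s*=\s*["']([^"']+)["']""")


def tag_for(version_file_text: str) -> str:
    """Extract __version__ from the file's text and return the tag v<version>."""
    match = _VERSION_RE.search(version_file_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return f"v{match.group(1)}"


def should_release(tag: str, existing_releases: set[str]) -> bool:
    """Push-to-main case: release only if no release exists for this tag yet."""
    return tag not in existing_releases
```

A PR that leaves `__version__.py` untouched resolves to an existing tag, so `should_release` is false and everything downstream is skipped.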

2. Approval and Publishing

The release pipeline uses the release GitHub environment to gate the publish steps. DOCKER_USERNAME / DOCKER_PASSWORD are scoped to that environment, so any job that pushes to Docker Hub must declare environment: release and therefore goes through the approval gate.

When you merge to main (or push a tag), the pipeline runs in this order:

Security gates + CI gates run automatically.
build job runs (version pin, SBOM, Sigstore bundles), then provenance job generates SLSA provenance for those artifacts.
One release env approval prompt in release.yml. Approving unlocks all release-env jobs in the same run, which then execute sequentially:
1. prerelease-docker — canonical multi-arch Docker build, cosign sign, SBOM/SLSA attestations, push as prerelease-v<ver>-<sha> and re-point the floating :prerelease tag.
2. publish-docker — retags the prerelease manifest as :1.6.9, :1.6, :latest (no rebuild, digest-preserving), then re-verifies digest + cosign + Trivy on the promoted tag.
3. trigger-pypi — dispatches publish.yml via repository_dispatch (PyPI Trusted Publishing requires the publish step to run in a top-level workflow, so this can't be a reusable workflow_call).
4. monitor-pypi — polls publish.yml for completion. The inner polling loop times out at 40 minutes (after which the job fails); the surrounding GH Actions timeout-minutes is 90 to leave a safety margin around the poll budget.
5. create-release — publishes the GitHub Release with SBOM/sig/provenance assets. Runs last, gated on all of the above succeeding, so the public Release never points at missing Docker tags or a missing PyPI version.
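
The two timeouts on monitor-pypi can be pictured as a poll loop with its own deadline inside a larger hard limit. This is an illustrative Python model, not the job's actual script: `poll_until_done`, the 30-second interval, and the injectable clock are assumptions made for the sketch.

```python
import time
from typing import Callable

POLL_BUDGET_S = 40 * 60  # inner poll loop budget (40 minutes, per the guide)
JOB_TIMEOUT_S = 90 * 60  # surrounding timeout-minutes (90 minutes)


def poll_until_done(
    check: Callable[[], bool],
    budget_s: float = POLL_BUDGET_S,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if check() succeeds before the budget runs out, else False."""
    deadline = clock() + budget_s
    while clock() < deadline:
        if check():
            return True
        sleep(interval_s)
    # Budget exhausted: the caller fails the job well before the outer limit.
    return False
```

Because the inner budget is shorter than the outer limit, a stuck publish surfaces as an explicit poll failure rather than a runner-level timeout.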

If any of prerelease-docker, publish-docker, or monitor-pypi fails, create-release is skipped and no public GitHub Release is created. The cleanup-on-rejection job then handles failure-mode cleanup:

If publish-docker failed mid-retag (e.g., :1.6.9 landed but :latest failed), it rolls back any landed release tags BEFORE deleting prerelease tags and cosign artifacts (deleting cosign artifacts while release tags share the manifest digest would invalidate release-tag signatures).
If publish-docker succeeded but a later step (PyPI or create-release) failed, cleanup-on-rejection does NOT fire — Docker release tags exist and their cosign artifacts must stay. See "Recovery from PyPI failure" below.
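
The ordering constraint in these two cases can be sketched as a small planning function. This is a simplified model, not the cleanup-on-rejection job: `release_tags` and `cleanup_plan` are hypothetical names, and the action strings are illustrative labels.

```python
def release_tags(version: str) -> list[str]:
    """Tags promoted for a release, e.g. 1.6.9 -> 1.6.9, 1.6, latest."""
    major, minor, _patch = version.split(".")
    return [version, f"{major}.{minor}", "latest"]


def cleanup_plan(version: str, landed: set[str], publish_docker_ok: bool) -> list[str]:
    """Ordered cleanup actions after a failed run (empty if nothing may be undone)."""
    if publish_docker_ok:
        # Release tags exist and share the manifest digest with the prerelease,
        # so the cosign artifacts must stay. Recovery is manual from here.
        return []
    # Roll back any release tags that landed before the retag failed.
    plan = [f"delete-tag:{tag}" for tag in release_tags(version) if tag in landed]
    # Only then is it safe to drop the prerelease tags and their signatures.
    plan += ["delete-prerelease-tags", "delete-cosign-artifacts"]
    return plan
```

For a retag that landed only `:1.6.9`, the plan removes that tag first and the cosign artifacts last, so no release tag is ever left pointing at a digest whose signatures have been deleted.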

Recovery from PyPI failure (atomicity hole)

The one orphan state the pipeline cannot fully clean up: publish-docker succeeded, PyPI failed. At this point Docker :1.6.9 / :1.6 / :latest exist and are signed; PyPI has nothing; no GitHub Release. monitor-pypi opens a tracking issue labeled ci-cd. To recover:

Inspect the publish.yml workflow run, fix the underlying cause.

Manually re-dispatch PyPI publish:

gh api repos/LearningCircuit/local-deep-research/dispatches \
  -f event_type=publish-pypi \
  -F 'client_payload[tag]=v<X.Y.Z>'

Once PyPI publishes successfully, manually create the GitHub Release from the existing tag (the SBOM/sig/provenance artifacts are still uploaded as workflow artifacts on the failed release.yml run; you can download them and attach manually, or re-run create-release manually if the run is still re-runnable in the Actions UI).

Earlier iterations of this refactor described a single approval gate with a pre-approval testing window. That design required DOCKER_USERNAME / DOCKER_PASSWORD to be repo-level secrets so the canonical build could run without env approval. They are env-scoped to release instead, so the gate sits in front of the build. The atomicity refactor preserves this single-approval model — one click unlocks the whole chain, and create-release runs last so the "published Release with broken artifacts" failure mode is closed.

👥 Who Can Release

Code owners (defined in .github/CODEOWNERS):

@LearningCircuit
@hashedviking
@djpetti

📝 Release Workflow

For Regular Releases:

Bump version in src/local_deep_research/__version__.py (or merge the auto-bump PR opened by .github/workflows/version_check.yml).
Merge to main → Release automatically created. The release workflow renders fragments from changelog.d/*.md into docs/release_notes/<X.Y.Z>.md, composes the body (AI narrative + rendered changelog + auto PR list), and publishes the GitHub release.
Approve publishing in GitHub Actions (PyPI/Docker).
Merge the cleanup PR opened automatically by the cleanup-changelog job (titled chore: clear changelog fragments for <X.Y.Z>). It persists docs/release_notes/<X.Y.Z>.md and removes the consumed fragments from changelog.d/. Squash-merge — the diff has no review value beyond a sanity check that the rendered notes look right.

To preview the rendered notes locally before merging the bump:

pdm run towncrier build --draft --version <X.Y.Z>

(--draft writes nothing.) Skip if no fragments exist for this release — the workflow tolerates a missing per-version file (warns and proceeds with auto-notes plus AI summary only).

For Hotfixes:

Create hotfix branch from main
Make minimal fix
Bump patch version (e.g., 0.4.3 → 0.4.4)
Fast-track review by code owners
Merge to main → Automatic release

🔧 Manual Release Options

Option A: Manual Trigger

Go to Actions → "Create Release" → "Run workflow"
No inputs are required: the workflow reads the version from src/local_deep_research/__version__.py at HEAD. To release an older or different version, use Option B (push a version tag).

Option B: Version Tags

git tag v0.4.3 && git push origin v0.4.3
Automatically creates release; the workflow uses the tag's commit SHA (not main HEAD), so this is the correct path for backporting.

🛡️ Branch Protection

Main branch is protected
Required reviews from code owners
No direct pushes - only via approved PRs
Status checks must pass (CI tests)

📦 Version Numbering

Follow Semantic Versioning:

Major (X.0.0): Breaking changes
Minor (0.X.0): New features, backward compatible
Patch (0.0.X): Bug fixes, backward compatible
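
As a quick illustration of these rules, a version bump could be computed like this. `bump` is a hypothetical helper for the sketch, not part of the release tooling, and it assumes a plain `X.Y.Z` version string.

```python
def bump(version: str, part: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # breaking change resets minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # new feature resets patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # bug fix
    raise ValueError(f"unknown part: {part!r}")
```

The hotfix example earlier in this guide (0.4.3 to 0.4.4) corresponds to `bump("0.4.3", "patch")`.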

🚨 Emergency Procedures

If automation fails, do NOT create a GitHub release through the UI as the first recovery step — under the atomicity refactor, a manually created GitHub release does NOT trigger publish.yml (it listens only on repository_dispatch) and does NOT trigger docker-publish.yml (workflow_call only). The downstream release: listeners that DO fire (backwards-compatibility.yml, sbom.yml) are observability-only.

Recovery, in order of preference:

1. Check workflow logs in GitHub Actions to identify which job failed, and use the targeted recovery for that failure mode:
   - PyPI failure with Docker already promoted: see Recovery from PyPI failure above.
   - Any other failure: re-run the failed job via the Actions UI if it's still re-runnable (typically within 30 days).
2. Re-trigger the full pipeline via workflow_dispatch if re-running individual jobs isn't possible. Safe for digest-keyed cosign verification: old digests remain valid because their cosign artifacts persist; the new run produces a new digest with its own signatures.
3. Contact code owners if recovery requires manual Docker Hub or PyPI intervention.

📝 Release-notes flow (towncrier news fragments)

Hand-written release notes are assembled from per-PR fragments using towncrier. This replaces the older shared docs/release_notes/<version>.md model, which broke down at LDR's PR throughput (multiple PRs/day racing for the same file).

Contributor side

Each PR with user-visible behavior change drops one tiny markdown file:

changelog.d/<PR-number>.<category>.md

Categories: breaking, security, feature, bugfix, removal, misc (canonical list lives in [[tool.towncrier.type]] entries in pyproject.toml; the pre-commit hook reads it from there). Orphan fragments (no PR/issue number) use changelog.d/+<slug>.<category>.md. The pre-commit hook (recommend-release-notes) nudges contributors who add ≥20 source lines without a fragment, and validates filenames so a typo'd category doesn't silently vanish at render time. See changelog.d/README.md for the full convention.
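
The filename convention can be checked with a short validator along these lines. This is a sketch, not the actual recommend-release-notes hook. The category list is copied from the paragraph above rather than read from pyproject.toml, the slug character set is an assumption, and `parse_fragment` is a hypothetical name.

```python
import re

# Copied from the guide; the canonical list lives in pyproject.toml.
CATEGORIES = ("breaking", "security", "feature", "bugfix", "removal", "misc")

# <PR-number>.<category>.md or +<slug>.<category>.md (slug charset assumed).
_FRAGMENT_RE = re.compile(
    r"(?:(?P<pr>\d+)|\+(?P<slug>[A-Za-z0-9_-]+))\.(?P<category>[a-z]+)\.md"
)


def parse_fragment(filename: str) -> tuple[str, str]:
    """Return (identifier, category), or raise ValueError for a bad name."""
    match = _FRAGMENT_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized fragment name: {filename!r}")
    category = match.group("category")
    if category not in CATEGORIES:
        # A typo such as "featur" would otherwise vanish at render time.
        raise ValueError(f"unknown category {category!r} in {filename!r}")
    return match.group("pr") or match.group("slug"), category
```

Rejecting unknown categories at commit time is what keeps a misspelled fragment from being dropped from the rendered notes.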

Maintainer side

The release workflow handles the render. There is nothing to run manually before merging the version bump.

.github/workflows/release.yml (create-release job):

Sparse-checks-out changelog.d/ + pyproject.toml, installs towncrier~=24.8.
Runs towncrier build --yes --version <X.Y.Z> against the runner's throwaway workspace. Towncrier writes the rendered output to docs/release_notes/<X.Y.Z>.md (per the {version}-templated filename; single_file = false makes this a per-release file rather than appending to a master CHANGELOG) and removes the consumed fragments locally.
Reads the rendered file as input to the AI summary and as the "hand-written notes" section of the published GitHub release body.

Persistence to main happens in the cleanup-changelog job, which re-runs the same render against the release commit (github.sha) and opens a chore/post-release-cleanup-<X.Y.Z> PR with the deletions and the rendered file. Squash-merge it.

If changelog.d/ is empty (maintenance release with no fragments), the render step is skipped cleanly and the release proceeds with the AI narrative + auto PR list only — no hard failure.

Preview without committing

pdm run towncrier build --draft --version <X.Y.Z>

--draft renders to stdout without touching any files or fragments. Useful while iterating on a fragment locally.

📊 Release Checklist

Version updated in __version__.py (or auto-bump PR merged)
Code owner approval received
CI tests passing
Merge to main completed
Release automatically created
PyPI/Docker publishing approved
chore: clear changelog fragments for <X.Y.Z> PR merged
