🧹 chore: Cap PR Indexes at 3 and Add Delete-Before-Sync (#12672)

* fix: add docker system prune before image pull to prevent disk exhaustion

  The 60GB droplet filled up after ~40 deploys because each docker compose
  pull leaves the previous image's layers as dangling/unused. The gitnexus
  image is ~700MB, so ~40 stale copies ≈ 28GB of dead layers. Combined with
  indexes, OS, and Docker's build cache, the disk hits 100% and the next
  pull fails with 'no space left on device'.

  Add a docker system prune -af --volumes BEFORE pulling the new image on
  every deploy. This removes stopped containers, unused networks, all
  images not referenced by a running container, and build cache. Running
  containers are never touched. Typically frees 1-2GB per deploy (the
  previous image's layers).

  Also add a hard 2GB free-space guard after prune so the deploy fails
  with a clear error instead of letting docker pull attempt a 700MB
  extract onto a near-full disk.

* fix: cap PR indexes at 3 + delete-before-sync for 10GB disk

  The 10GB droplet has ~2GB free. Each index is ~130MB, so 7 PR indexes
  (~900MB) plus main+dev (~260MB) plus the ~700MB Docker image leaves
  almost nothing for image pulls. The deploy failed with 'no space left
  on device' during docker compose pull.

  Three changes:

  1. Cap PR indexes at MAX_PR_INDEXES=3. The resolve step now sorts PR
     artifacts by created_at descending and only keeps the 3 most recent.
     Older PR indexes are logged as evicted and their droplet folders get
     cleaned by the prune step.

  2. Prune BEFORE sync (was after). Freeing disk space from evicted
     indexes before rsyncing new data is critical on a tight disk. The
     old order (sync then prune) could briefly hold both old evicted
     indexes and newly-uploaded ones simultaneously.

  3. Delete-before-sync for every index, including main/dev. Instead of
     rsync --delete (which transfers new files then removes extras),
     rm -rf the target folder before rsync so the disk never holds both
     old and new copies of the same index (~260MB saved per index).
     Main/dev are only deleted when a fresh artifact is about to replace
     them; they are never evicted between deploys.

  Budget on 10GB disk:

    OS + Docker engine:       ~4.0 GB
    Docker image (running):   ~0.7 GB
    main + dev indexes:       ~0.26 GB
    3 PR indexes:             ~0.39 GB
    Docker prune headroom:    ~0.7 GB  (for image pull)
    Free:                     ~3.9 GB

* refine: restrict automatic PR indexing to danny-avila authored PRs

  With 200+ open PRs and a 10GB disk capped at 3 served PR indexes,
  auto-indexing every contributor PR burns CI minutes for artifacts that
  will mostly be evicted before anyone queries them.

  Narrow the pull_request auto-trigger to PRs authored by danny-avila
  only. Other contributors' PRs can still be indexed on demand via
  /gitnexus index (contributor-gated comment command) or manual
  workflow_dispatch. Both arrive as workflow_dispatch events and bypass
  the pull_request filter entirely.

* fix: drop --volumes from docker system prune to preserve Caddy TLS state

  The deploy workflow explicitly handles a caddy-not-running state later
  in the same step. If Caddy is stopped when the prune runs, --volumes
  deletes the caddy-data and caddy-config volumes (TLS certs + ACME
  account keys), forcing a Let's Encrypt re-issuance on next start. LE
  rate-limits to 5 certs per domain per week, so repeated wipes could
  brick HTTPS for days.

  docker system prune -af (without --volumes) still removes stopped
  containers, unused networks, all dangling/unreferenced images, and
  build cache, which is where the disk savings come from. Named volumes
  are left untouched.

* fix: rsync-then-swap instead of delete-before-sync

  The delete-before-sync pattern removed the live index BEFORE rsync ran.
  If rsync failed (SSH timeout, disk pressure, network error), the index
  was already gone, and production served nothing for that repo until a
  later deploy succeeded.

  Replace with rsync-then-swap: upload to a .new temp directory, and only
  rm + mv into place after rsync succeeds. On rsync failure, the .new
  temp is cleaned up and the old index stays live.

  The cost is ~130MB of extra disk while both old and new coexist, but
  the prune step runs first and frees evicted PR indexes, so this fits
  comfortably on the 10GB disk.

* fix: fail deploy on main/dev rsync failure, soft-fail PRs only

  The rsync-then-swap pattern downgraded ALL failures to a warning, so
  the deploy continued even when LibreChat or LibreChat-dev failed to
  sync. The job would pull the new image, restart the container, and
  report success while serving stale or missing core indexes.

  Split by criticality: main/dev rsync failures now exit 1 (aborting the
  deploy before the container restart). PR index failures remain
  soft-fail with a warning: a missing PR index is inconvenient but
  shouldn't take the whole server down.
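
The upload-then-swap flow and the critical/soft-fail split described in the last two items can be tried on a local filesystem. The following is a minimal sketch, not the workflow's code: `swap_in`, `is_critical`, and `COPY_CMD` are hypothetical names, and a local `cp -a` stands in for the `rsync` over SSH that the real step uses.

```shell
#!/usr/bin/env bash
# Local sketch of upload-then-swap. swap_in, is_critical and COPY_CMD are
# hypothetical; the workflow runs the equivalent steps over ssh/rsync.

# Copy command, overridable so a failed transfer can be simulated
# (for example COPY_CMD=false).
COPY_CMD=${COPY_CMD:-cp -a}

# Succeeds only for the two core index names used in the workflow.
is_critical() {
  case "$1" in
    LibreChat|LibreChat-dev) return 0 ;;
    *) return 1 ;;
  esac
}

# swap_in SRC_DIR INDEX_ROOT NAME
# Returns 0 on success or on a tolerated PR-index failure,
# and 1 when a critical index fails to copy.
swap_in() {
  local src="$1" root="$2" name="$3"
  local tmp="$root/$name.new"
  rm -rf "$tmp"
  mkdir -p "$tmp"
  if $COPY_CMD "$src/." "$tmp/"; then
    # Copy succeeded: drop the old index, rename the new one into place.
    rm -rf "${root:?}/$name"
    mv "$tmp" "$root/$name"
    echo "$name swapped"
  else
    # Copy failed: discard the partial temp dir, leave the old index alone.
    rm -rf "$tmp"
    if is_critical "$name"; then
      echo "critical index $name failed to sync" >&2
      return 1
    fi
    echo "PR index $name failed to sync; keeping previous copy" >&2
  fi
}
```

Because `rm -rf` and `mv` are separate steps, there is a short window in which the index directory does not exist; renaming the old directory aside before moving the new one in would narrow that window, at the cost of one more cleanup step.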
2026-06-15 23:43:06 +03:00 · 2026-04-15 09:46:48 -04:00
parent f6b73938af
commit 76e9543f99
2 changed files with 100 additions and 31 deletions
--- a/.github/workflows/gitnexus-deploy.yml
+++ b/.github/workflows/gitnexus-deploy.yml
@@ -247,7 +247,18 @@ jobs:
              }
            }

-            for (const { pr, artifactName, fresh } of prMatches) {
+            // Cap to the N most recent PR indexes by artifact creation time.
+            // On a 10GB droplet each index is ~130MB; 3 PRs + main + dev ≈
+            // 650MB of index data, leaving headroom for the ~700MB Docker image
+            // and OS. Older PR indexes are evicted by the prune step.
+            const MAX_PR_INDEXES = 3;
+            prMatches.sort(
+              (a, b) => new Date(b.fresh.created_at) - new Date(a.fresh.created_at),
+            );
+            const keptPrs = prMatches.slice(0, MAX_PR_INDEXES);
+            const evictedPrs = prMatches.slice(MAX_PR_INDEXES);
+
+            for (const { pr, artifactName, fresh } of keptPrs) {
              serve.push({
                name: `LibreChat-pr-${pr.number}`,
                artifactName,
@@ -255,7 +266,13 @@ jobs:
              });
              core.info(`PR #${pr.number}: run ${fresh.workflow_run.id} -> LibreChat-pr-${pr.number}`);
            }
-            core.info(`Resolved ${prMatches.length} PR indexes out of ${openPrs.length} open PRs`);
+            if (evictedPrs.length) {
+              core.info(
+                `Evicted ${evictedPrs.length} older PR indexes (cap=${MAX_PR_INDEXES}): ` +
+                  evictedPrs.map((e) => `#${e.pr.number}`).join(', '),
+              );
+            }
+            core.info(`Serving ${keptPrs.length} PR indexes out of ${prMatches.length} with artifacts (${openPrs.length} open PRs total)`);

            if (!serve.length) {
              core.setFailed('No indexes to serve');
@@ -360,37 +377,22 @@ jobs:
            .do/gitnexus/Caddyfile \
            "$SSH_USER@$SSH_HOST:/opt/gitnexus/"

-      - name: Rsync indexes and prune stale ones
+      - name: Prune stale indexes then sync fresh ones
        env:
          SSH_USER: ${{ secrets.GITNEXUS_DO_USER }}
          SSH_HOST: ${{ secrets.GITNEXUS_DO_HOST }}
          ACTIVE_NAMES: ${{ steps.resolve.outputs.active_names }}
        run: |
          set -e
-          # Push every active index up
-          for dir in staging/*/; do
-            [ -d "$dir" ] || continue
-            name=$(basename "$dir")
-            echo "Syncing $name"
-            ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
-              "mkdir -p /opt/gitnexus/indexes/$name"
-            rsync -az --delete -e "ssh -i ~/.ssh/deploy_key" \
-              "$dir" \
-              "$SSH_USER@$SSH_HOST:/opt/gitnexus/indexes/$name/"
-          done
-
-          # Prune any folders on the droplet that aren't in the active set.
-          # This cleans up closed PRs the cleanup workflow might have missed,
-          # and is safe because main/dev/PR-<N> are always present if active.
+          # ── Step 1: prune FIRST ────────────────────────────────
+          # Remove any folders on the droplet that aren't in the active set.
+          # This frees disk BEFORE rsyncing new data, which matters on a
+          # 10GB disk where each index is ~130MB.
          echo "Pruning stale indexes (keeping: $ACTIVE_NAMES)"
          ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
            ACTIVE_NAMES="$ACTIVE_NAMES" bash <<'REMOTE'
            set -e
            cd /opt/gitnexus/indexes || exit 0
-            # nullglob makes `for dir in */` expand to nothing when the
-            # directory is empty (first deploy), instead of the literal
-            # string "*/". Explicit no-op > relying on rm -f to silently
-            # tolerate a nonexistent file named "*".
            shopt -s nullglob
            IFS=',' read -ra ACTIVE <<< "$ACTIVE_NAMES"
            for dir in */; do
@@ -404,8 +406,49 @@ jobs:
                rm -rf "$dir"
              fi
            done
+            echo "Disk after prune:"
+            df -h / | tail -1
          REMOTE

+          # ── Step 2: rsync-then-swap ─────────────────────────────
+          # Upload each index to a temp directory, then atomically swap
+          # it into place. If rsync fails, the old index survives intact
+          # and the partial temp dir is cleaned up — no production data
+          # is lost. The brief period where both old + new exist costs
+          # ~130MB of extra disk, but the prune step already freed
+          # space from evicted PR indexes so this fits on a 10GB disk.
+          for dir in staging/*/; do
+            [ -d "$dir" ] || continue
+            name=$(basename "$dir")
+            echo "Syncing $name (rsync-then-swap)"
+            ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+              "mkdir -p /opt/gitnexus/indexes/${name}.new"
+            if rsync -az -e "ssh -i ~/.ssh/deploy_key" \
+              "$dir" \
+              "$SSH_USER@$SSH_HOST:/opt/gitnexus/indexes/${name}.new/"; then
+              # Swap: remove old, rename new into place
+              ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+                "rm -rf /opt/gitnexus/indexes/$name && mv /opt/gitnexus/indexes/${name}.new /opt/gitnexus/indexes/$name"
+              echo "  $name swapped successfully"
+            else
+              # Clean up the partial temp dir
+              ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+                "rm -rf /opt/gitnexus/indexes/${name}.new"
+              # main/dev are critical — abort the deploy so the failure
+              # is visible and the container isn't restarted with stale
+              # or missing data. PR indexes are best-effort.
+              case "$name" in
+                LibreChat|LibreChat-dev)
+                  echo "::error::rsync failed for critical index $name — aborting deploy"
+                  exit 1
+                  ;;
+                *)
+                  echo "::warning::rsync failed for PR index $name — keeping previous index"
+                  ;;
+              esac
+            fi
+          done
+
      - name: Pull image, restart gitnexus, reload Caddy, wait for healthy
        env:
          SSH_USER: ${{ secrets.GITNEXUS_DO_USER }}
@@ -414,6 +457,31 @@ jobs:
          ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" bash <<'REMOTE'
            set -e
            cd /opt/gitnexus
+
+            # ── Disk cleanup ──────────────────────────────────────
+            # Docker accumulates old image layers, dangling images, and
+            # build cache across deploys. On a 60GB droplet with a 700MB+
+            # gitnexus image, this fills the disk after ~40 deploys.
+            # Prune everything not used by currently-running containers
+            # BEFORE pulling the new image so the extract has room.
+            echo "Disk before cleanup:"
+            df -h / | tail -1
+            # Omit --volumes: Caddy's caddy-data and caddy-config volumes
+            # hold TLS certificates and ACME state. If Caddy happens to be
+            # stopped when this runs (the workflow handles that case later),
+            # --volumes would wipe them, forcing Let's Encrypt re-issuance
+            # and risking rate-limit lockout (5 certs/domain/week).
+            docker system prune -af 2>/dev/null || true
+            echo "Disk after cleanup:"
+            df -h / | tail -1
+
+            # Fail fast if disk is critically low even after prune
+            AVAIL_MB=$(df --output=avail -m / | tail -1 | tr -d ' ')
+            if [ "$AVAIL_MB" -lt 2048 ]; then
+              echo "::error::Disk critically low (${AVAIL_MB}MB free). Aborting deploy."
+              exit 1
+            fi
+
            docker compose pull gitnexus
            docker compose up -d --force-recreate gitnexus
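
The free-space guard in the hunk above can be factored so the threshold check is testable without a nearly full disk. This is a sketch under assumptions: `avail_mb` and `check_free` are hypothetical helper names, and `avail_mb` uses POSIX `df -Pk` with `awk` rather than the GNU-specific `df --output=avail -m` that the workflow calls.

```shell
#!/usr/bin/env bash
# Sketch of a free-space guard. avail_mb and check_free are hypothetical
# helpers, not part of the workflow.

# avail_mb PATH: whole megabytes available on the filesystem holding PATH.
# Uses POSIX output format (-P) in 1024-byte blocks (-k); column 4 is "Available".
avail_mb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024) }'
}

# check_free AVAIL_MB MIN_MB: succeed only if AVAIL_MB >= MIN_MB.
check_free() {
  local avail="$1" min="$2"
  if [ "$avail" -lt "$min" ]; then
    echo "Disk critically low (${avail}MB free, need ${min}MB)" >&2
    return 1
  fi
}

# Example use before an image pull (not executed here):
#   check_free "$(avail_mb /)" 2048 || exit 1
```

Separating the measurement from the comparison keeps the 2048MB threshold in one place and lets the comparison be exercised with fixed numbers.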

--- a/.github/workflows/gitnexus-index.yml
+++ b/.github/workflows/gitnexus-index.yml
@@ -45,17 +45,18 @@ env:

 jobs:
  index:
-    # Allow push + dispatch unconditionally; filter native pull_request
-    # events to contributors only. The /gitnexus command workflow does
-    # its own contributor-commenter check before it dispatches this
-    # workflow, so workflow_dispatch is always trusted here — including
-    # the case where the commenter wants to index a non-contributor or
-    # fork PR (the command uses refs/pull/<N>/head so checkout resolves).
+    # Push + dispatch run unconditionally. Native pull_request events
+    # are restricted to PRs authored by danny-avila only — this keeps
+    # automatic CI spend low on a repo with 200+ open PRs.
+    #
+    # Other contributors' PRs can still be indexed on demand:
+    #   - /gitnexus index    (PR comment command, contributor-gated)
+    #   - workflow_dispatch   (manual dispatch from Actions UI)
+    # Both bypass this filter because they arrive as workflow_dispatch,
+    # not pull_request.
    if: |
      github.event_name != 'pull_request' ||
-      github.event.pull_request.author_association == 'OWNER' ||
-      github.event.pull_request.author_association == 'MEMBER' ||
-      github.event.pull_request.author_association == 'COLLABORATOR'
+      github.event.pull_request.user.login == 'danny-avila'
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
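
The job-level `if:` condition above reduces to a two-input decision, which can be written out as a small truth table. The sketch below mirrors it in shell for illustration only; `should_index` is a hypothetical name. One known difference: GitHub Actions expressions compare strings case-insensitively, while this shell comparison is case-sensitive.

```shell
#!/usr/bin/env bash
# Sketch of the gating decision. should_index is a hypothetical helper.

# should_index EVENT_NAME PR_AUTHOR
# Succeeds for any event other than pull_request, and for pull_request
# events only when the PR author is danny-avila.
should_index() {
  local event="$1" author="$2"
  [ "$event" != "pull_request" ] || [ "$author" = "danny-avila" ]
}

# Walk the cases covered by the comment in the workflow.
for case_args in "push -" "workflow_dispatch someone" \
                 "pull_request danny-avila" "pull_request someone"; do
  set -- $case_args
  if should_index "$1" "$2"; then
    echo "$1 by $2: index"
  else
    echo "$1 by $2: skip"
  fi
done
```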