🧹 chore: Cap PR Indexes at 3 and Add Delete-Before-Sync (#12672)

* fix: add docker system prune before image pull to prevent disk exhaustion

The 60GB droplet filled up after ~40 deploys because each
docker compose pull leaves the previous image's layers as
dangling/unused. The gitnexus image is ~700MB, so ~40 stale
copies ≈ 28GB of dead layers. Combined with indexes, OS, and
Docker's build cache, the disk hits 100% and the next pull fails
with 'no space left on device'.

Add a docker system prune -af --volumes BEFORE pulling the new
image on every deploy. This removes stopped containers, unused
networks, all images not referenced by a running container, and
build cache. Running containers are never touched. Typically
frees 1-2GB per deploy (the previous image's layers).

Also add a hard 2GB free-space guard after prune so the deploy
fails with a clear error instead of letting docker pull attempt
a 700MB extract onto a near-full disk.
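
A minimal bash sketch of the guard described above. The function names `has_headroom` and `root_avail_mb` are hypothetical and do not appear in the workflow; the 2048MB threshold is the 2GB figure from this message, and `df --output` assumes GNU coreutils on the droplet.

```shell
#!/usr/bin/env bash
# Threshold from the commit message: refuse to pull with less than 2GB free.
MIN_FREE_MB=2048

# Succeeds only when the given free space (in MB) meets the threshold.
# Taking the value as an argument keeps the check testable without a real disk.
has_headroom() {
  local avail_mb=$1 min_mb=${2:-$MIN_FREE_MB}
  [ "$avail_mb" -ge "$min_mb" ]
}

# Reads free space on / in MB (GNU df only; an assumption about the host).
root_avail_mb() {
  df --output=avail -m / | tail -1 | tr -d ' '
}

# Intended use, after the prune and before the pull:
#   has_headroom "$(root_avail_mb)" || { echo "::error::disk critically low"; exit 1; }
```

Separating the comparison from the `df` call is a design choice for this sketch only; the workflow itself inlines both.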

* fix: cap PR indexes at 3 + delete-before-sync for 10GB disk

The 10GB droplet has ~2GB free. Each index is ~130MB, so 7 PR indexes
(~900MB) plus main+dev (~260MB) plus the ~700MB Docker image leaves
almost nothing for image pulls. The deploy failed with 'no space left
on device' during docker compose pull.

Three changes:

1. Cap PR indexes at MAX_PR_INDEXES=3. The resolve step now sorts
   PR artifacts by created_at descending and only keeps the 3 most
   recent. Older PR indexes are logged as evicted and their droplet
   folders get cleaned by the prune step.

2. Prune BEFORE sync (was after). Freeing disk space from evicted
   indexes before rsyncing new data is critical on a tight disk. The
   old order (sync then prune) could briefly hold both old evicted
   indexes and newly-uploaded ones simultaneously.

3. Delete-before-sync for every index, including main/dev. Instead
   of rsync --delete (which transfers new files then removes extras),
   rm -rf the target folder before rsync so the disk never holds both
   old and new copies of the same index (avoids a ~260MB old-plus-new peak).
   Main/dev are only deleted when a fresh artifact is about to replace
   them — never evicted between deploys.
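
The cap in change 1 can be sketched in bash as follows. The workflow does this in JavaScript inside the resolve step; this is only an illustration of the same sort-and-slice idea. The helper names and the tab-separated "created_at, name" input format are assumptions, and it relies on ISO-8601 UTC timestamps sorting correctly as plain strings.

```shell
#!/usr/bin/env bash
# Input on stdin: one "created_at<TAB>index_name" line per PR artifact.

# Print the names of the N most recently created indexes, newest first.
keep_newest() {
  local max=$1
  LC_ALL=C sort -r | head -n "$max" | cut -f2
}

# Print the names that fall outside the cap (the ones the prune step removes).
evicted() {
  local max=$1
  LC_ALL=C sort -r | tail -n +"$(( max + 1 ))" | cut -f2
}

# Example (hypothetical data):
#   printf '2026-04-10T10:00:00Z\tLibreChat-pr-101\n' | keep_newest 3
```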

Budget on 10GB disk:
  OS + Docker engine:    ~4.0 GB
  Docker image (running): ~0.7 GB
  main + dev indexes:    ~0.26 GB
  3 PR indexes:          ~0.39 GB
  Docker prune headroom: ~0.7 GB (for image pull)
  Free:                  ~3.9 GB
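
The budget above can be checked with integer arithmetic. This sketch assumes decimal units (10GB = 10000MB); `free_mb` is a hypothetical helper, and the inputs are the rounded figures from the table.

```shell
#!/usr/bin/env bash
# free_mb TOTAL ITEM...: subtract every budget item (in MB) from the disk size.
free_mb() {
  local free=$1
  shift
  local item
  for item in "$@"; do
    free=$(( free - item ))
  done
  echo "$free"
}

# OS + engine, running image, main + dev, 3 PR indexes, pull headroom
free_mb 10000 4000 700 260 390 700   # prints 3950, i.e. about 3.9 GB
```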

* refine: restrict automatic PR indexing to danny-avila authored PRs

With 200+ open PRs and a 10GB disk capped at 3 served PR indexes,
auto-indexing every contributor PR burns CI minutes for artifacts
that will mostly be evicted before anyone queries them.

Narrow the pull_request auto-trigger to PRs authored by danny-avila
only. Other contributors' PRs can still be indexed on demand via
/gitnexus index (contributor-gated comment command) or manual
workflow_dispatch — both arrive as workflow_dispatch events and
bypass the pull_request filter entirely.
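
The trigger rule reads as a two-part condition. The bash function below only mirrors the logic of the workflow's `if:` expression for illustration; `should_auto_index` is a hypothetical name, and the real check is evaluated by GitHub Actions, not by a script.

```shell
#!/usr/bin/env bash
# Succeeds when the index job should run:
#   - any event other than pull_request (push, workflow_dispatch) always runs
#   - pull_request runs only when the PR author is danny-avila
should_auto_index() {
  local event_name=$1 pr_author=${2:-}
  [ "$event_name" != "pull_request" ] || [ "$pr_author" = "danny-avila" ]
}

# Example:
#   should_auto_index pull_request some-contributor || echo "skipped"
```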

* fix: drop --volumes from docker system prune to preserve Caddy TLS state

The deploy workflow explicitly handles a caddy-not-running state later
in the same step. If Caddy is stopped when the prune runs, --volumes
deletes the caddy-data and caddy-config volumes (TLS certs + ACME
account keys), forcing a Let's Encrypt re-issuance on next start.
Let's Encrypt allows only 5 duplicate certificates (same hostname set)
per week, so repeated wipes could break HTTPS for days.

docker system prune -af (without --volumes) still removes stopped
containers, unused networks, all dangling/unreferenced images, and
build cache — which is where the disk savings come from. Named
volumes are left untouched.

* fix: rsync-then-swap instead of delete-before-sync

The delete-before-sync pattern removed the live index BEFORE rsync
ran. If rsync failed (SSH timeout, disk pressure, network error),
the index was already gone — production served nothing for that
repo until a later deploy succeeded.

Replace with rsync-then-swap: upload to a .new temp directory, and
only rm + mv into place after rsync succeeds. On rsync failure,
the .new temp is cleaned up and the old index stays live. The cost
is ~130MB of extra disk while both old and new coexist, but the
prune step runs first and frees evicted PR indexes, so this fits
comfortably on the 10GB disk.
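
A local-filesystem sketch of the rsync-then-swap pattern. Here `cp -R` stands in for rsync over SSH so the example runs anywhere, and `sync_then_swap` is a hypothetical name. As in the workflow, the swap is an rm followed by a mv, so there is a short window in which the target does not exist.

```shell
#!/usr/bin/env bash
# sync_then_swap SRC DEST
# Copy SRC into DEST.new and replace DEST only if the copy succeeded.
# On failure the partial DEST.new is removed and DEST is left untouched.
sync_then_swap() {
  local src=$1 dest=$2
  local tmp="${dest}.new"
  rm -rf "$tmp"
  mkdir -p "$tmp"
  if cp -R "$src/." "$tmp/"; then
    # Swap: remove the old copy, then rename the new one into place.
    rm -rf "$dest" && mv "$tmp" "$dest"
  else
    # Clean up the partial upload; the old index stays live.
    rm -rf "$tmp"
    return 1
  fi
}

# Example:
#   sync_then_swap staging/LibreChat /opt/gitnexus/indexes/LibreChat
```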

* fix: fail deploy on main/dev rsync failure, soft-fail PRs only

The rsync-then-swap pattern downgraded ALL failures to a warning,
so the deploy continued even when LibreChat or LibreChat-dev failed
to sync. The job would pull the new image, restart the container,
and report success while serving stale or missing core indexes.

Split by criticality: main/dev rsync failures now exit 1 (aborting
the deploy before the container restart). PR index failures remain
soft-fail with a warning — a missing PR index is inconvenient but
shouldn't take the whole server down.
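
The split by criticality amounts to a small policy function. In this sketch `sync_failure_status` is a hypothetical name that returns a status instead of calling `exit`, so the caller decides whether to abort; the index names and annotation prefixes follow the workflow.

```shell
#!/usr/bin/env bash
# sync_failure_status NAME
# Called after an rsync failure for index NAME.
# Returns 1 (abort the deploy) for the core indexes, 0 (warn and continue) otherwise.
sync_failure_status() {
  local name=$1
  case "$name" in
    LibreChat|LibreChat-dev)
      echo "::error::rsync failed for critical index $name" >&2
      return 1
      ;;
    *)
      echo "::warning::rsync failed for PR index $name" >&2
      return 0
      ;;
  esac
}

# Example, inside the sync loop's failure branch:
#   sync_failure_status "$name" || exit 1
```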

Author: Danny Avila
Date: 2026-04-15 09:46:48 -04:00
Committer: GitHub
Parent: f6b73938af
Commit: 76e9543f99
2 changed files with 100 additions and 31 deletions

@@ -247,7 +247,18 @@ jobs:
 }
 }
-for (const { pr, artifactName, fresh } of prMatches) {
+// Cap to the N most recent PR indexes by artifact creation time.
+// On a 10GB droplet each index is ~130MB; 3 PRs + main + dev ≈
+// 650MB of index data, leaving headroom for the ~700MB Docker image
+// and OS. Older PR indexes are evicted by the prune step.
+const MAX_PR_INDEXES = 3;
+prMatches.sort(
+(a, b) => new Date(b.fresh.created_at) - new Date(a.fresh.created_at),
+);
+const keptPrs = prMatches.slice(0, MAX_PR_INDEXES);
+const evictedPrs = prMatches.slice(MAX_PR_INDEXES);
+for (const { pr, artifactName, fresh } of keptPrs) {
 serve.push({
 name: `LibreChat-pr-${pr.number}`,
 artifactName,
@@ -255,7 +266,13 @@ jobs:
 });
 core.info(`PR #${pr.number}: run ${fresh.workflow_run.id} -> LibreChat-pr-${pr.number}`);
 }
-core.info(`Resolved ${prMatches.length} PR indexes out of ${openPrs.length} open PRs`);
+if (evictedPrs.length) {
+core.info(
+`Evicted ${evictedPrs.length} older PR indexes (cap=${MAX_PR_INDEXES}): ` +
+evictedPrs.map((e) => `#${e.pr.number}`).join(', '),
+);
+}
+core.info(`Serving ${keptPrs.length} PR indexes out of ${prMatches.length} with artifacts (${openPrs.length} open PRs total)`);
 if (!serve.length) {
 core.setFailed('No indexes to serve');
@@ -360,37 +377,22 @@ jobs:
 .do/gitnexus/Caddyfile \
 "$SSH_USER@$SSH_HOST:/opt/gitnexus/"
-- name: Rsync indexes and prune stale ones
+- name: Prune stale indexes then sync fresh ones
 env:
 SSH_USER: ${{ secrets.GITNEXUS_DO_USER }}
 SSH_HOST: ${{ secrets.GITNEXUS_DO_HOST }}
 ACTIVE_NAMES: ${{ steps.resolve.outputs.active_names }}
 run: |
 set -e
-# Push every active index up
-for dir in staging/*/; do
-[ -d "$dir" ] || continue
-name=$(basename "$dir")
-echo "Syncing $name"
-ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
-"mkdir -p /opt/gitnexus/indexes/$name"
-rsync -az --delete -e "ssh -i ~/.ssh/deploy_key" \
-"$dir" \
-"$SSH_USER@$SSH_HOST:/opt/gitnexus/indexes/$name/"
-done
-# Prune any folders on the droplet that aren't in the active set.
-# This cleans up closed PRs the cleanup workflow might have missed,
-# and is safe because main/dev/PR-<N> are always present if active.
+# ── Step 1: prune FIRST ────────────────────────────────
+# Remove any folders on the droplet that aren't in the active set.
+# This frees disk BEFORE rsyncing new data, which matters on a
+# 10GB disk where each index is ~130MB.
 echo "Pruning stale indexes (keeping: $ACTIVE_NAMES)"
 ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
 ACTIVE_NAMES="$ACTIVE_NAMES" bash <<'REMOTE'
 set -e
 cd /opt/gitnexus/indexes || exit 0
-# nullglob makes `for dir in */` expand to nothing when the
-# directory is empty (first deploy), instead of the literal
-# string "*/". Explicit no-op > relying on rm -f to silently
-# tolerate a nonexistent file named "*".
 shopt -s nullglob
 IFS=',' read -ra ACTIVE <<< "$ACTIVE_NAMES"
 for dir in */; do
@@ -404,8 +406,49 @@ jobs:
 rm -rf "$dir"
 fi
 done
+echo "Disk after prune:"
+df -h / | tail -1
 REMOTE
+# ── Step 2: rsync-then-swap ─────────────────────────────
+# Upload each index to a temp directory, then atomically swap
+# it into place. If rsync fails, the old index survives intact
+# and the partial temp dir is cleaned up — no production data
+# is lost. The brief period where both old + new exist costs
+# ~130MB of extra disk, but the prune step already freed
+# space from evicted PR indexes so this fits on a 10GB disk.
+for dir in staging/*/; do
+[ -d "$dir" ] || continue
+name=$(basename "$dir")
+echo "Syncing $name (rsync-then-swap)"
+ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+"mkdir -p /opt/gitnexus/indexes/${name}.new"
+if rsync -az -e "ssh -i ~/.ssh/deploy_key" \
+"$dir" \
+"$SSH_USER@$SSH_HOST:/opt/gitnexus/indexes/${name}.new/"; then
+# Swap: remove old, rename new into place
+ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+"rm -rf /opt/gitnexus/indexes/$name && mv /opt/gitnexus/indexes/${name}.new /opt/gitnexus/indexes/$name"
+echo " $name swapped successfully"
+else
+# Clean up the partial temp dir
+ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" \
+"rm -rf /opt/gitnexus/indexes/${name}.new"
+# main/dev are critical — abort the deploy so the failure
+# is visible and the container isn't restarted with stale
+# or missing data. PR indexes are best-effort.
+case "$name" in
+LibreChat|LibreChat-dev)
+echo "::error::rsync failed for critical index $name — aborting deploy"
+exit 1
+;;
+*)
+echo "::warning::rsync failed for PR index $name — keeping previous index"
+;;
+esac
+fi
+done
 - name: Pull image, restart gitnexus, reload Caddy, wait for healthy
 env:
 SSH_USER: ${{ secrets.GITNEXUS_DO_USER }}
@@ -414,6 +457,31 @@ jobs:
 ssh -i ~/.ssh/deploy_key "$SSH_USER@$SSH_HOST" bash <<'REMOTE'
 set -e
 cd /opt/gitnexus
+# ── Disk cleanup ──────────────────────────────────────
+# Docker accumulates old image layers, dangling images, and
+# build cache across deploys. On a 60GB droplet with a 700MB+
+# gitnexus image, this fills the disk after ~40 deploys.
+# Prune everything not used by currently-running containers
+# BEFORE pulling the new image so the extract has room.
+echo "Disk before cleanup:"
+df -h / | tail -1
+# Omit --volumes: Caddy's caddy-data and caddy-config volumes
+# hold TLS certificates and ACME state. If Caddy happens to be
+# stopped when this runs (the workflow handles that case later),
+# --volumes would wipe them, forcing Let's Encrypt re-issuance
+# and risking rate-limit lockout (5 certs/domain/week).
+docker system prune -af 2>/dev/null || true
+echo "Disk after cleanup:"
+df -h / | tail -1
+# Fail fast if disk is critically low even after prune
+AVAIL_MB=$(df --output=avail -m / | tail -1 | tr -d ' ')
+if [ "$AVAIL_MB" -lt 2048 ]; then
+echo "::error::Disk critically low (${AVAIL_MB}MB free). Aborting deploy."
+exit 1
+fi
 docker compose pull gitnexus
 docker compose up -d --force-recreate gitnexus

@@ -45,17 +45,18 @@ env:
 jobs:
 index:
-# Allow push + dispatch unconditionally; filter native pull_request
-# events to contributors only. The /gitnexus command workflow does
-# its own contributor-commenter check before it dispatches this
-# workflow, so workflow_dispatch is always trusted here — including
-# the case where the commenter wants to index a non-contributor or
-# fork PR (the command uses refs/pull/<N>/head so checkout resolves).
+# Push + dispatch run unconditionally. Native pull_request events
+# are restricted to PRs authored by danny-avila only — this keeps
+# automatic CI spend low on a repo with 200+ open PRs.
+#
+# Other contributors' PRs can still be indexed on demand:
+# - /gitnexus index (PR comment command, contributor-gated)
+# - workflow_dispatch (manual dispatch from Actions UI)
+# Both bypass this filter because they arrive as workflow_dispatch,
+# not pull_request.
 if: |
 github.event_name != 'pull_request' ||
-github.event.pull_request.author_association == 'OWNER' ||
-github.event.pull_request.author_association == 'MEMBER' ||
-github.event.pull_request.author_association == 'COLLABORATOR'
+github.event.pull_request.user.login == 'danny-avila'
 runs-on: ubuntu-latest
 timeout-minutes: 25
 steps: