Files
local-deep-research/scripts/ci/dump_url_map.py
LearningCircuit 903a2db8af ci(nuclei): authenticate DAST scan + seed URLs from Flask url_map (#3698)
* ci(nuclei): authenticate scan + seed URL list from Flask url_map

Previously the Nuclei DAST job ran against an unauthenticated single
target (`http://localhost:5000`) with no URL list. Because Nuclei is
template-driven (not a crawler) and the LDR app is auth-gated, the
scanner only ever saw `/auth/login`, the index, and a couple of
unauthenticated endpoints. The 2-minute scan over 10k templates produced
only 5 info-level findings, all of which were intentional design
choices (CSP `unsafe-inline`, SameSite=Lax, OPTIONS verb, form
detection) — i.e. the gate was effectively a green-checkmark.

Now the workflow:
  1. Pre-creates the standard CI `test_admin` user via the existing
     `init_test_database.py` helper (avoids slow registration + rate
     limits).
  2. Logs in via the real /auth/login flow with CSRF token, captures
     the Flask session cookie, and verifies via /auth/check.
  3. Dumps the Flask url_map (excluding parameterized routes, static,
     and POST-only endpoints) into urls.txt so Nuclei probes every
     blueprint route, not just `/`.
  4. Runs Nuclei with `-list urls.txt` and the authenticated session
     cookie via `-H "Cookie: session=..."`.
  5. Filters to severity >= low to drop the four info-level findings
     that are intentional design choices.

The session cookie is masked in logs via `::add-mask::` so it doesn't
leak into the run output. Test credentials match the convention used by
the playwright-webkit-tests and puppeteer-e2e-tests workflows.

Adds scripts/ci/dump_url_map.py as a small helper that imports
`create_app()` and iterates `app.url_map.iter_rules()` — reusable from
other DAST workflows (e.g. ZAP API scan) that benefit from URL seeding.

* ci(nuclei): address findings from review pass

Three differentiated review agents flagged five actionable items on the
authenticated-Nuclei PR. This commit addresses all five:

* dump_url_map.py: stop skipping parameterized routes. Substitute a
  Flask-converter-appropriate placeholder (int/float→1, uuid→all-zeros,
  default→"nuclei") so Nuclei still probes path-traversal / parameter-
  injection / SQLi templates against routes like /research/<research_id>
  and /api/research/<research_id>/status. Without this, the bulk of the
  authenticated app surface (history, research, API blueprints) was
  silently excluded — which defeats the PR's purpose.

* nuclei.yml -etags intrusive,dos,fuzz: now that Nuclei holds a real
  session, default templates could mutate state or DoS the runner. This
  is the standard exclusion set for authenticated DAST.

* nuclei.yml: replace `cat cookies.txt` in the missing-cookie error
  branch with a column-filtered `awk` that omits the value column. The
  cookie is masked via `::add-mask::` after this point, so the previous
  branch could leak the session token in CI logs if the extraction
  regex ever broke.

* nuclei.yml: add `sleep 2` between auth/check and the Nuclei step so
  the post-login background thread (settings migration + library init,
  see web/auth/routes.py:_perform_post_login_tasks) finishes before
  probes start and 500 on settings-dependent routes.

* nuclei.yml: drop `# pragma: allowlist secret` on TEST_PASSWORD. The
  repo uses gitleaks (.gitleaks.toml already allowlists `testpass123`),
  not detect-secrets — the pragma was dead weight.

Out of scope for this PR (recorded but not changed):
- 3-way credential drift (init_test_database.py / nuclei.yml /
  auth_helper.js all hardcode test_admin/testpass123)
- Nuclei binary version `latest` auto-updating (matches existing CI)
- create_app() side effects in dump_url_map.py (currently benign)
2026-04-27 23:11:40 +02:00

77 lines
2.2 KiB
Python
Executable File

#!/usr/bin/env python3
"""
Dump the Flask url_map to a file as one absolute URL per line.
Used by the Nuclei DAST workflow to seed a URL list so the scanner
probes the actual application surface (authenticated routes, API
endpoints, blueprints) instead of just the index page.
Parameterized routes (`/research/<string:research_id>/status`) are
emitted with a converter-appropriate placeholder so Nuclei still
exercises the path. The substituted URL will usually 404 (the resource
doesn't exist for the test user), but that is fine — Nuclei probes path
traversal, parameter injection, and SQLi templates against the URL
pattern, not against a specific resource.
Skips:
- The Flask `static` endpoint (asset serving, no app logic).
- Routes that don't accept GET — Nuclei templates almost exclusively
issue GET probes, so POST-only endpoints just generate 405s.
Usage:
python scripts/ci/dump_url_map.py http://127.0.0.1:5000 > urls.txt
"""
import re
import sys
# Map Flask URL converters to a placeholder that satisfies the converter
# so Flask routes the request to the handler instead of 404-ing at the
# converter stage. Anything not listed falls back to a plain string.
_PLACEHOLDERS = {
"int": "1",
"float": "1",
"uuid": "00000000-0000-0000-0000-000000000000",
}
_DEFAULT_PLACEHOLDER = "nuclei"
_PARAM_RE = re.compile(
r"<(?:(?P<conv>[a-zA-Z_][a-zA-Z0-9_]*)(?:\([^)]*\))?:)?(?P<name>[a-zA-Z_][a-zA-Z0-9_]*)>"
)
def _substitute(match: "re.Match[str]") -> str:
return _PLACEHOLDERS.get(match.group("conv") or "", _DEFAULT_PLACEHOLDER)
def main() -> int:
if len(sys.argv) != 2:
print(f"Usage: {sys.argv[0]} BASE_URL", file=sys.stderr)
return 2
base_url = sys.argv[1].rstrip("/")
from local_deep_research.web.app_factory import create_app
app, _ = create_app()
seen: set[str] = set()
for rule in app.url_map.iter_rules():
if rule.endpoint == "static":
continue
if "GET" not in (rule.methods or set()):
continue
path = _PARAM_RE.sub(_substitute, rule.rule)
url = f"{base_url}{path}"
if url in seen:
continue
seen.add(url)
print(url)
return 0
if __name__ == "__main__":
sys.exit(main())