mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
* ci(nuclei): authenticate scan + seed URL list from Flask url_map Previously the Nuclei DAST job ran against an unauthenticated single target (`http://localhost:5000`) with no URL list. Because Nuclei is template-driven (not a crawler) and the LDR app is auth-gated, the scanner only ever saw `/auth/login`, the index, and a couple of unauthenticated endpoints. The 2-minute scan over 10k templates produced only 5 info-level findings, all of which were intentional design choices (CSP `unsafe-inline`, SameSite=Lax, OPTIONS verb, form detection) — i.e. the gate was effectively a green-checkmark. Now the workflow: 1. Pre-creates the standard CI `test_admin` user via the existing `init_test_database.py` helper (avoids slow registration + rate limits). 2. Logs in via the real /auth/login flow with CSRF token, captures the Flask session cookie, and verifies via /auth/check. 3. Dumps the Flask url_map (excluding parameterized routes, static, and POST-only endpoints) into urls.txt so Nuclei probes every blueprint route, not just `/`. 4. Runs Nuclei with `-list urls.txt` and the authenticated session cookie via `-H "Cookie: session=..."`. 5. Filters to severity >= low to drop the four info-level findings that are intentional design choices. The session cookie is masked in logs via `::add-mask::` so it doesn't leak into the run output. Test credentials match the convention used by the playwright-webkit-tests and puppeteer-e2e-tests workflows. Adds scripts/ci/dump_url_map.py as a small helper that imports `create_app()` and iterates `app.url_map.iter_rules()` — reusable from other DAST workflows (e.g. ZAP API scan) that benefit from URL seeding. * ci(nuclei): address findings from review pass Three differentiated review agents flagged five actionable items on the authenticated-Nuclei PR. This commit addresses all five: * dump_url_map.py: stop skipping parameterized routes. Substitute a Flask-converter-appropriate placeholder (int/float→1, uuid→all-zeros, default→"nuclei") so Nuclei still probes path-traversal / parameter- injection / SQLi templates against routes like /research/<research_id> and /api/research/<research_id>/status. Without this, the bulk of the authenticated app surface (history, research, API blueprints) was silently excluded — which defeats the PR's purpose. * nuclei.yml -etags intrusive,dos,fuzz: now that Nuclei holds a real session, default templates could mutate state or DoS the runner. This is the standard exclusion set for authenticated DAST. * nuclei.yml: replace `cat cookies.txt` in the missing-cookie error branch with a column-filtered `awk` that omits the value column. The cookie is masked via `::add-mask::` after this point, so the previous branch could leak the session token in CI logs if the extraction regex ever broke. * nuclei.yml: add `sleep 2` between auth/check and the Nuclei step so the post-login background thread (settings migration + library init, see web/auth/routes.py:_perform_post_login_tasks) finishes before probes start and 500 on settings-dependent routes. * nuclei.yml: drop `# pragma: allowlist secret` on TEST_PASSWORD. The repo uses gitleaks (.gitleaks.toml already allowlists `testpass123`), not detect-secrets — the pragma was dead weight. Out of scope for this PR (recorded but not changed): - 3-way credential drift (init_test_database.py / nuclei.yml / auth_helper.js all hardcode test_admin/testpass123) - Nuclei binary version `latest` auto-updating (matches existing CI) - create_app() side effects in dump_url_map.py (currently benign)
77 lines
2.2 KiB
Python
Executable File
77 lines
2.2 KiB
Python
Executable File
#!/usr/bin/env python3
|
|
"""
|
|
Dump the Flask url_map to a file as one absolute URL per line.
|
|
|
|
Used by the Nuclei DAST workflow to seed a URL list so the scanner
|
|
probes the actual application surface (authenticated routes, API
|
|
endpoints, blueprints) instead of just the index page.
|
|
|
|
Parameterized routes (`/research/<string:research_id>/status`) are
|
|
emitted with a converter-appropriate placeholder so Nuclei still
|
|
exercises the path. The substituted URL will usually 404 (the resource
|
|
doesn't exist for the test user), but that is fine — Nuclei probes path
|
|
traversal, parameter injection, and SQLi templates against the URL
|
|
pattern, not against a specific resource.
|
|
|
|
Skips:
|
|
- The Flask `static` endpoint (asset serving, no app logic).
|
|
- Routes that don't accept GET — Nuclei templates almost exclusively
|
|
issue GET probes, so POST-only endpoints just generate 405s.
|
|
|
|
Usage:
|
|
python scripts/ci/dump_url_map.py http://127.0.0.1:5000 > urls.txt
|
|
"""
|
|
|
|
import re
|
|
import sys
|
|
|
|
|
|
# Map Flask URL converters to a placeholder that satisfies the converter
|
|
# so Flask routes the request to the handler instead of 404-ing at the
|
|
# converter stage. Anything not listed falls back to a plain string.
|
|
_PLACEHOLDERS = {
|
|
"int": "1",
|
|
"float": "1",
|
|
"uuid": "00000000-0000-0000-0000-000000000000",
|
|
}
|
|
_DEFAULT_PLACEHOLDER = "nuclei"
|
|
|
|
_PARAM_RE = re.compile(
|
|
r"<(?:(?P<conv>[a-zA-Z_][a-zA-Z0-9_]*)(?:\([^)]*\))?:)?(?P<name>[a-zA-Z_][a-zA-Z0-9_]*)>"
|
|
)
|
|
|
|
|
|
def _substitute(match: "re.Match[str]") -> str:
|
|
return _PLACEHOLDERS.get(match.group("conv") or "", _DEFAULT_PLACEHOLDER)
|
|
|
|
|
|
def main() -> int:
|
|
if len(sys.argv) != 2:
|
|
print(f"Usage: {sys.argv[0]} BASE_URL", file=sys.stderr)
|
|
return 2
|
|
|
|
base_url = sys.argv[1].rstrip("/")
|
|
|
|
from local_deep_research.web.app_factory import create_app
|
|
|
|
app, _ = create_app()
|
|
|
|
seen: set[str] = set()
|
|
for rule in app.url_map.iter_rules():
|
|
if rule.endpoint == "static":
|
|
continue
|
|
if "GET" not in (rule.methods or set()):
|
|
continue
|
|
path = _PARAM_RE.sub(_substitute, rule.rule)
|
|
url = f"{base_url}{path}"
|
|
if url in seen:
|
|
continue
|
|
seen.add(url)
|
|
print(url)
|
|
|
|
return 0
|
|
|
|
|
|
if __name__ == "__main__":
|
|
sys.exit(main())
|