local-deep-research

mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

Author	SHA1	Message	Date
LearningCircuit	bc45bbf2e7	fix(ci): make LDR research workflow honestly fail on Python crash (#4226 ) * fix(ci): make LDR research workflow honestly fail on Python crash A real run (job 77511717371, PR #4225) crashed with glibc 'double free or corruption (!prev)' but the workflow reported success and the caller posted a hollow PR comment. Two cooperating defects: the script's exit code was discarded inside set +e / set -e, and `jq .` exits 0 on a zero-byte response.json so the JSON-shape check passed on empty input. Capture the exit code, harden the validation order (exit -> non-empty -> jq -e shape -> .error -> .research non-empty), tee stderr to a log surfaced in the ::error:: annotation, upload the artifact with if: always() so failed runs leave debuggable evidence, and flush stdout in a finally block in the script so a SIGABRT during interpreter shutdown after json.dumps can't drop the otherwise-completed output. Matches the house pattern from dockle.yml and the jq -e idiom from release-gate.yml. * chore(ci): enable faulthandler in ldr-research.py Dumps a Python traceback to stderr on SIGABRT/SIGSEGV/SIGFPE/SIGBUS/SIGILL before the signal re-fires. Pairs with the stderr-capture plumbing earlier in this PR: on the next glibc abort the ::error:: annotation will show "Fatal Python error: Aborted" plus the actual Python stack frame, making the deps-level investigation possible without a re-run. Verified locally: a deliberately aborted child process emits its frame through faulthandler before exiting on signal 6.	2026-05-24 09:45:14 +02:00
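The validation order named in the commit above (exit -> non-empty -> shape -> .error -> .research non-empty) can be sketched in Python. This is a minimal illustration, not the workflow's actual shell/jq code: the function name `validate_research_output` and the failure messages are hypothetical, and only the `error` and `research` keys come from the commit message.

```python
import json
from typing import Optional


def validate_research_output(exit_code: int, raw: bytes) -> Optional[str]:
    """Return None when the output is usable, else a short failure reason.

    Hypothetical helper: checks run in the order the commit describes, so a
    crashed script or a zero-byte file can never pass as a success.
    """
    # 1. A non-zero exit code (e.g. 134 after SIGABRT) fails immediately.
    if exit_code != 0:
        return f"script exited with code {exit_code}"
    # 2. An empty response must not reach the JSON checks at all.
    if not raw.strip():
        return "response is empty"
    # 3. The payload must parse and be a JSON object.
    try:
        data = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and bad UTF-8
        return f"response is not valid JSON: {exc}"
    if not isinstance(data, dict):
        return "response is not a JSON object"
    # 4. An explicit error reported by the script wins over any content.
    if data.get("error"):
        return f"script reported an error: {data['error']}"
    # 5. Finally, the research text itself must be present and non-empty.
    research = data.get("research")
    if not isinstance(research, str) or not research.strip():
        return "research field is missing or empty"
    return None
```

Checking the exit code first matters because a crash during interpreter shutdown can leave well-formed JSON behind; the later checks alone would not notice it.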
LearningCircuit	653707a556	fix(encoding): add encoding="utf-8" to bare open() / read_text / write_text in examples and scripts (#4118 ) Cleanup follow-up to #3797. The check-open-encoding hook was originally scoped with exclude: ^(tests/|examples/|scripts/) because those directories had ~45 pre-existing bare open() calls and addressing them was out of scope for the core Windows bug fix. This commit: * adds encoding="utf-8" to 45 read/write call sites under examples/ and scripts/ — JSON benchmark results, config-doc generators, workflow status pages, and the datetime-timezone pre-commit hook * narrows the hook exclude to ^tests/ only, so future regressions in examples/scripts/ are blocked at commit time Windows users running the benchmark scripts and config-doc generator would previously hit silent failures or UnicodeDecodeErrors on non-ASCII content under cp1252. The package itself was already protected by #3797.	2026-05-18 21:45:04 +02:00
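The pattern this commit applies can be shown with a small round trip. This is an illustrative sketch only: `save_results`, `load_results`, and `roundtrip_demo` are hypothetical names, not functions from the repository.

```python
import json
import tempfile
from pathlib import Path


def save_results(path: Path, results: dict) -> None:
    # Explicit encoding: without it, open() uses the locale default,
    # which is typically cp1252 on Windows.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, ensure_ascii=False)


def load_results(path: Path) -> dict:
    # read_text() needs the same explicit encoding as open().
    return json.loads(path.read_text(encoding="utf-8"))


def roundtrip_demo() -> dict:
    """Write and re-read non-ASCII content through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "results.json"
        save_results(path, {"title": "Überblick: naïve café"})
        return load_results(path)
```

With both sides pinned to UTF-8, the result no longer depends on the platform's default code page.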
LearningCircuit	2723331f67	chore(ci): cut workflow-status.md regen diff noise (#4066 ) The auto-regenerated workflow-status.md on every version-bump PR produced ~15 rows of churn that wasn't signal: - Status emoji column flipped between ✅ / · / ⏳ depending on which event last ran (e.g. backwards-compatibility flipped ✅→· because the most recent run was a skipped workflow_call, not because it regressed). The live badge column to its right is the source of truth for current status anyway, and run history lives in GitHub Actions itself. Drop the column. - Last activity buckets oscillated across this week / last week / 2 weeks ago for healthy daily/weekly workflows. Coarsen to last 30 days / 1-3 months ago / 3-6 months ago / long ago / never so a healthy workflow sits in one bucket indefinitely. Net effect: regenerations in steady state produce zero diff. Real signal (new stale/disabled workflows, aging past the 30d bucket) still surfaces.	2026-05-16 13:20:21 +02:00
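The coarsened buckets can be sketched as a pure function of two calendar dates. The bucket labels come from the commit message; the exact day boundaries (30, 90, 180) and the name `activity_bucket` are assumptions made for illustration.

```python
from datetime import date
from typing import Optional


def activity_bucket(last_run: Optional[date], today: date) -> str:
    """Map a last-run date to a coarse label (boundaries are assumed)."""
    if last_run is None:
        return "never"
    days = (today - last_run).days
    if days <= 30:
        return "last 30 days"
    if days <= 90:
        return "1-3 months ago"
    if days <= 180:
        return "3-6 months ago"
    return "long ago"
```

A workflow that runs daily or weekly always lands in `last 30 days`, so two regenerations produce the same cell and therefore no diff.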
LearningCircuit	9755a900eb	ci(research): extract reusable LDR-research workflow + add issue-trigger caller (#3987 ) * ci(research): extract reusable LDR-research workflow + add issue-trigger caller Three triggers will end up calling the same install-and-run-LDR plumbing (PR diff today, issue body now, Reddit posts later). Factor out the middle of the workflow into a reusable workflow so we don't have to maintain the same logic in three places, and add the issue-trigger caller on top of it. Changes: - .github/workflows/ldr-research-reusable.yml (new) — workflow_call workflow that takes a fully-assembled query and returns a comment-ready markdown blob via artifact. Inputs include forward-compat knobs the future Reddit caller will need (max-query-length, max-sources, comment-footer override, include-sources-section, output-truncate-chars). - .github/workflows/e2e-research-test.yml — refactored from a single job to three jobs (build-query → research-via-reusable → post-comment). Behaviour is preserved: same headers, same footer, same diff truncation at MAX_DIFF_SIZE, same label-removal on completion. - .github/workflows/issue-research.yml (new) — triggers on `issues: types: [labeled]` gated by the same `ldr_research` label the PR workflow uses (GitHub event-type gating means they don't conflict). Output has two sections: "For the reporter" (cautious framing) and "For maintainers" (raw research context). Issue body is sanitized (control-char strip, 4000-char truncation) and never reaches a shell. - scripts/ldr-research.py — renamed from ldr-diff-research.py (`git mv`, history preserved). Drops --mode, --static-query, --max-diff-size: query now comes from stdin only and the caller workflow does prompt assembly. Output JSON shape: {research, sources, findings, iterations}. - .github/labels.yml — register ldr_research and ldr_research_static so they exist canonically rather than via on-the-fly creation. 
Reddit research is a follow-up PR; this PR ships the abstraction shape it will need. * docs(ci): regenerate workflow status dashboard for new LDR workflows The check-structure CI gate requires every workflow file to have a row in docs/ci/workflow-status.md. Regenerate to add rows for the two new workflows added in this PR. The live-status flips on unrelated rows (gitleaks, ossf-scorecard, responsive-ui-tests-enhanced, osv-scanner) are accurate snapshots of current status — the auto-regen workflow keeps them fresh on its own schedule. * ci(research): address review feedback — label cleanup, delimiter, artifact Three small follow-ups from the AI review on this PR: 1. Label cleanup on build-query failure. The post-comment job had `if: always() && needs.research.result != 'skipped'`, which meant that if build-query failed, research was skipped and the entire post-comment job (including the label-removal step) was skipped too — leaving a stuck `ldr_research` label on the PR/issue. Switch to `if: always()`; the download and post steps already self-guard with `needs.research.outputs.success == 'true'`, so only the label-removal step runs in the failure path. 2. Randomized GHA output delimiter. `__LDR_QUERY_EOF__` was a fixed string; a query containing that exact line could prematurely terminate the multi-line output. Use $$/$RANDOM/nanosecond as the delimiter base. Defense-in-depth — collision was already astronomically unlikely. 3. Optional `artifact-suffix` input on the reusable workflow. Until now the artifact name was `ldr-research-{run_id}-{run_attempt}-{github.job}`, which collides if a caller invokes the reusable multiple times in one run. The Reddit follow-up will use a matrix call, so add a caller-provided suffix now and sanitize it to artifact-safe chars. Existing callers don't pass it; default empty preserves today's name. * ci(research): fix per-line truncation in reusable workflow Two follow-ups from the second review pass: 1. 
The awk-based backstop truncation in `Write query to file` was per-line (operating on $0 / length($0)), not total. A long multi-line query with many short lines would silently bypass the max-query-length cap. Swap for a wc -c + head -c approach that truncates total bytes. Verified locally that a 114-byte multi-line input with all-short-lines is now correctly truncated to ~100 bytes. 2. Remove the unused EXIT_CODE capture in `Run LDR Research`. The step relies on JSON validation for error detection; capturing $? without using it was just dead code inherited from the original workflow.	2026-05-11 00:44:16 +02:00
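The per-line versus total truncation bug can be reproduced in a few lines of Python. This is a sketch of the idea only; the workflow itself uses awk and then wc -c + head -c, and both function names here are hypothetical. Unlike head -c, the second function drops a partially cut trailing UTF-8 character instead of emitting a broken byte sequence.

```python
def truncate_per_line(text: str, limit: int) -> str:
    """Buggy shape: caps each line separately, so short lines pass untouched."""
    return "\n".join(line[:limit] for line in text.split("\n"))


def truncate_total_bytes(text: str, limit: int) -> str:
    """Fixed shape: caps the total UTF-8 byte length of the whole query."""
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    # errors="ignore" discards a multi-byte character split by the cut.
    return raw[:limit].decode("utf-8", errors="ignore")
```

For example, eleven 10-character lines joined by newlines total 120 bytes; the per-line version returns them unchanged under a 100-byte cap, while the total-bytes version returns exactly 100 bytes.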
LearningCircuit	91b68acafd	docs(ci): auto-generated workflow status dashboard (#3966 ) * docs(ci): add auto-generated workflow status dashboard Adds `docs/ci/workflow-status.md` — a single page that surfaces every GitHub Actions workflow in the repo, grouped by role, with action items (disabled / stale / manual-only) at the top. Live status badges link to each workflow's runs page. Auto-generated from the workflow YAML files + the GitHub API by `scripts/generate_workflow_status.py`. Why: the GitHub Actions tab is chronological-mixed (poor "is anything red right now?" view), and the static workflow table in `CI_CD_INFRASTRUCTURE.md` drifts when workflows are added/renamed (PR #3963 fixed three factually wrong header claims for exactly this reason). A reference page that mechanically reflects current state + identifies dormant workflows answers both gaps. What's surfaced today (verified live): - Disabled: `nuclei.yml` (caller commented out in `release-gate.yml:177`). - Stale: `update-precommit-hooks.yml` — its weekly Friday cron has been failing for 10+ consecutive weeks (since at least 2026-03-06). This was discovered by the dashboard, not previously tracked. - Manual-only: `check-config-docs.yml`, `sync-main-to-dev.yml` (both intentionally manual; the dashboard shows them so they're not forgotten). Generator design notes: - Resolves reusable workflows correctly: `gh run list --workflow=X.yml` is empty for `workflow_call`-only workflows. The script walks the call graph (release.yml → release-gate.yml → semgrep.yml etc.), fetches the parent run's job list, and matches by job key parsed from the caller YAML (not by name heuristic — `gitleaks-scan` ↔ `gitleaks-main.yml` would otherwise collide with `gitleaks.yml`). - Picks "primary trigger" per workflow so e.g. `codeql.yml` (PR + push + cron + workflow_call) gets its glyph from the gated daily run, not a stale PR run. 
- Stale check walks the recent runs list to find last success — a workflow that ran red yesterday and green a week ago is not stale. - Manual edits outside the `<!-- BEGIN/END GENERATED -->` markers are preserved on regeneration; the timestamp lives inside the markers so post-marker content is fully user-owned. - Preflights `gh auth status` and rate limit before any per-workflow call — fails fast with actionable message instead of partial output. CI integration: - `.github/workflows/check-workflow-status.yml` runs `--check-structure` on PRs touching workflows, the dashboard, or the generator. Pure structural check (no API calls, no live data) — fast and deterministic. Live regeneration stays on demand. Cost: ~340 GitHub API calls per regeneration, ~45 sec wall-clock, ~6.8% of the 5000/hr authenticated quota. * fixup(ci): review-pass corrections to workflow status dashboard Surfaced by three rounds of code-review + correctness + security agents on the original PR. Four small fixes; no behavioral change to the generated dashboard's content. 1. Recognize commented job keys — `JOB_KEY_RE` now accepts an optional `# ` prefix. Previously, when an entire job block was commented out (e.g. `release-gate.yml:175-181` for nuclei), the commented `uses:` line inherited the previous active job's key (`gitleaks-scan`) instead of the correct `nuclei-scan`. Latent — commented entries are filtered out before reaching gated-run lookup — but would misattribute status if someone partially uncommented a block (uncommented just the `uses:` line). 2. Pin pyyaml to ==6.0.3 in the CI workflow. The repo convention is exact `==` pins (95% of `pip install` calls in workflows); the only floating range was the one introduced by this PR. Matches pdm.lock. 3. Validate marker order in `merge_with_existing`. If a manual edit leaves the BEGIN/END markers reversed (e.g. mid-merge-conflict), bail to a clean overwrite instead of splicing interleaved garbage. 4. 
Remove `_coerce_jq_stream` — unused helper left behind from an earlier iteration. Zero call sites; no behavior change. Verified by re-running the generator + `--check-structure`. The rendered dashboard's only diff vs prior commit is the regeneration timestamp and live "Last activity" cells (expected — those reflect recent runs since the previous regen). * feat(ci): bucketed activity labels + auto-regen on version bump Two changes that together make the dashboard's diffs meaningful instead of noisy. 1. Coarse activity buckets. Replace exact UTC timestamps in every "Last activity / Last manual run / Last successful run" cell with one of: `this week`, `last week`, `2 weeks ago`, `3 weeks ago`, `last month`, `2 months ago`, `3+ months ago`, `long ago`, `never`. Calendar-day boundaries (no time-of-day jitter) so two regenerations on the same date produce zero diff when nothing actually drifted. Verified: same-day re-runs after stable workflow state → empty diff. Also drop the redundant `Days idle` columns from Stale and Manual-only tables (the bucket label already says it), and round the "Last regenerated" footer to a date. Why: a daily-running healthy workflow used to bump its timestamp every regen (noise). Now it stays in `this week` indefinitely, and the only diffs that land in a version-bump PR are real bucket transitions — exactly the "this slipped from last week to last month — something might be wrong" signal the dashboard exists for. 2. Auto-regenerate on version bump. Add a step to `version_check.yml` right after the existing `generate_config_docs.py` regen. Same pattern as the config docs precedent — the dashboard refresh rides along with each version-bump PR and is reviewable in the same diff. Costs ~340 GitHub API calls per run (well under the GITHUB_TOKEN 1000/hr workflow-runs limit). Adds `actions: read` to the job permissions block; uses `pyyaml==6.0.3` matching pdm.lock. 
* feat(ci): drop regen timestamp; add health banner; fix in-progress false-stale Three follow-ups to keep version-bump diffs strictly meaningful, plus two correctness fixes uncovered by repeated stability testing. 1. Drop the "Last regenerated" date. Git history is authoritative for "when this snapshot was taken"; embedding a date here forced a single-line diff every regeneration even when nothing else drifted. 2. Aggregated health banner at the top of the generated region: `63 workflows: 1 disabled · 1 stale · 2 manual-only · 59 active` Counts only change when a workflow shifts between {disabled, stale, manual, active} — same level of diff-stability as the per-row buckets. 3. `?event=schedule` for own-cron workflow badges. Verified effective by SHA-comparing badge bodies for workflows with multi-event run history. Makes the badge for e.g. `gitleaks.yml`, `fuzz.yml`, `osv-scanner.yml` reflect cron health specifically, rather than whichever PR ran last. The runs-page link uses the matching `?query=event%3Aschedule` so a click lands on the filtered run list. 4. Fix false-stale during in-flight release runs. Previously, when release.yml was running, gates reachable via release.yml (puppeteer-e2e-tests, ci-gate, etc.) would briefly flip to "stale" because `fetch_last_gated_run` returned the in-progress run first and `last_success` couldn't see past it. Now the function walks all 5 caller runs and returns both the latest match (for activity) and the latest successful match (for staleness), avoiding the flip. 5. Map all GitHub conclusion enum values. A `gitleaks.yml` run completed with `action_required` between two test regens; the glyph table didn't have it and rendered `?`. Added every documented value (`neutral`, `timed_out`, `stale`, `action_required`) and changed the unknown-fallback from `?` to em-dash, so future GitHub-side enum additions don't introduce a false-positive diff. Verified: two same-day regens after workflow state has settled now produce zero diff. 
* ci(version-bump): make workflow-status regen non-blocking Add `continue-on-error: true` to the dashboard regeneration step in version_check.yml. The regen calls ~340 GitHub API endpoints and would otherwise block the entire version-bump PR if any of them transiently fail (rate-limit hit, GitHub Actions outage, etc.). The failure mode should be "dashboard stays at the previous snapshot until next successful regen", not "release pipeline is blocked". The sibling `generate_config_docs.py` step doesn't need this — it's purely local with no external API dependency.	2026-05-10 15:58:32 +02:00
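The marker handling described for `merge_with_existing` (preserve manual edits outside the markers, fall back to a clean overwrite when the markers are missing or reversed) might look roughly like this. The function name comes from the commit message, but the body and the exact marker strings are assumptions, not the generator's real code.

```python
# Assumed marker strings; the commit only calls them BEGIN/END GENERATED.
BEGIN = "<!-- BEGIN GENERATED -->"
END = "<!-- END GENERATED -->"


def merge_with_existing(existing: str, generated: str) -> str:
    """Splice freshly generated content between the markers of an existing page."""
    block = f"{BEGIN}\n{generated}\n{END}"
    start = existing.find(BEGIN)
    end = existing.find(END)
    # Missing or reversed markers (e.g. mid-merge-conflict): do not splice,
    # overwrite the page with a clean generated block instead.
    if start == -1 or end == -1 or end < start:
        return block + "\n"
    # Keep everything before BEGIN and after END exactly as the user left it.
    return existing[:start] + block + existing[end + len(END):]
```

Validating the order before slicing is what prevents the "interleaved garbage" case: with reversed markers, the slices before and after would overlap.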
LearningCircuit	1315b679e0	ci(research): switch E2E research workflow to langgraph-agent strategy (#3965 ) * ci(research): switch E2E research workflow to langgraph-agent strategy The ldr_research label runs scripts/ldr-diff-research.py, which until now didn't pass a search_strategy and so fell through to the quick_summary default of source_based. Switch to the agentic langgraph-agent strategy so the workflow exercises the autonomous research path. - Adds --strategy CLI arg and LDR_STRATEGY env var, default langgraph-agent (consistent with the existing --provider / --search-tool / --iterations pattern). - Workflow exposes LDR_STRATEGY: vars.LDR_STRATEGY || 'langgraph-agent' so the choice is overridable per-repo via Variables. - Notes in the script docstring that LDR_ITERATIONS=1 is a no-op for the langgraph strategy (which reads langgraph_agent.max_iterations from settings instead). * ci(research): consolidate model var to LDR_RESEARCH_MODEL The workflow had two model variables — vars.LDR_MODEL for diff mode and vars.LDR_STATIC_MODEL for static mode — selected by a small set-model step. Collapse to a single LDR_RESEARCH_MODEL variable shared by both labels, mirroring the AI reviewer's vars.AI_MODEL pattern. - Default: google/gemini-2.0-flash-001 (the value the script was already falling through to). - Override via Settings → Variables → New repository variable → name: LDR_RESEARCH_MODEL. - The set-model step is removed; the workflow now passes the env var through directly. - Script reads LDR_RESEARCH_MODEL instead of LDR_MODEL. Note: existing repo variables LDR_MODEL and LDR_STATIC_MODEL become orphaned by this rename and can be deleted from repo settings. * ci(research): stop overriding strategy iterations from the workflow Previously the workflow set LDR_ITERATIONS=1 and the script forwarded that as iterations= in kwargs. 
For source_based that capped research at one iteration; for langgraph-agent it was effectively a no-op (langgraph reads max_iterations, not iterations) but the wiring was misleading. - Drop LDR_ITERATIONS from the workflow env block. - Make --iterations default to None in the script and only forward it to quick_summary when explicitly set on the CLI. - Each strategy now uses its own setting-driven default unless overridden — for langgraph-agent that means langgraph_agent.max_iterations (default 50) flows through unchanged. * ci(research): split research model into MAIN + CHEAP per label Bring back per-label model selection with cleaner names: - ldr_research → vars.LDR_RESEARCH_MODEL (deep PR analysis, user-configurable) - ldr_research_static → vars.LDR_RESEARCH_CHEAP_MODEL (regression smoke, kept cheap) Both default to google/gemini-2.0-flash-001 if unset, so existing behaviour stays identical until you actually configure cheap-model. The script and its env-var contract are unchanged — the workflow just picks which value to feed into LDR_RESEARCH_MODEL based on the applied label.	2026-05-10 13:10:02 +02:00
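Two of the behaviours in this commit, forwarding `iterations` only when it is set explicitly and choosing the model per label, can be sketched as follows. The flag names, label names, variable names, and default model come from the commit message; `build_kwargs`, `pick_model`, and the dict standing in for repository variables are hypothetical, and the LDR_STRATEGY environment fallback is left out for brevity.

```python
import argparse
from typing import Dict, List

DEFAULT_MODEL = "google/gemini-2.0-flash-001"


def build_kwargs(argv: List[str]) -> Dict[str, object]:
    """Build keyword arguments for the research call from CLI flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--strategy", default="langgraph-agent")
    parser.add_argument("--iterations", type=int, default=None)
    args = parser.parse_args(argv)

    kwargs: Dict[str, object] = {"search_strategy": args.strategy}
    # Only forward iterations when the caller asked for it, so each strategy
    # otherwise keeps its own settings-driven default.
    if args.iterations is not None:
        kwargs["iterations"] = args.iterations
    return kwargs


def pick_model(label: str, repo_vars: Dict[str, str]) -> str:
    """Choose the model variable for the applied label (workflow-side logic)."""
    key = (
        "LDR_RESEARCH_CHEAP_MODEL"
        if label == "ldr_research_static"
        else "LDR_RESEARCH_MODEL"
    )
    # Unset or empty variables fall back to the shared default.
    return repo_vars.get(key) or DEFAULT_MODEL
```

Leaving `iterations` out of the kwargs entirely, rather than passing `None`, is what lets the callee apply its own default.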
LearningCircuit	903a2db8af	ci(nuclei): authenticate DAST scan + seed URLs from Flask url_map (#3698 ) * ci(nuclei): authenticate scan + seed URL list from Flask url_map Previously the Nuclei DAST job ran against an unauthenticated single target (`http://localhost:5000`) with no URL list. Because Nuclei is template-driven (not a crawler) and the LDR app is auth-gated, the scanner only ever saw `/auth/login`, the index, and a couple of unauthenticated endpoints. The 2-minute scan over 10k templates produced only 5 info-level findings, all of which were intentional design choices (CSP `unsafe-inline`, SameSite=Lax, OPTIONS verb, form detection) — i.e. the gate was effectively a green-checkmark. Now the workflow: 1. Pre-creates the standard CI `test_admin` user via the existing `init_test_database.py` helper (avoids slow registration + rate limits). 2. Logs in via the real /auth/login flow with CSRF token, captures the Flask session cookie, and verifies via /auth/check. 3. Dumps the Flask url_map (excluding parameterized routes, static, and POST-only endpoints) into urls.txt so Nuclei probes every blueprint route, not just `/`. 4. Runs Nuclei with `-list urls.txt` and the authenticated session cookie via `-H "Cookie: session=..."`. 5. Filters to severity >= low to drop the four info-level findings that are intentional design choices. The session cookie is masked in logs via `::add-mask::` so it doesn't leak into the run output. Test credentials match the convention used by the playwright-webkit-tests and puppeteer-e2e-tests workflows. Adds scripts/ci/dump_url_map.py as a small helper that imports `create_app()` and iterates `app.url_map.iter_rules()` — reusable from other DAST workflows (e.g. ZAP API scan) that benefit from URL seeding. * ci(nuclei): address findings from review pass Three differentiated review agents flagged five actionable items on the authenticated-Nuclei PR. This commit addresses all five: * dump_url_map.py: stop skipping parameterized routes. 
Substitute a Flask-converter-appropriate placeholder (int/float→1, uuid→all-zeros, default→"nuclei") so Nuclei still probes path-traversal / parameter- injection / SQLi templates against routes like /research/<research_id> and /api/research/<research_id>/status. Without this, the bulk of the authenticated app surface (history, research, API blueprints) was silently excluded — which defeats the PR's purpose. * nuclei.yml -etags intrusive,dos,fuzz: now that Nuclei holds a real session, default templates could mutate state or DoS the runner. This is the standard exclusion set for authenticated DAST. * nuclei.yml: replace `cat cookies.txt` in the missing-cookie error branch with a column-filtered `awk` that omits the value column. The cookie is masked via `::add-mask::` after this point, so the previous branch could leak the session token in CI logs if the extraction regex ever broke. * nuclei.yml: add `sleep 2` between auth/check and the Nuclei step so the post-login background thread (settings migration + library init, see web/auth/routes.py:_perform_post_login_tasks) finishes before probes start and 500 on settings-dependent routes. * nuclei.yml: drop `# pragma: allowlist secret` on TEST_PASSWORD. The repo uses gitleaks (.gitleaks.toml already allowlists `testpass123`), not detect-secrets — the pragma was dead weight. Out of scope for this PR (recorded but not changed): - 3-way credential drift (init_test_database.py / nuclei.yml / auth_helper.js all hardcode test_admin/testpass123) - Nuclei binary version `latest` auto-updating (matches existing CI) - create_app() side effects in dump_url_map.py (currently benign)	2026-04-27 23:11:40 +02:00
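The placeholder substitution for parameterized routes can be illustrated on plain rule strings, without importing Flask. This is a sketch under assumptions: the real `dump_url_map.py` iterates `app.url_map.iter_rules()`, whereas the regex and the name `fill_rule` below are hypothetical; only the placeholder values (int/float to 1, uuid to all zeros, everything else to "nuclei") come from the commit message.

```python
import re

# Placeholder per converter, as listed in the commit message.
PLACEHOLDERS = {
    "int": "1",
    "float": "1",
    "uuid": "00000000-0000-0000-0000-000000000000",
}

# Matches <name>, <conv:name>, and <conv(args):name> in a rule string.
RULE_VAR = re.compile(
    r"<(?:(?P<conv>[a-zA-Z_]+)(?:\([^)]*\))?:)?(?P<name>[a-zA-Z_][a-zA-Z0-9_]*)>"
)


def fill_rule(rule: str) -> str:
    """Replace each URL variable with a converter-appropriate placeholder."""

    def repl(match: "re.Match[str]") -> str:
        # A bare <name> has no converter; treat it like the default one.
        converter = match.group("conv") or "default"
        return PLACEHOLDERS.get(converter, "nuclei")

    return RULE_VAR.sub(repl, rule)
```

For instance, `fill_rule("/api/research/<int:research_id>/status")` yields `/api/research/1/status`, which gives the scanner a concrete URL to probe instead of skipping the route.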
LearningCircuit	3b1d6c6b2f	feat: redesign journal quality system with data-driven scoring and predatory auto-removal (#3081 ) * feat: redesign journal quality system with data-driven scoring and predatory auto-removal Replace the expensive LLM-based journal scoring (SearXNG + AdvancedSearchSystem per journal) with a tiered data-driven approach: Tier 0: DB cache (instant, from previous runs) Tier 1: Predatory check — auto-removes results from blacklisted journals/publishers Tier 2: OpenAlex snapshot — h-index + DOAJ from ~217K sources (downloaded at runtime) Tier 3: DOAJ check — quality floor for open access journals (downloaded at runtime) Tier 4: LLM analysis — SearXNG fallback (now optional, not required) Bundled data: - Stop Predatory Journals: 6K predatory publishers/journals (MIT license) Downloadable data (CC0, loaded if present): - OpenAlex sources snapshot: 217K journals/conferences with h-index, impact factor - DOAJ journals: 22K+ journals with DOAJ Seal status Key changes: - Extended Journal DB model with bibliometric fields (h-index, impact factor, DOAJ, predatory status, provenance tracking) + Alembic migration - JournalReputationFilter now uses tiered scoring with journal dedup - SearXNG no longer required — filter works with bundled data alone - Predatory journals auto-removed (with whitelist override for false positives) - Added journal filter to Semantic Scholar (was the only scientific engine without it) - OpenAlex results now include source_id and source_type for direct lookups - Fixed score parsing (regex instead of strict int()), prompt truncation, fail-fast on SearXNG failures, lru_cache on name cleaning * fix: address code review findings from Round 1 - Remove dead __check_result method, update tests to use filter_results - Fix predatory substring matching (min length guard prevents false positives) - Add name parameter to is_whitelisted for journals without ISSN - Fix migration: server_default for Booleans, correct index creation logic - 
Improve safety net logging in filter_results * fix: forward journal quality fields through _get_full_content (Round 2 review) OpenAlex _get_full_content was constructing a new result dict without forwarding journal_ref, openalex_source_id, and source_type from the preview. This effectively disabled journal quality filtering for all OpenAlex results since the content filters run after full content retrieval and couldn't find the journal_ref key. * fix: address Round 3 review findings — bugs, thread safety, tests Critical bug fixes: - Add missing quality_model column to migration 0005 - Fix dedup to use richest metadata (two-pass approach) - Predatory cache entries no longer expire via normal TTL Performance: - Build indexed sets for predatory data at load time (O(1) exact match) - Add threading.Lock for singleton and lazy property loading Data quality: - Deduplicate predatory.json (removed 21 dupes) Test coverage (38 new tests): - JournalDataManager: derive_quality_score, is_predatory, is_whitelisted, lookup_openalex, lookup_doaj, _expand_openalex_record, singleton * fix: address all review findings — critical bugs, security, performance Critical bugs: NASA ADS journal_ref, empty string guard, regex name cleaning with LLM fallback, DOAJ field overwrite protection, predatory cache TTL re-evaluation. Security: prompt injection sanitization, log injection prevention, Unicode NFKC normalization for predatory lookups. Important bugs: predatory publish-after-indexes race fix, Tier 0 DB error handling. Performance: regex-based name cleaning eliminates ~5 LLM calls/batch. * fix: .text() → .content for LangChain, improve regex name cleaning Critical runtime fix: - LangChain AIMessage has .content attribute, not .text() method. Both LLM calls in the filter (name cleaning and Tier 4 scoring) would crash with AttributeError at runtime. Fixed both occurrences and updated all test mocks. 
Regex improvements: - Add bare trailing citation number stripping (", 95, 146802") - Add volume(issue) pattern stripping ("141(5)") - Fix month regex: require at least 1 digit after month name and add word boundaries (prevents "May" in journal names being stripped) - Only skip LLM when regex result has no residual numerics — complex citation strings like "Phys. Rev. Lett. 95, 146802 (2005)" correctly fall through to LLM instead of returning partially-cleaned name * feat: add journal quality dashboard at /metrics/journals Dashboard with summary stats, quality distribution chart, score source doughnut, sortable/filterable journal table with pagination, quality badges, trust signal icons, empty state, help panel, mobile responsive. API: GET /metrics/api/journals — all journals + summary in one call. * fix: XSS prevention, missing API fields, sort null handling in dashboard Security: - Add escHtml() helper for HTML entity escaping in all innerHTML injections (journal names, publishers, predatory_source, source badges) - Prevents XSS via crafted journal names containing HTML/JS API: - Add works_count and cited_by_count to journal API response (bibliometric fields useful for dashboard display) UX: - Fix sort comparison with null values: nulls pushed to end consistently instead of unpredictable placement from mixed Infinity/string comparison * fix: dashboard null-quality filter, avg h-index N/A, core label - Fix null-quality journals appearing in predatory tier filter (quality || 0 coerced null to 0, which passed predatory check) - Fix avg h-index showing "0" when no journals have h-index data (API now returns null, frontend shows "—") - Rename "Scopus Indexed" to "Core Indexed" (OpenAlex is_core is CWTS core status, not Scopus indexing) * feat: SQLite reference DB for dashboard with server-side pagination Replace client-side 212K journal array with a shared read-only SQLite database built from bundled JSON on first access. Near-zero RAM usage. 
* perf: split summary from pagination queries in journal dashboard Summary stats + chart data (3 SQL queries, ~130ms) are now fetched only on initial page load via include_summary=true param. Subsequent pagination, sorting, and filter changes only fetch the journal page (1 query, ~7ms), making navigation feel instant. * fix: expose Chart.js globally, split summary from pagination queries - Add window.Chart = Chart in app.js so inline scripts can use Chart.js (was imported but never exposed on window — caused ReferenceError) - Split summary from pagination: include_summary=true only on initial load, page/filter/sort skip the 3 extra SQL queries - NOTE: run `npm run build` to rebuild the Vite bundle * fix: guard Chart.js usage and defer initial load for module script timing The Vite bundle loads as type="module" (deferred), but the inline script in journal_quality.html runs immediately. Chart is not yet on window when the script executes, causing ReferenceError that kills the entire script block including the data loading call. Fix: guard Chart usage with typeof checks, defer loadJournalPage to window.onload so module scripts have finished executing. * feat: upgrade journal filter logs from debug to info level Users can now see the tiered scoring process in their logs: - Tier 0: cache hit with score - Tier 1: predatory detection + whitelist override - Tier 2: OpenAlex match with h-index - Tier 3: DOAJ match with seal status - Tier 4: LLM analysis result - Summary: passed/below-threshold/predatory breakdown * fix: add 'the' prefix fallback for journal name lookups, add lookup logs Many OpenAlex journals start with 'The ' (e.g., 'The Astrophysical Journal Letters') but ArXiv journal_ref omits it. Now tries with/without 'the ' prefix when exact match fails — fixes ~5K potential Tier 2 misses that would unnecessarily fall through to expensive Tier 4 LLM analysis. Applied to both JournalDataManager (in-memory) and JournalReferenceDB (SQLite). 
Added debug-level logs for lookup hits/misses. * feat: quality tags in sources, sidebar menu, documentation - Attach journal quality score to each result in filter_results - Display quality tags in research output source lists: [Q1 ★★★★★] for elite, [Q2 ★★★] for moderate, etc. - Add "Journals" item to sidebar under Analytics section - Create docs/journal-quality.md with full system documentation * fix: restore docstrings, increase DOAJ Seal score, fix truncated file Address djpetti's review comments: - Restore full Args/Returns docstrings on __init__, create_default, __db_session, __make_search_system, __clean_journal_name, __analyze_journal_reputation, __save_journal_to_db - Remove "unlike the previous version" reference from create_default - Add clarifying comment on regex vs LLM name cleaning tradeoff - Increase DOAJ Seal score from 6 to 7 (2-point spread vs 1-point) - Fix file truncation from disk-full error (line 763) * refactor: move build logic into journal_reference_db module Eliminate sys.path hack, make build logic importable. Script is now a thin CLI wrapper. derive_quality_score imported from data_manager (canonical copy) instead of duplicating. * fix: review findings — docs, sidebar, dashboard, test gaps Address final review round findings: - Fix DOAJ Seal score in docs (6→7) - Sidebar: use url_for() instead of hardcoded URL - Template: set active_page='journal-quality' for sidebar highlight - Rename stat-scopus to stat-seal with label "DOAJ Seal" (was mislabeled) - Always use window.onload for initial load (readyState fast path unsafe) - Add tests for _format_quality_tag (6 tests, all 5 tier branches + None) - Add tests for "the" prefix fallback in lookup_source (2 tests) * feat: add CORE conference rankings (795 CS conferences) Bundle CORE Rankings (ICORE2026) for automatic conference scoring: A*→9, A→7, B→5, C→4. Acronym + proceedings prefix matching. Eliminates Tier 4 LLM calls for major CS conferences. 
* feat: add data source attribution to journal quality dashboard Credit the open academic data projects that make the dashboard possible: OpenAlex (CC0), DOAJ (CC0), CORE Rankings, Stop Predatory Journals (MIT). Displayed as an attribution section at the bottom of the page. * fix: remove CORE conference data (no open license) CORE Rankings are copyrighted (c) 2013 Computing Research & Education with no published open license. Redistribution in an MIT project is not permitted without explicit permission. Removed core_conferences.json from bundled data. The build function _load_core_conferences gracefully returns {} when the file is absent. Conference matching still works via OpenAlex data + proceedings prefix stripping. Verified remaining data licenses: - OpenAlex: CC0 Public Domain (confirmed) - DOAJ metadata: CC0 (confirmed on doaj.org) - Stop Predatory Journals: MIT License (confirmed in GitHub LICENSE) * docs: add data source attribution to README, docs, code, and dashboard Credit open academic data projects at multiple touchpoints: - README.md: Journal Quality feature links to data sources - docs/journal-quality.md: expanded attribution table with websites - data/__init__.py: license details per bundled file - journal_reference_db.py: data sources in module docstring - Dashboard: attribution section with links (already added) All bundled data verified: OpenAlex (CC0), DOAJ metadata (CC0), Stop Predatory Journals (MIT). * fix: DOAJ Seal score consistency across all tiers Tier 2 (OpenAlex) now cross-references DOAJ for Seal status via dm.has_doaj_seal(issn). Tier 3 now calls derive_quality_score instead of hardcoding score=6. All tiers consistently score DOAJ Seal at 7. Fixed docs inconsistency. * feat: add CitationMetadata model for structured academic metadata New citation_metadata table stores bibliographic data on academic research sources using CSL-JSON vocabulary. 1:1 with ResearchResource. 
- CitationMetadata model: doi, arxiv_id, pmid, authors, year, volume, issue, pages, container_title, journal_id FK, csl_json - Migration 0006: create table + indexes - citation_normalizer.py: engine-specific → CSL-JSON normalization - extract_links: preserve citation fields (was dropping 90% of data) - research_sources_service: create CitationMetadata for academic sources - Quality never stored — derived via journal_id at query time * refactor: simplify Journal table to only cache Tier 4 LLM results Tiers 1-3 use bundled data (instant, no caching needed). Only Tier 4 (LLM) results cached in DB. Wire up journal_id FK on CitationMetadata. * feat: auto-download journal data from GitHub Releases Replace bundled data files with on-demand download: - journal_data_downloader.py: fetch from GitHub Releases on first use - Data in user dir (not package dir, read-only in pip installs) - Dashboard shows download banner when data missing - API: GET/POST /metrics/api/journal-data/{status,download} - predatory.json (307KB) stays bundled, large files never in git * refactor: fetch journal data from APIs instead of GitHub Releases Fetch directly from OpenAlex and DOAJ public APIs. No redistribution concerns — data fetched fresh from CC0 sources (~3 min first run). * fix: review findings — h_index=0 edge case, dead code, missing field - derive_quality_score: h_index=0 no longer bypasses DOAJ Seal score (0 means newly indexed, not low quality) - citation_normalizer: remove dead arxiv check in detect_engine - extract_links: add source_engine to preserved fields - paths.py: fix stale docstring (GitHub Releases → APIs) * fix: DB race condition and journal name normalization (Round 3 review) - Wrap __save_journal_to_db commit in try/except to handle concurrent inserts gracefully (rollback + warning) instead of incorrectly incrementing the SearXNG failure counter - Add geographic qualifier stripping to regex cleaner: "(London)", "(New York)", "(US)" etc. 
are now stripped deterministically, preventing duplicate scoring of the same journal under variant names * fix: DB race condition and journal name normalization (Round 3 review) - S2 close() now calls super().close() to properly clean up the JournalReputationFilter (SearXNG engine + LLM). Before this fix, adding content_filters to S2 created a resource leak since S2's close() override didn't delegate to BaseSearchEngine.close(). * fix: DB race condition and journal name normalization (Round 3 review) - Fix predatory substring matching: check both directions for renamed publisher variants while keeping >= 10 char guard - DB cache read: logger.exception for stack trace preservation - Model Boolean columns: add server_default=sa_false() - Migration downgrade: drop indexes before columns * fix: correct url_to_quality type annotation after merge (Round 4 review) Type was `dict[str, dict]` but values are `int` scores from the journal quality filter. Changed to `dict[str, int]`. * fix: CI failures — sensitive logging and file write allowlist - journal_data_downloader: use logger.exception() instead of f-string with exception variable (sensitive-logging check) - Add journal_data_downloader.py to file-write security check allowlist (writes public CC0/MIT journal metadata, not user data) * fix: skip journal reference DB tests when DB not built (CI timeout fix) The test fixture was calling db.available which triggers _get_conn() which auto-downloads 200K+ sources from OpenAlex API. In CI this caused 60s timeouts on 26 tests. Now checks db_path.exists() directly. * fix: renumber migration 0005 → 0007 to resolve multiple-heads conflict Main already has 0005_add_resource_document_id and 0006_add_citation_metadata. Our migration was also numbered 0005, causing Alembic to reject login with "multiple heads" error. Renumbered to 0007 with down_revision=0006. 
* fix: align test mock chains with real Tier 0 DB query pattern Tests were mocking .filter_by().first() but real code does .filter_by().filter(score_source=="llm").first(). Fixed mock chains to match. Also fixed docs typo: reanalysis_period default 265 → 365. * fix: journal dashboard showing "not installed" when reference DB exists get_journal_data_status() only checked for raw JSON source files, not the compiled journal_reference.db. If the DB existed without source JSONs (e.g., after cleanup), the dashboard refused to load. * feat: add DOI-based venue identification and conference detection Adds a pre-enrichment layer that resolves paper DOIs to OpenAlex source IDs via batch lookup (up to 50 DOIs per HTTP request). This gives the journal quality filter a reliable ID-based lookup path instead of fragile name matching. Changes: - New: openalex_enrichment.py — batch DOI → source_id resolution - Integration hook in search_engine_base.py for scientific engines - Conference detection heuristic as fallback for papers without DOI - Year stripping in OpenAlex lookup: "NeurIPS 2023" → "NeurIPS" - NASA ADS now extracts DOI to result dict - Fix stale AdvancedSearchSystem mocks in tests * fix: handle missing thread context in preview filter phase The journal filter runs as a preview_filter (before LLM relevance) for instant data lookups. But DB operations (Tier 0 cache, save) require thread context which isn't available in the preview phase. Fix: __db_session() returns None when no context available. Callers skip DB operations gracefully — data-only tiers (1-3) still work. 
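The year stripping mentioned in the DOI-based venue identification commit above ("NeurIPS 2023" → "NeurIPS") could look roughly like this. A minimal sketch: `strip_trailing_year` and the exact regex are assumptions for illustration, not the project's implementation.

```python
import re

# Assumed pattern: a four-digit year (19xx/20xx) at the very end of the
# name, preceded by whitespace so a bare "2023" is left untouched.
_TRAILING_YEAR = re.compile(r"\s+(?:19|20)\d{2}$")


def strip_trailing_year(name):
    """Drop a trailing year from a venue name, e.g. for conference editions."""
    return _TRAILING_YEAR.sub("", name.strip())


cleaned = strip_trailing_year("NeurIPS 2023")
unchanged = strip_trailing_year("Nature")
```

Anchoring on the end of the string keeps years that are part of a title's middle (for example a numbered series) intact.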
* feat: disable Tier 4 LLM journal scoring by default (too slow) * feat: institution scoring tier + DataSource refactor - New DataSource ABC + registry under utilities/data_sources/ unifying openalex, doaj, jabref, predatory, and institutions sources - Add InstitutionSource (OpenAlex Institutions, ~123K records) for affiliation-based scoring of preprints - Add Tier 3.5 (institution lookup) to journal_reputation_filter for the no-journal_ref salvage path and as a max() lift for preprint repositories with weak Tier-2 scores - Extract author affiliations in OpenAlex search engine - Wire JournalReputationFilter into PubMed engine and fix journal_ref field aliasing - Tighten regex cleaner for journal_ref (year/month/volume debris) - Delete bundled src/local_deep_research/data/ — all sources now fetched at runtime with shared auto_download policy - Dashboard banner shows all academic data sources with license + status * refactor: consolidate journal-quality system into one package with SQLAlchemy - New package src/local_deep_research/journal_quality/ groups all journal-related modules (downloader, db, models, scoring, data_sources) - Single source of truth: gz files compile into one journal_quality.db via build_db(); JournalDataManager dict-based loader is deleted - SQLAlchemy 2.0 ORM throughout (models.py + db.py); filter call sites unchanged thanks to dict-shaped lookup return values - Read-only enforcement at three layers: SQLite mode=ro&immutable=1, POSIX chmod 0o444 after build, and a pre-commit hook that bans cross-module writable opens of journal_quality.db - Downloader rebuilds the DB synchronously after each successful fetch - New tables: predatory_journals/_publishers/_hijacked, institutions, abbreviations - Tests migrated to tests/journal_quality/; 207 tests pass * fix: P0/P1 bugs from journal-quality code review - P0: flag hijacked journals as predatory in _populate_sources (loaded into predatory_hijacked but never checked against sources) - P0: insert 
DOAJ-only journals (~8K rows) via second pass over doaj_data; previously only OpenAlex venues entered the DB - P0: replace `mod._ref_db = None` with `reset_db()` in metrics rebuild route (the singleton attr is `_db`, not `_ref_db`) - P0: change JournalQualityDB._lock to RLock to prevent first-run deadlock (_ensure_engine → build_db → reset_db re-acquires lock) - P1: dedup sources on (name_lower, issn) so print + electronic ISSN variants both survive; drop unique=True on Source.name_lower - tests: cover hijacked, DOAJ-only, and dual-ISSN cases * fix: resolve CI failures on journal-quality refactor - pre-commit: add missing .pre-commit-hooks/check-journal-quality-readonly.py to git (file existed locally but was never committed, so CI couldn't exec it) - file-writes scan: extend allowlist to cover the new journal_quality/downloader.py and journal_quality/data_sources/*.py paths (the old `journal_data_downloader.py` entry no longer matches after the package move) - mypy: fix 12 errors in journal_quality/db.py - explicit list[] annotation on `wheres` - dict comprehension on Row sequence in get_source_distribution - wrap loader returns in dict() so SQLAlchemy stub Any-types resolve - type: ignore[arg-type] on bulk_insert_mappings (known stub gap; SQLAlchemy 2.x types accept type[T] at runtime but stubs say Mapper) - CodeQL py/incomplete-url-substring-sanitization: anchor doi.org URL parsing on scheme prefixes instead of substring `in` check * refactor: address djpetti review comments on journal quality system Tier 4 LLM scoring is now opt-in via the new search.journal_reputation.enable_llm_scoring setting (default off) instead of being unreachable behind a hardcoded flag. The redundant in-process lru_cache on the LLM analyzer is gone - Tier 0 (DB cache) already covers repeat lookups, and keeping the cache only masked DB write failures. 
Trailing-year stripping for conference names ("NeurIPS 2023" -> "NeurIPS") moves into __regex_clean_journal_name where it belongs, replacing the post-hoc retry block in __score_journal. DOAJ Seal score bumped 7 -> 8 to reflect the certification meaning more faithfully (top ~10% of DOAJ journals, curated against best OA practices). The h-index >= 7 tier mapping is unchanged so no test fixtures break. Adds /api/journals/research/<id> + a "View Journals" button on the research details page so users can see the journals encountered in a single research session, not just the cross-research aggregate. Joins through CitationMetadata -> ResearchResource without schema changes. Adds quartile (Q1-Q4) as a display-only signal on Source rows, derived at build time from cited_by_count percentile within each source_type. Quality scoring is unchanged - h-index remains the canonical bibliometric. Magic numbers in scoring.py / db.py extracted into a Journal Quality Scoring Thresholds section in constants.py. Institution scoring is now consolidated to scoring.py::institution_score_from_h_index, fixing an unreachable branch in db.py::score_from_affiliations along the way. Misc: - OPENALEX_ENRICHMENT_API_TIMEOUT lifted into constants.py (was hardcoded 15) - Deleted scripts/build_journal_reference_db.py - auto-build on first access plus the dashboard rebuild button cover all use cases * perf(journal-quality): switch data sources to bulk dumps + release-gate test Replace paginated REST API fetches with public bulk snapshots: - OpenAlex Sources: S3 manifest + parts (~280K, ~270s vs 5-10min) - OpenAlex Institutions: S3 manifest + parts (~120K, ~156s vs 5-10min) - DOAJ: single CSV dump (~22K, ~2s) Bulk paths are the OpenAlex/DOAJ-recommended way to pull the full dataset and eliminate hundreds of rate-limited requests on every "Download Data" click. Compact output formats are preserved so the build pipeline and runtime accessors are unchanged. 
Add a release-gate integration test + dedicated workflow that downloads all 5 sources in parallel, builds the reference DB end to end, and scores a real journal. Catches upstream schema breaks (renamed fields, restructured dumps) before we cut a release. * test(journal-quality): exercise dashboard query methods in release gate * docs(journal-quality): credit upstream data providers on dashboard * docs(journal-quality): add 'How It Works' tab explaining tiered scoring * fix(journal-quality): score unknown journals as 3, log institution names - Lower truly-unknown journals (no OpenAlex/DOAJ/Tier 3.5 hit) from pass-through to score 3 so the default threshold (4) actually filters them. Distinct from predatory (1) — these are merely unknown. - Fix AttributeError in OpenAlex search engine when work has DOI key with explicit None value: use `work.get('doi') or work_id` instead of `work.get('doi', work_id)`. Was dropping ~14% of results per search before they reached the filter. - Include matching institution names in Tier 3.5 log lines so the affiliation salvage path is debuggable. * refactor(journal-quality): demote per-journal scoring logs to DEBUG, log institutions on score-3 * fix(openalex): handle None values for display_name, id, source.id OpenAlex routinely returns these keys with explicit null values, which bypassed the dict.get default and crashed downstream string operations (slicing, split). Same antipattern as the 'doi' fix in `b4f43f3e6`. Errors were causing whole search batches to fail with TypeError: 'NoneType' object is not subscriptable at line 222. * fix(journal-quality): handle MEDLINE name format + publisher suffixes PubMed serves journal names in MEDLINE format which OpenAlex doesn't match directly: - '[Original-language] English title' → strip leading bracket - 'Title : long subtitle' → fall back to the head segment - 'Title. 
Section name' → fall back to the head segment (>=6 chars) Also strip trailing publisher names (Elsevier, Springer, Wiley, etc.) that some engines glue onto the journal_ref. Was causing Molecular Therapy, Journal of Alzheimer's Disease, and ~6 other major biomed journals to be dropped as score-3 unknowns on PubMed searches. * feat(journal-quality): default threshold to 2 (predatory-only) Drop the default from 4 to 2 so the filter's out-of-the-box behavior is conservative: predatory journals are still auto-removed, but unknown/low-confidence venues (score 3) are kept. Users who want stricter filtering can raise the slider in Settings. Avoids the 'silently delete sources we don't have data on' problem that the threshold=4 default was causing on PubMed and arxiv searches. * docs(journal-quality): document threshold semantics + link to docs from dashboard - Update docs/journal-quality.md with new tier pipeline (Tier 3.5 + score-3 floor + Tier 4 off by default), bulk-dump source counts, and threshold table - Add 'Threshold setting' card to dashboard 'How It Works' tab - Link to docs/journal-quality.md from the dashboard help tab * feat(journal-quality): add threshold slider to dashboard help tab Live slider 1-10 with per-level explanations. Loads the current value from /settings/api/search.journal_reputation.threshold on first tab open and saves on change via PUT (debounced 300ms). * feat(journal-quality): hoist threshold slider to top of dashboard Compact slider widget below the data sources banner, always visible. Synchronized with the full slider in the How It Works tab so changing either updates both. Loads on page open instead of lazy-loading on tab switch. * feat(journal-quality): show global toast when threshold slider saves * feat(journal-quality): make Global Database the default tab Combines naturally with the threshold slider above — users can immediately see the score distribution they're filtering against. 
Your Research tab moved to second position and lazy-loads on switch. * feat(journal-quality): show direct dataset links on dashboard sources cards * fix(journal-quality): point DOAJ dataset link to docs page, not raw CSV * fix(journal-quality): use DOAJ FAQ for dataset link (public-data-dump 404) * fix(journal-quality): correct DOAJ dataset link to public-data-dump page * review(djpetti): address PR review comments - filter: drop @lru_cache on __clean_journal_name (DB cache covers it) - filter: fix __db_session docstring (returns None, never raises) - filter: restore long-form Tier 4 LLM prompt (avoid silent calibration regressions) - filter: add Tier 3.6 LLM name-cleanup salvage that retries OpenAlex with a canonicalised name (gated behind enable_llm_scoring opt-in) - filter: bump Tier 4 LLM scores by +1 when the journal has the DOAJ Seal - filter: persist quartile + DOAJ status in __save_journal_to_db so the dashboard and Tier 0 cache see the same metadata Tier 2 used - scoring: derive_quality_score now honours quartile directly (Q1→strong, +elite when h-index also tops the threshold) - model: add Journal.sjr_quartile column + Alembic 0008 migration - citation_normalizer: take over the canonical _extract_doi - openalex_enrichment: use project-level USER_AGENT constant - journal_quality dashboard: default to "Your Research" tab * review(djpetti): inject project User-Agent into safe_get/safe_post djpetti's openalex_enrichment.py:124 comment was specifically about "injected into safe_get", not just using the constant. Make safe_get, safe_post, and SafeSession.request auto-set User-Agent from the project-level USER_AGENT constant when the caller didn't supply one. Drops the manual override in openalex_enrichment except for the email polite-pool variant. 
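The User-Agent injection in the last commit above can be illustrated with a small helper. This is a sketch only: `USER_AGENT` is named in the commit but its value here is a placeholder, and `with_default_user_agent` is a hypothetical helper, not the signature of safe_get or safe_post.

```python
# Placeholder value; the project's real USER_AGENT constant is not shown
# in the commit message.
USER_AGENT = "example-project/0.0"


def with_default_user_agent(headers=None):
    """Return a copy of headers with User-Agent set if the caller omitted it."""
    merged = dict(headers or {})  # copy so the caller's dict is never mutated
    # HTTP header names are case-insensitive, so compare lowercased keys.
    if not any(key.lower() == "user-agent" for key in merged):
        merged["User-Agent"] = USER_AGENT
    return merged


caller_headers = {"Accept": "application/json"}
injected = with_default_user_agent(caller_headers)
explicit = with_default_user_agent({"user-agent": "custom/1.0"})
```

Copying before inserting means a headers dict shared between calls does not silently pick up the default.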
* review(round-2): six correctness fixes + dashboard quartile + tests Six confirmed bugs from the 25-agent merge-readiness review (tracked in plans/spicy-finding-wreath.md), all surgical and confined to files already touched by this PR: A. filter: stop losing the negative DOAJ signal journal_reputation_filter.py:778-779 (Tier 2) and 908-909 (Tier 4 DOAJ-Seal bonus) used `is_in_doaj=oa_doaj or None`. `False or None == None`, and __save_journal_to_db treats None as "don't update", leaving the column NULL after a Tier 2 hit even when OpenAlex told us the answer. The bug was not just observability — it broke `not is_in_doaj` (scoring.py:82, predatory branch), the predatory whitelist override (db.py:1024), and the dashboard trust icon. Tier 2 now passes the boolean directly; Tier 4 uses `True if seal_bonus else None` so the no-bonus case is silent instead of clobbering Tier 2 data with a guessed False. A2. journal_quality.db.reset_db() now holds _db_lock The /api/journal-data/download HTTP handler called reset_db() concurrently with searches in flight. Without the lock, a third thread calling get_db() could pass `if _db is None` while reset() was disposing, then short-circuit in _ensure_engine on the still- set _engine attribute and return a disposed pool. A3. __searxng_consecutive_failures is now per-thread The filter instance is cached and reused across concurrent searches by parallel_search_engine.py. The shared mutable counter was clobbered by Thread B's reset, defeating the fail-fast that's supposed to disable Tier 4 after 2 consecutive failures. Replaced with threading.local() + three private accessors so each thread gets its own counter, reset at the top of every filter_results(). A4. PNAS-class journals are now exempt from the conference heuristic "Proceedings of the National Academy of Sciences" matched the bare `proceedings` regex and was auto-classified as a Q3 conference, throwing away its real h-index ~1,400. Same for the Royal Society, AMS, LMS, etc. 
Added a `lower().lstrip().startswith("proceedings of ")` guard before the heuristic. A6. downloader.needs_update logic is no longer inverted The check was `installed_version is not None and != latest`, so it returned False when no data was installed at all — first-run users never saw the "download data" CTA. Changed `and` to `or`. The test_no_files test that was catching this now passes. B. __regex_clean_journal_name strips leading ordinal markers "12th International Conference on Machine Learning" now cleans to "International Conference on Machine Learning" — has a fighting chance of matching OpenAlex. Polish D. Surface sjr_quartile on the dashboard /api/journals/user-research and the per-research endpoints now include sjr_quartile on the journal row dict. The Your Research and Global Database tables both gain a Quartile column rendered as a colored chip (Q1=green, Q2=blue, Q3=yellow, Q4=orange) via a new getQuartileChip() helper. Quartile was the entire point of migration 0008 + the recent scoring work, and it had been computed and persisted but never displayed. Polish E. Promote "python-requests" literal to _DEFAULT_REQUESTS_UA_PREFIX constant in safe_requests.py so a future requests-library rename is a one-line edit. Test C. 30 new unit tests covering the 6 PR fixes - test_scoring.py: TestDeriveQualityScoreQuartile (13 tests) — Q1/Q2/Q3/Q4 mapping, case insensitivity, Q1 + elite h-index → 10, fall-through on unknown quartile, predatory override. - test_citation_normalizer.py: extended TestExtractDoi with 7 cases (external_ids / externalIds / lowercase / dx.doi.org / http / doi field priority / SSRF guard). - test_safe_requests.py: TestUserAgentInjection (6 tests) — auto- inject when missing, preserve explicit UA, case-insensitive header check, no caller-dict mutation, both safe_get and safe_post. 
- test_journal_reputation_coverage.py: TestTier4DoajSealBonus (3 tests — bumped, capped at 10, no-bump silent) and TestTier36LlmNameCleanup (2 tests — relabel hits OpenAlex on retry, relabel-then-miss falls through to Tier 4). 341 tests pass across the affected suite (was 273 before this commit). No new failures. * fix(tests): update migration head revision assertions to 0008 The migration chain now has 8 migrations (0001-0008). Tests that hardcoded "0005" as the expected head revision now correctly expect "0008". Also renamed test functions to be version-agnostic (test_head_revision_is_current instead of test_head_revision_is_0005). * test(security): add tests for 6 critical pre-commit security hooks Adds 74 tests verifying the security hooks enforce data protection: - test_deprecated_db_hook: Detects get_db_connection() and raw db_manager.get_session() that bypass per-user encrypted databases - test_ldr_db_hook: Detects shared DB references that would leak data - test_sensitive_logging_hook: Detects password/API key/token logging - test_env_vars_hook: Enforces SettingsManager for LDR_* env vars - test_journal_quality_readonly_hook: Enforces read-only DB access - test_silent_exceptions_hook: Detects silent except:pass patterns Test strings use dynamic construction to avoid triggering the very hooks they test (e.g., _DEPRECATED_DB = "ldr" + ".db"). * docs: fix module docstring to match actual scoring tiers * fix: move DB cache check from position 0 to before LLM tiers The DB cache only stores Tier 4 (LLM) results. Tiers 1-3 use bundled data that is instant and doesn't need caching. Moving the DB cache check to right before the LLM tiers avoids a needless DB query for journals that will be scored instantly by the bundled data tiers. 
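Fix A of the round-2 review above hinges on `False or None` evaluating to `None`. A minimal sketch of the failure mode, assuming a save routine that treats `None` as "don't update"; `save_flag` and the row dict are hypothetical stand-ins for __save_journal_to_db and the Journal row.

```python
def save_flag(row, is_in_doaj=None):
    """Update the flag only when a value is supplied (None means 'skip')."""
    if is_in_doaj is not None:
        row["is_in_doaj"] = is_in_doaj
    return row


oa_doaj = False  # upstream data says: definitely not in DOAJ

# Buggy call: `False or None` collapses to None, so the negative signal
# is dropped and the column stays NULL.
buggy = save_flag({"is_in_doaj": None}, is_in_doaj=oa_doaj or None)

# Fixed call: pass the boolean through unchanged.
fixed = save_flag({"is_in_doaj": None}, is_in_doaj=oa_doaj)
```

The `x or None` idiom is only safe when every falsy value of `x` really does mean "unknown".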
* fix: resolve CI test failures after merge from main - Fix _content_filters → _preview_filters in arxiv, openalex, and arxiv_coverage tests (engines moved journal filter to preview phase) - Restore migration test assertions from main (0005 not 0008) - Add citation_metadata to EXPECTED_TABLES in schema stability test - Wrap create_default settings read in try/except to prevent propagation when settings_snapshot raises (fixes S2 coverage test) * fix(security): prevent exception details from leaking to API responses CodeQL flagged that raw exception text (e.g. stack traces, internal paths) was flowing from download_journal_data's error message to the JSON API response at /api/journal-data/download. Two fixes: 1. Route handler: separate success/failure paths — on failure, return generic "Download failed" to user, log full details internally 2. Downloader: remove {e} from return message, use logger.exception instead (logs full traceback server-side without exposing to user) * refactor: deduplicate papers + add 50 tests (#3446) * refactor: deduplicate citation_metadata into papers + paper_appearances Replace the 1:1 citation_metadata table with a properly deduplicated schema: papers (unique per paper, deduped by DOI/arXiv/PMID waterfall) + paper_appearances (junction table linking papers to research resources). Fixes inflated paper counts in dashboard queries. Migration 0006 rewritten since it hasn't been released yet. 
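The DOI/arXiv/PMID waterfall used for paper deduplication above can be sketched over plain dicts. This is an illustration under assumptions: `find_existing` and the list-of-dicts store are hypothetical, and the real code queries the papers table rather than scanning a list.

```python
def find_existing(papers, doi=None, arxiv_id=None, pmid=None):
    """Return the first stored paper matching DOI, then arXiv ID, then PMID."""
    # Identifiers are tried in priority order; missing ones are skipped.
    for field, value in (("doi", doi), ("arxiv_id", arxiv_id), ("pmid", pmid)):
        if not value:
            continue
        for paper in papers:
            if paper.get(field) == value:
                return paper
    return None


# Made-up identifiers, for illustration only.
papers = [
    {"id": 1, "doi": "10.0000/example", "arxiv_id": None, "pmid": None},
    {"id": 2, "doi": None, "arxiv_id": "0000.00000", "pmid": None},
]

by_doi = find_existing(papers, doi="10.0000/example", arxiv_id="0000.00000")
by_arxiv = find_existing(papers, arxiv_id="0000.00000")
no_match = find_existing(papers, pmid="12345")
```

When a source carries several identifiers, the earliest one in the waterfall decides which stored paper it attaches to.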
* test: add 28 tests for journal filter tiers, scoring, and new fields - test_journal_filter_tiers.py: predatory auto-removal, whitelist override, OpenAlex/DOAJ tiers, dedup, fail-fast, stale cache, DB error safety net - test_scoring_edge_cases.py: negative h-index, invalid quartile, Q1+h=0, normalize_name edge cases, three-way priority - test_openalex_new_fields.py: source_id extraction, field forwarding, S2 venue→journal_ref mapping * refactor: slim Paper model to indexed columns + JSON metadata blob Out of 16 columns on Paper, only 4 are ever queried: doi, arxiv_id, pmid, journal_id. The other 12 were dead storage. Collapse them into a single paper_metadata JSON blob (hybrid relational-JSON pattern used by OpenAlex/Crossref). SQLCipher compatibility verified: JSON1 extension enabled by default, LDR already uses 34 JSON columns in encrypted DBs successfully. Python attribute `paper_metadata` maps to SQL column `metadata`, mirroring ResearchResource.resource_metadata pattern to avoid SQLAlchemy's reserved `metadata` attribute. - citation.py: 13 columns → 4 + 1 JSON blob - migration 0006: matching slim schema (unreleased, no data migration) - research_sources_service.py: splits fields into indexed vs metadata - _merge_identifiers: new signature (paper, indexed, metadata); merges missing keys into paper_metadata without overwriting All 309 tests pass including encrypted DB ORM tests. * fix: address Round 1+2 review findings on Paper schema slim 1. datetime.date JSON serialization: convert publication_date to ISO string in normalize_citation after _build_csl_json consumes it 2. _merge_identifiers SQLAlchemy dirty tracking: copy dict before mutating so reassignment is detected by plain JSON column 3. UNIQUE constraints on doi/arxiv_id/pmid to prevent concurrent duplicate writes; handle IntegrityError via rollback + refetch 4. container_title lookup chain: add container_title/container-title keys for CSL-style callers 5. 
Per-source exception logging: warning → exception for stack traces * fix: address Round 3 review findings on journal quality data flow Critical bugs: 1. Journal name case mismatch broke Paper.journal_id linking - research_sources_service.py: _resolve_journal_id used .lower() but the filter writes Journal.name in mixed case. Every Paper got journal_id=None silently. - Fix: use func.lower() on both sides for case-insensitive match 2. AttributeError crash when source["metadata"] is a non-dict - citation_normalizer.py: source.get("metadata", {}).get("journal") crashes when metadata is a string (default only applies when key is absent/None). Fix: explicit isinstance check before .get(). 3. Author dict passthrough allows non-JSON-serializable fields - citation_normalizer.py: engines like OpenAlex/S2 return author dicts with nested affiliation objects, ORCIDs, etc. that may not be JSON-safe. Whitelist only CSL name fields (family, given, suffix) when passing through existing CSL-format author dicts. 4. predatory_source missing from API response - metrics_routes.py: template reads j.predatory_source for the tooltip but the route didn't emit it. Added to both journal aggregation responses. * fix: address Round 4 review findings on transaction safety and JSON sanitization Critical bugs: 1. resource_metadata stores raw untrusted source dict - Engine result dicts can contain non-JSON-serializable values (nested objects, numpy types, affiliations, date objects). Raw embedding would crash json.dumps() at flush time and silently lose the source via the per-source except catch. - Fix: new _json_safe() recursive sanitizer coerces everything to JSON primitives before embedding in resource_metadata. 2. db_session.rollback() wiped entire batch, not just failed source - The IntegrityError retry path and per-source except used a full session rollback, which lost every previously flushed source in the same batch. Also left stale resource.id references that pointed to rolled-back rows. 
- Fix: wrap each source in db_session.begin_nested() savepoint. Per-source rollback only affects that source. Earlier successes stay persisted. IntegrityError retry restarts a new savepoint and recreates the ResearchResource cleanly. * test: add Paper dedup integration tests + harden _json_safe Round 5 additions: 1. tests/database/test_paper_dedup_integration.py — 3 integration tests using a real encrypted SQLCipher database: - Paper created with indexed columns + metadata blob - Same DOI deduped across two sources (1 Paper, 2 PaperAppearances) - Metadata blob survives JSON round-trip through SQLCipher 2. _json_safe hardening: depth limit (32) + id()-based cycle detection to prevent RecursionError on pathological input. * fix: harden DB session handling and ArXiv journal_ref forwarding (Round 3) - Wrap __save_journal_to_db in try/except to handle DB session failures gracefully (e.g., encrypted DB with wrong password). Score is still valid but won't be cached until next successful DB access. - Explicitly forward journal_ref in ArXiv _get_full_content to prevent fragile reliance on item.copy() preserving the field. * fix: preview_filters resource leak, DOAJ Seal scoring, close() warning (Round 4-5) Three fixes from code review rounds 4-5: 1. CRITICAL: BaseSearchEngine.close() now also closes _preview_filters. Previously only _content_filters were closed, but the journal filter is registered as a preview_filter — its SearXNG engine and LLM client were never released. 2. DOAJ Seal scoring: use max(h_index_score, doaj_score) instead of strict h_index priority. 5,882 DOAJ Seal journals with moderate h-index were penalized because h-index score (e.g., 7) overrode the Seal floor (8). The DOAJ Seal represents OA best practices compliance, an orthogonal quality signal that should reinforce, not conflict. 3. Suppress spurious close() warning when SearXNG is None (normal case when SearXNG is not configured). Pass allow_none=True to safe_close. 
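The _json_safe hardening described above (depth limit of 32 plus id()-based cycle detection) might be sketched like this. The function below is an assumption-laden illustration: `json_safe`, its placeholder string, and its coercion rules are hypothetical and may differ from the project's sanitizer.

```python
import datetime
import json

MAX_DEPTH = 32  # depth limit quoted in the commit message


def json_safe(value, _depth=0, _seen=None):
    """Recursively coerce a value into JSON-serializable primitives."""
    if _seen is None:
        _seen = set()
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, datetime.date):  # also covers datetime.datetime
        return value.isoformat()
    if isinstance(value, (dict, list, tuple, set)):
        if _depth >= MAX_DEPTH:
            return "<max depth reached>"  # assumed placeholder
        if id(value) in _seen:
            return None  # container already on the current path: a cycle
        _seen.add(id(value))
        try:
            if isinstance(value, dict):
                return {str(k): json_safe(v, _depth + 1, _seen)
                        for k, v in value.items()}
            return [json_safe(v, _depth + 1, _seen) for v in value]
        finally:
            _seen.discard(id(value))
    return str(value)  # last resort for unknown objects


cyclic = [1]
cyclic.append(cyclic)
encoded = json.dumps(json_safe(cyclic))
dated = json_safe({"published": datetime.date(2026, 1, 2)})
```

Tracking only the containers on the current path lets the same object appear twice in sibling positions without being mistaken for a cycle.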
* fix: S2 publicationVenue, NASA ADS ArXiv preprints, test gaps 1. S2: request publicationVenue (structured, with ISSN) from API 2. NASA ADS: set journal_ref=None for ArXiv preprints (is_arxiv=True) 3. Fix vacuous test_doaj_with_seal assertion (was always true) 4. Add fail-fast behavioral test (verify Tier 4 skipped after 2 failures) 5. Clarify pyproject.toml setuptools sections * fix: Round 4 review findings Critical: - build_db now writes to tmp path and uses os.replace() for atomicity. Prevents corrupt DB on disk if build crashes mid-way. Scoring correctness: - Tier 3.6 (LLM cleanup → OpenAlex retry) now passes quartile to derive_quality_score. Previously Q1 journals found via this tier scored 8 instead of 10. Consistency: - PubMed journal_ref now uses None (not '') for missing journals, matching all other engines. - NASA ADS, OpenAlex, Semantic Scholar _get_full_content now forward all quality-relevant metadata fields (doi, affiliations, citations) to final results for downstream consumers. * fix: Round 5 review — scoring correctness and data source safety Scoring (scoring.py): - Apply DOAJ Seal floor in quartile branch via max() so Q4+Seal returns 8 instead of 5. Previously the Seal signal was silently discarded when quartile was present. - Treat negative h-index as no signal (return None for fall-through) instead of JOURNAL_QUALITY_DEFAULT=4. Consistent with h_index=0/None. DB build (db.py): - Recompute `quality` column after quartile assignment, so the stored quality agrees with the live-filter score. Data source safety: - OpenAlex: refuse to overwrite if fetched < 10K records. - JabRef: refuse to overwrite if fetched < 100 abbreviations. * fix: Round 6 review — concurrency, pool, and edge cases DB engine pool: - Use StaticPool for immutable=1 SQLite (was default QueuePool/15 conn). - Acquire lock before reading _engine to remove DCLP hazard. Downloader: - Atomic O_CREAT|O_EXCL sentinel instead of exists()+touch() race. 
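The atomic sentinel in the Downloader item above replaces a check-then-create race with a single system call. A minimal sketch, assuming a lock-file style sentinel; `try_acquire_sentinel` and the file name are hypothetical.

```python
import os
import tempfile


def try_acquire_sentinel(path):
    """Create the sentinel file atomically; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file exists, so the
        # existence check and the creation happen in one step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as tmp:
    sentinel = os.path.join(tmp, "download.lock")  # hypothetical file name
    first = try_acquire_sentinel(sentinel)   # creates the file
    second = try_acquire_sentinel(sentinel)  # loses the race
```

With exists() followed by touch(), two processes can both observe "missing" and both proceed; here only one caller can ever get True.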
Filter: - Strip whitespace journal_ref; ' ' no longer bypasses the guard. - Handle clean_name == '' as no-venue instead of degenerate key. - Predatory removal log includes original journal_ref, cleaned name, URL. * fix: Round 7 review — caching, error visibility, SSRF hardening - Tier 3.6 now saves to DB so future queries skip LLM cleanup step - __save_journal_to_db warning passes exc_info=True for debuggability - OpenAlex manifest URLs validated against expected s3://openalex/ prefix * fix(journal-quality): atomic rename, engine reset on error, LIKE escape - build_db writes to a tmp path and os.replace()s at the end so a crash mid-build or a concurrent Windows reader (unlink-on-open fails on Windows) can no longer leave a corrupt file that blocks every subsequent query. - _ensure_engine validates PRAGMA user_version and integrity before wiring the RO engine so stale-schema or corrupt files get rebuilt at open time instead of erroring at first query. - session() drops the cached engine on OperationalError/DatabaseError so a transient corruption no longer wedges the process. - get_journals_page / get_institutions_page escape LIKE metachars and cap search length to close an authenticated CPU-DoS surface. - Startup sweep clears stale journal_quality.db.tmp-* files left by prior crashed builds. - Corrects stale entry in custom-checks raw-SQL allowlist (this file was renamed since the allowlist was written). * fix(db): enable PRAGMA foreign_keys = ON on every connection SQLite defaults foreign_keys to OFF, which meant every ondelete=CASCADE and ondelete=SET NULL declared on an FK was inert. Bulk Query.delete() calls — which bypass ORM cascade — then silently orphaned child rows, and Paper.journal_id would not NULL out when a Journal was deleted. Wiring the pragma into apply_performance_pragmas (which is already registered via event.listen(engine, "connect")) makes every pooled connection honor DDL-level cascade. 
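The effect of the foreign_keys pragma described above can be reproduced in isolation. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy's `event.listen(engine, "connect")` hook, and the two-table schema and the function name are hypothetical stand-ins, not the project's models.

```python
import sqlite3


def journal_id_after_delete(enforce_fks: bool):
    """Delete a journal and report what its paper's journal_id became."""
    conn = sqlite3.connect(":memory:")
    # The pragma is per-connection, so it has to be issued on every
    # new connection before any statements rely on it.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fks else 'OFF'}")
    conn.executescript(
        """
        CREATE TABLE journals (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE papers (
            id INTEGER PRIMARY KEY,
            journal_id INTEGER REFERENCES journals(id) ON DELETE SET NULL
        );
        INSERT INTO journals (id, name) VALUES (1, 'Example Journal');
        INSERT INTO papers (id, journal_id) VALUES (1, 1);
        """
    )
    conn.execute("DELETE FROM journals WHERE id = 1")
    conn.commit()
    (journal_id,) = conn.execute(
        "SELECT journal_id FROM papers WHERE id = 1"
    ).fetchone()
    conn.close()
    return journal_id


print(journal_id_after_delete(enforce_fks=False))  # paper still points at id 1
print(journal_id_after_delete(enforce_fks=True))   # ON DELETE SET NULL applied
```

With enforcement off, the paper row keeps a journal_id that refers to a deleted journal; with it on, SQLite nulls the column as the DDL declares.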
* fix(migrations): 0007 index guard, remove redundant Paper indexes, add 0009 - 0007 now gates index creation on index existence (via inspector) instead of on whether the column was added this run. A DB where the columns already existed from a prior partial upgrade or from ORM create_all will now get the named indexes. - 0007 docstring header had stale revision IDs from a copy-paste. - Drop the redundant explicit Index() entries and index=True on Paper's doi/arxiv_id/pmid and PaperAppearance.resource_id — these columns already carry UNIQUE, which creates a backing index. - New migration 0009 backfills journal indexes that the old 0007 guard skipped, adds ix_research_resources_research_id (previously unindexed FK forced a full scan on every research-detail join), and adds the journals.name_lower column + index that _resolve_journal_id needs to avoid func.lower() expression scans. * perf(journals): name_lower column, indexed research_id, load_only on dedup - Journal gains a name_lower column, populated on every write by the reputation filter and used by _resolve_journal_id for an indexed equality lookup instead of func.lower(Journal.name), which defeats the name index. - research_resources.research_id declared with index=True so every research-detail join uses the index instead of a full scan. The matching migration that creates it on existing DBs is 0009. - _find_existing_paper applies load_only(id, doi, arxiv_id, pmid, journal_id) to the three dedup lookups so they no longer fetch the paper_metadata JSON blob (which can be multi-KB) just to check an identifier match. * fix(tests): bump head revision asserts + relax llm_utils header check - test_migration_0005_resource_document_id.py asserted the full-chain head is still "0005", which broke as soon as 0006/0007/0008 landed (now 0009). Bump the three full-chain asserts to "0009" and keep the targeted upgrade-to-0005 asserts at "0005" since those call _run_upgrade_to(..., "0005") explicitly. 
Also rename the two head-revision tests to match. - test_uses_auth_headers mocked requests.get and asserted an exact header dict, but safe_get wraps requests and injects a project User-Agent. Check that the Authorization header survives instead of doing a full dict equality. - Relax _validate_existing_db: PRAGMA user_version = 0 is the pre-stamping default, so treat it as grandfathered-in rather than triggering a rebuild. Only non-zero, non-current values force a rebuild. This keeps CI environments with pre-built DBs working. * ci: retrigger after Round 7 fixes * fix: Round 8 review — data source safety, DB validation, error visibility db.py: - Remove duplicate safe_close() in _validate_existing_db schema-mismatch branch. The finally block already handles closing; the extra call produced a spurious "Cannot operate on a closed database" warning on every schema-triggered rebuild. - Move reset_db() to before os.replace() so no new engine can latch onto the file mid-swap and then get disposed out from under an in-flight query. doaj.py: - Add _MIN_DOAJ_JOURNALS=5,000 floor. Prevents overwriting good data with {} if DOAJ CSV schema changes upstream (column rename breaks ISSN lookups, parser silently produces zero entries). institutions.py: - Add _ALLOWED_PREFIX="s3://openalex/" manifest validation loop matching openalex.py — defense-in-depth SSRF block. - Add _MIN_INSTITUTIONS=50,000 floor (snapshot has ~120K). jabref.py: - logger.warning → logger.exception for per-file fetch failures so tracebacks are preserved. Operators diagnosing partial fetches need the exception type, not just the filename. StaticPool kept as-is — the tradeoff (immutable=1 + single conn vs QueuePool overhead) was settled in prior rounds; reviewer's concern was theoretical and hasn't materialized. * fix: CI failures — raw SQL allowlist + filter test data-download stub Two concrete CI fixes after investigating the PR 3081 pytest failures: 1. 
test_no_raw_sql was flagging journal_quality/db.py line 207 for `conn.execute("PRAGMA user_version")`. This is a legitimate read- only schema-version check (cheap, no SQLAlchemy overhead, matches the pattern already skipped for database/initialize.py). Added journal_quality/db.py to the skip list. 2. Many filter unit tests were timing out at 60s in CI because they hit the real data-download path on a fresh container. Trace: filter_results → __clean_journal_name → expand_abbreviation → _ensure_engine → _build_or_raise → ensure_journal_data → download_journal_data (OpenAlex + DOAJ + JabRef fetch). Added tests/advanced_search_system/filters/conftest.py with an autouse fixture that stubs _build_or_raise to raise FileNotFound. expand_abbreviation already catches that and returns None, so the filter falls through to its own scoring tiers without touching the network. Tests run in 5.5s locally (was passing because my local DB is built). * fix(tests): use ResearchHistory UUID for ResearchLog FK test_research_logs was inserting Integer research.id into ResearchLog.research_id, which is String(36) FK at research_history.id (UUID). Previously latent because SQLite FK enforcement was off; commit `5078c867e` turned PRAGMA foreign_keys = ON on every connection, exposing the pre-existing mismatch. Production log_utils already writes UUIDs, so the FK is correct — the test was wrong. * fix(migrations): timezone-aware DateTime in 0006 + extend hook to scan migrations Migration 0006_add_citation_metadata declared three sa.DateTime() columns without timezone=True, contradicting the ORM (citation.py uses UtcDateTime). Add timezone=True to the three columns (papers.created_at, papers.updated_at, paper_appearances.created_at). The check-datetime-timezone pre-commit hook missed this because its path filter only scanned src/.../database/models/. 
Extend the path filter to include database/migrations/versions/, and teach the AST walker to also recognise sa.Column()/sa.DateTime() (attribute-style) — not just the bare Column()/DateTime() form used in ORM models — and accept sa.DateTime(timezone=True) as valid for migration files. * fix(citations): support old-format arXiv IDs in URL extraction The regex r"arxiv\.org/abs/(\d+\.\d+)" only matched new-format IDs (YYMM.NNNN). Pre-2007 papers with identifiers like cond-mat/0501001, math.AG/0601001, and hep-th/9802150 silently returned None. New regex accepts: - Old-style archive(.SubjectClass)?/YYMMNNN (with optional uppercase subject class like math.AG); archive can contain hyphens like cond-mat / hep-th - New-style YYMM.NNNN or YYMM.NNNNN (5-digit seq from 2015) - Optional vN version suffix (2501.12345v2) Also adds 5 new tests in TestExtractArxivId covering all three old-format variants plus version suffix and 5-digit sequence. * fix(journal_quality): surface build_db failure to downloader caller Previously download_journal_data swallowed any build_db() exception with a log-and-continue, then returned (True, "Fetched ...") as if everything worked. The dashboard saw a green success toast even when no DB was built. Capture the exception and return (False, msg) carrying the reason, while preserving the "lazy-build on next access" design — the runtime accessor still rebuilds from the downloaded .gz files on next access if the DB is absent. The existing callers (ensure_journal_data, metrics_routes.py) already pivot correctly on the bool, so this only flips a misleading green to an honest red. Tests: - test_successful_fetch now patches build_db to a no-op so the happy-path assertion is deterministic regardless of whether the minimal fixture is buildable end-to-end. - Adds test_build_db_failure_returns_false covering the new (False, msg) contract. 
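The three arXiv ID shapes listed in the citations fix above (old-style archive/YYMMNNN with optional subject class, new-style YYMM.NNNN or YYMM.NNNNN, optional vN suffix) could be matched along these lines. The pattern below is an independent sketch, not the regex that landed in the commit; the function name and the choice to drop the version suffix from the returned ID are assumptions.

```python
import re

# Hypothetical pattern covering the ID shapes described in the commit message.
ARXIV_URL_RE = re.compile(
    r"(?i:arxiv\.org)/abs/"
    r"("
    r"[a-z][a-z-]*(?:\.[A-Z]{2})?/\d{7}"  # old style: hep-th/9802150, math.AG/0601001
    r"|"
    r"\d{4}\.\d{4,5}"                     # new style: 0704.0001, 2501.12345
    r")"
    r"(v\d+)?"                            # optional version suffix, e.g. v2
)


def extract_arxiv_id(url: str):
    """Return the arXiv ID without its version suffix, or None if absent."""
    match = ARXIV_URL_RE.search(url)
    return match.group(1) if match else None


for url in (
    "https://arxiv.org/abs/cond-mat/0501001",
    "https://arxiv.org/abs/math.AG/0601001",
    "https://arxiv.org/abs/hep-th/9802150",
    "https://arxiv.org/abs/2501.12345v2",
):
    print(url, "->", extract_arxiv_id(url))
```

Putting the old-style alternative first is harmless because it must start with a letter, so purely numeric new-style IDs fall through to the second branch.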
* docs(journal-quality): clarify score scale is non-contiguous The docs and settings description previously advertised a "1-10 scale" and referenced score 3 ("Unknown") in the threshold table, but the code only emits {1, 4, 5, 6, 7, 8, 10}. Values 2, 3, and 9 are never assigned (the default/unknown case emits 4, not 3). - Fix the opening scale claim to note the non-contiguous emission. - Replace the "Score 3 = Unknown" row with "Score 4 = Default" so the table matches constants.py (JOURNAL_QUALITY_DEFAULT=4). - Correct the threshold table: thresholds 3 and 4 now behave the same as 2 (since 2 and 3 aren't emitted scores), and raising to 5 is what starts dropping default/unknown venues. - Update default_settings.json description and regenerate golden master to match. * fix(journal-quality): remove score-3 references (score 3 is never emitted) Scoring pipeline emits {1, 4, 5, 6, 7, 8, 10}; value 3 is reserved but never returned. Completes the cleanup begun in `0fe435bfc`, which fixed the table and settings description but left three residuals: - search_utilities.py::_format_quality_tag — the `>= 3` branch was unreachable for score 3 but caught score 4 (JOURNAL_QUALITY_DEFAULT), silently rendering unknown/default venues as [Q3 ★★]. Give score 4 a dedicated [Unranked ★] label so Q-tier labels stay truthful to SCImago quartile semantics. - docs/journal-quality.md step 7 "Score 3 floor" — the code actually returns None on no-signal. Rewrite as "No-signal pass-through". - journal_quality.html threshold descriptions — thresholds 3 and 4 both behave identically to threshold 2 (no emitted score falls in the 2–4 gap); score 4 only starts being dropped at threshold 5. Corrected both the HTML list and the JS threshold-detail map. Tests updated: test_default_unknown_tier asserts [Unranked ★] for score 4; test_score_boundary_5_is_q2_not_unranked pins the boundary. 
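The threshold behaviour described in the two commits above follows directly from the emitted score set. A small worked example, assuming the filter keeps a result when its score is greater than or equal to the threshold (inferred from the commit text, not confirmed against the filter code); `surviving_scores` is a hypothetical helper:

```python
# Scores the pipeline actually emits, per the commit message.
EMITTED_SCORES = frozenset({1, 4, 5, 6, 7, 8, 10})


def surviving_scores(threshold: int):
    """Emitted scores that pass a given threshold (assumes score >= threshold)."""
    return sorted(score for score in EMITTED_SCORES if score >= threshold)


for threshold in (2, 3, 4, 5):
    print(threshold, surviving_scores(threshold))
```

Thresholds 2, 3 and 4 keep the same set because no emitted score lies strictly between 1 and 4; only at threshold 5 does the default score 4 start to be dropped.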
* fix(journal-quality): simplify Tier 0 cache to LLM-only and fix 9 correctness bugs (#3510) * feat(journal-quality): fix cache bugs and simplify to LLM-only Stacked on PR #3081. Review of #3081 surfaced 10 issues in the journal quality system. The dominant bug: the Tier 0 cache read predicate filters on `score_source == "llm"`, so Tier 2 (OpenAlex) and Tier 3 (DOAJ) scores were written to the user DB but never read back. This PR scopes the cache to LLM-only (per user direction: "we don't even need to cache [Tier 2/3]") and fixes the remaining 4 functional bugs. Bugs fixed: * Tier 0 cache broken for Tier 2/3 → drop Tier 2/3/3.6 write-back; keep Tier 4 LLM cache; migration 0010 drops 16 cache-only columns. * Paper dedup waterfall → single OR query; logs warning on conflict. * ISSN dashes not normalized → new normalize_issn() in citation_normalizer, applied at both reference-DB lookup and ingestion (openalex, doaj). * Migration 0009 SQL backfill wrong for diacritics → Python name.lower() batch loop matches runtime insert path exactly. * LLM out-of-set scores silently accepted → raise ValueError; existing failure counter + circuit breaker surface prompt drift. * quality_model not in cache predicate → add get_model_identifier helper and filter on it so cache invalidates across LLM upgrades. * Journal upsert race → savepoint + IntegrityError + refetch pattern mirroring the Paper upsert. * Cache-read validates cached quality ∈ VALID_QUALITY_SCORES; evicts pre-fix 2/3/9 values. * OpenAlex JSON parse now try/except + malformed-line counter; existing MIN_OPENALEX_SOURCES floor still aborts catastrophic failures. * Per-user metrics dashboard rewritten to join user Journal with the reference DB by name for display bibliometric fields. Schema: migration 0010 drops 16 bibliometric columns from journals (h_index, sjr_quartile, is_predatory, …); keeps name, name_lower, quality, score_source, quality_model, quality_analysis_time. 
Tests: 298 tests green across filters, citation_normalizer, llm_utils, paper dedup. Existing cached-quality test updated for new predicate chain; LLM clamp test now asserts ValueError instead of silent clamp. * fix(journal-quality): bundle migration 0010 drops into single batch + docs Bundle all 19 ops (3 index drops + 16 column drops on upgrade, 16 column adds + 3 index creates on downgrade) into a single `batch_alter_table` block each. SQLite has no in-place ALTER DROP COLUMN, so alembic's batch mode recreates the whole table per block — the previous per-op loop paid that cost 19 times. Bundling also makes each direction atomic: an error mid-batch rolls back cleanly, eliminating partial-schema states the per-op version could leave behind. Also update docs/journal-quality.md to reflect the LLM-only cache scope: the old docs claimed "Tier 0 — Database Cache: Instant lookup from previous scoring. Journals are scored once and cached." which describes the pre-fix behavior. The new description positions Tier 0 between 3.5 and 3.6 (where it actually fires) and explains that only Tier 4 results are persisted — reference-DB lookups for Tiers 1–3.5 are already instant and get re-checked every query. No behavior change beyond the migration perf win. * fix(journal-quality): address 100-agent review feedback P1 — predatory_blocked global count: The Tier 0 cache rewrite in /api/journals/user-research turned `predatory_blocked` from a global count across all user journals into an in-page count (top 200). AI code reviewer and R10-4 both flagged this as a semantic regression — summary stats are expected to be global, matching `total_journals` which is still global. Fix: add `JournalQualityDB.count_predatory_by_names(names)` helper that issues one `WHERE name_lower IN (…) AND is_predatory = TRUE` query, call it with ALL user journal names from `/api/journals/user-research`. The per-research endpoint is already correctly scoped to the research (no 200-limit) and is left unchanged. 
P2 — Journal schema stability test: R1-3 and R9-10 both flagged that tests/database/test_schema_stability.py verifies table names but not column-level shape. Migration 0010 deliberately trims Journal to 7 columns; an accidental model addition without a matching migration would slip through silently. Added TestCriticalColumns.test_journal_has_exact_column_set asserting the exact column set {id, name, name_lower, quality, score_source, quality_model, quality_analysis_time}. P3 — polish: - Add `# noqa: silent-exception` + explanatory comments to `_ref_db_lookup` and `_get_ref_db_or_none` (project convention for best-effort broad catches). - Update `logs.py` module docstring to explain Journal's LLM-only cache scope after migration 0010. - Clarify `quality_analysis_time` column comment is "Unix seconds (not ms)" and rationale for Integer (vs UtcDateTime) typing. - Add `__all__` declarations to `utilities/citation_normalizer.py` and `utilities/llm_utils.py` codifying the public API surface. No behavior change beyond P1. 305 tests still green across filters, citation_normalizer, llm_utils, paper dedup, schema stability; 54 metrics route tests still green. * fix(journal-quality): prod-ready polish for PR #3081 — migration squash + ops hardening (#3513) * feat(journal-quality): clearer log milestones around first-run DB build The "Building X ..." message is too terse — on a fresh install the ~30s download + insert looks like a hung process. Expand the start message to mention the one-time nature + the download size, and include the source count in the completion log so the server log tells operators when the DB is ready to serve scoring. Addresses the UX gap previously considered a blocker: users already see the server log, so a milestone log line is enough (no UI progress event needed). 
* fix(journal-quality): set Windows readonly attribute after chmod chmod 0o444 is a no-op on Windows — the compiled journal-quality reference DB stays writable on Windows installs, violating the read-only invariant. Combine the POSIX chmod with a best-effort SetFileAttributesW(FILE_ATTRIBUTE_READONLY) on win32. Log a warning if SetFileAttributesW fails; the check-journal-quality-readonly.py pre-commit hook still enforces read-only opens in consumer code. * feat(journal-quality): pre-check free disk space before bulk download The five journal-quality data sources uncompress to ~1 GB of intermediate working set plus the compiled reference DB. On a small-disk machine, a mid-stream failure can leave an orphan .tmp-* file that blocks the next build. Fail fast with a clear "X.X GB available, 2 GB required" message before touching the sentinel or the network. Threshold is exposed as JOURNAL_QUALITY_MIN_FREE_DISK_BYTES in constants.py so ops can tune it if needed. OSError from shutil.disk_usage is non-fatal (logged, build proceeds) — don't block a download just because disk stats are unavailable. * security(journal-quality): stop leaking exception text into HTTP path CodeQL alerts 7650 and 7684 flagged that str(exc) from a build_db failure in download_journal_data() flows into the tuple's message string, and from there through to the /api/journal-data/download response. SQLAlchemy errors embed SQL statements and file paths — sanitize at the source by returning only the exception class name. Full traceback remains in logger.exception (server-side only). Add tests/journal_quality/test_downloader_exception_sanitization.py asserting that a simulated build_db error whose message contains stack-trace-shaped substrings never reaches the caller. 
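The free-disk pre-check described above can be sketched as below. The constant value, the function name, the `(ok, message)` return shape, and the use of binary gigabytes are assumptions for illustration; only `shutil.disk_usage`, the 2 GB figure, and the "OSError is non-fatal" rule come from the commit message.

```python
import logging
import shutil

logger = logging.getLogger(__name__)

# Assumed stand-in for JOURNAL_QUALITY_MIN_FREE_DISK_BYTES (2 GB, binary units).
MIN_FREE_DISK_BYTES = 2 * 1024**3


def check_free_disk(path, min_free=MIN_FREE_DISK_BYTES, disk_usage=shutil.disk_usage):
    """Return (ok, message); fail fast when too little disk space is free."""
    try:
        free = disk_usage(path).free
    except OSError:
        # Disk stats unavailable: log and let the build proceed.
        logger.warning("Could not read disk usage for %s; skipping check", path)
        return True, ""
    if free < min_free:
        gib = 1024**3
        return False, (
            f"{free / gib:.1f} GB available, {min_free / gib:.0f} GB required"
        )
    return True, ""


ok, message = check_free_disk(".")
print(ok, message)
```

The `disk_usage` parameter exists only so the check can be exercised with fake numbers in a test without filling a real disk.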
* feat(safe-requests): add safe_get_with_retries and wire into journal-quality downloads Bulk journal-data downloads currently abort on the first transient network failure: a packet drop or short AWS S3 hiccup forces the user to restart from scratch. Add a safe_get_with_retries wrapper with exponential backoff (1/2/4s, 3 attempts by default), retrying on ConnectionError, Timeout, HTTP 429, and HTTP 5xx. Honors the Retry-After header when present. SSRF ValueErrors and non-429 4xx responses are passed through unchanged. The five journal-quality data sources (OpenAlex, DOAJ, predatory, JabRef, institutions) now import the retry wrapper instead of the bare safe_get. Call sites are unchanged beyond the import alias. * feat(journal-quality): detect OpenAlex field-level schema drift OpenAlex occasionally renames snapshot fields (the Works schema has seen h-index and ref-count migrations in the last year). The existing row-count floor catches a collapsed fetch but cannot tell the difference between "212K journals with h_index correctly populated" and "212K journals all silently None because the field was renamed". Sample the first 100 parsed rows after the parse loop and refuse to overwrite the snapshot if every one of them has h_index == None or every one has cited_by_count == None. Raise a new SchemaDriftError so operators can grep for it in logs and the CI release-gate job can fail fast on upstream breakage. * fix(migrations): squash the journal-model churn in 0007 + keep 0008/0010 as stubs The pre-squash chain had 0007 add 17 bibliometric / trust-signal columns + 3 indexes to the per-user journals table, 0008 add a sjr_quartile column + index, and 0010 drop all of 0007/0008's additions except three. On SQLite every batch_alter_table is a full-table rebuild, so every live user pays for TWO back-to-back rebuilds on the journals table within a single release for no net schema gain. 
New shape: 0007 adds only the columns the final form keeps — name_lower, score_source, quality_model — plus their indexes and the name_lower Python-side backfill (moved from 0009, because a Unicode-correct backfill belongs with the column that needs it). Downgrade drops the three it added. 0008 and 0010 remain as no-op stubs. A user whose alembic_version row reads "0008" or "0010" from a prior upgrade still needs a revision to walk through; deleting the files would strand them. Stubs are cheap, one return statement each, and keep the chain contiguous without forcing anyone to rewrite history. 0009 is simplified to its one remaining unique responsibility (ix_research_resources_research_id); the journals.name_lower work it used to duplicate now lives in the squashed 0007. Verified end-to-end against 206 existing migration + schema tests (including the full chain's up/down/up stairway per revision) and four new squash-specific regressions in tests/database/test_journal_migration_squash.py: - chain reaches head 0010 with the 7-column final shape - name_lower backfill handles diacritics (Café → café) - re-running run_migrations is idempotent - squashed 0007 is a no-op on a DB already stamped at 0010 * fix(safe-requests): cap Retry-After + parse HTTP-date form A hostile or misconfigured upstream returning a large `Retry-After` integer can pin a Flask worker via `time.sleep()` — the call chain from `/api/journal-data/download` to `safe_get_with_retries` is fully synchronous. Cap at 300 s and extend the parser to the RFC 7231 HTTP-date form (previously the `ValueError` from `int()` was silently swallowed). Negative values clamp to 0 to avoid `time.sleep(-5)`, which CPython rejects. Also drops dead `last_response` bookkeeping from the retry loop — the path that referenced it was removed two commits back. tests/security: add four retry tests — cap enforced, HTTP-date parsed, unparseable falls back to schedule, negative clamps. 
tests/database: replace the squash-scenario test with one that actually creates the pre-squash 17-column journals shape via `ALTER TABLE`, stamps at `0006` so 0007 runs (including the `name_lower` backfill), walks to head, and verifies both column preservation and the diacritic backfill. The prior test only proved Alembic's built-in "don't re-run at head" guarantee; its docstring is tightened to match. * chore(pr-feedback): document orphan-column intent + log skipped drift check Follow-up to the Friendly AI Reviewer pass on #3513. Two substantive nits addressed, three stylistic ones deferred (see /plans in review thread for the full breakdown). tests/database: the pre-squash walk test asserts `"issn" in cols` as a success condition. Without context, that reads as "orphan columns are fine" rather than "orphan columns are the intended trade-off of the stub-based squash". Expand the docstring and the inline comment so future maintainers don't misread the intent. journal_quality: the schema-drift check is a no-op when the parsed sample has < _SCHEMA_SAMPLE_SIZE entries (a branch that only fires on truncated test snapshots or aggressive parse filters — the 10k-row floor above catches a collapsed fetch). Previously silent; now logs at debug so operators can see it was bypassed. * chore(pr-feedback): surface orphan-column trade-off in migration docstring Second AI-reviewer pass asked for the orphan-column note to live in the migration docstring (where maintainers look first during a schema-change investigation), not just the regression test. Copy the trade-off rationale into 0007's header. Also promote the "schema-drift check skipped" log from debug to info — debug-level messages are typically filtered out in production log configs, which defeats the observability goal of the branch. The skip is rare (OpenAlex ships ~280K sources; the `<100` sample only arises from truncated test snapshots or aggressive parse filters), so info-level noise is negligible. 
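The Retry-After handling from the fix(safe-requests) commit in this PR (300 s cap, RFC 7231 HTTP-date form, negative values clamped to 0, unparseable values falling back to the backoff schedule) might look roughly like this. The function name and the convention of returning `None` for "use the normal schedule" are assumptions, not the project's API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_RETRY_AFTER_SECONDS = 300.0  # cap named in the commit message


def parse_retry_after(value, now=None):
    """Return seconds to wait, or None when the header is unusable."""
    if value is None:
        return None
    value = value.strip()
    try:
        # Delta-seconds form, e.g. "120".
        seconds = float(int(value))
    except ValueError:
        try:
            # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None  # caller falls back to its backoff schedule
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    # Clamp negatives to 0 and cap large values so a worker is never pinned.
    return min(max(seconds, 0.0), MAX_RETRY_AFTER_SECONDS)


print(parse_retry_after("120"))
print(parse_retry_after("99999"))
print(parse_retry_after("-5"))
```

Both `TypeError` and `ValueError` are caught around `parsedate_to_datetime` because the exception raised for an invalid date differs between Python versions.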
* refactor(journal-quality): cleanup + preventative security (stacked on #3513) (#3514) * refactor(journal-quality): lookup_institution returns full-name keys The on-disk JSON snapshot uses one-character keys (n, c, t, h, if, w, cb, r) to save bytes across ~200K institutions. That's fine on-disk but a bad Python API — callers have to memorize the mapping, and a future schema change breaks every caller silently. _institution_to_dict now returns full names (name, country, type, h_index, impact_factor, works_count, cited_by_count, ror_id). The snapshot-reading code in _populate_institutions keeps the compact keys — only the public accessor changes. Grep confirms zero live callers today (only a comment mention in search_engine_openalex.py), so no migration needed. * refactor(journal-quality): extract _openalex_common for shared S3 helpers openalex.py and institutions.py duplicated three symbols: _OPENALEX_S3_BASE, the `s3://openalex/` manifest prefix check, and the s3_to_https translator. djpetti flagged this in PR #3081 review. Move them to data_sources/_openalex_common.py (stdlib-only, no circular imports) and import from both data-source modules. The on-disk compact key format and manifest fetch URLs stay where they are; only the duplicated helpers move. * test(safe-requests): cover redirect-hop SSRF validation + DNS rebinding safe_requests.py has always validated every redirect hop against the SSRF allowlist (lines 208–250), but the existing test suite only exercised the initial request. 
These five new tests drive the redirect loop itself: - redirect target is a private IP → blocked - redirect target is AWS metadata (169.254.169.254) → blocked - redirect loop exceeds 10 hops → raises ValueError("Too many") - DNS-rebinding case (first hop validates, redirect validates false for the same hostname) → blocked on the second hop - a legitimate redirect from one public URL to another is followed * feat(search-utilities): HTML-safe variant of the journal quality tag _format_quality_tag emits plaintext like "[Q1 ★★★★★]" which is fine when the caller renders the containing string as Markdown or plain text. Today every caller does that, so there's no live XSS. But the tag is typically embedded alongside a search-result title that came from an external search engine, and the first HTML-rendered consumer that does {{ title + quality_tag | safe }} or equivalent would leak any tags in the title. Add _format_quality_tag_html(quality, *, title) that html.escape's the title (angle brackets, ampersands, quotes) and appends the plaintext tag. Existing callers are unchanged — this is the safe variant any future HTML-rendered caller should reach for. The existing helper gets a docstring warning so reviewers of future PRs know which variant is appropriate. * test(db): migrations 0006-0010 on a SQLCipher-encrypted DB The existing test_encrypted_database_orm.py exercises ORM CRUD over an encrypted DB but never explicitly walks the new journal-quality chain. This test creates a fresh keyed DB via DatabaseManager (which runs the full migration chain as part of create_user_database), inserts a Journal row with every kept column, closes the engine, reopens with the same key, and reads the row back. The second test asserts the final journals column set (id, name, name_lower, quality, score_source, quality_model, quality_analysis_time) is exactly what test_schema_stability expects. 
Guards against SQLCipher key-ordering regressions where a future change to sqlcipher_utils would let batch_alter_table's rebuild path see a non-keyed connection. * test(db): data preservation across journals-table rebuild Adding name_lower + its index in the squashed 0007 triggers a SQLite batch_alter_table rebuild under the hood (ALTER ADD COLUMN is implemented as a full copy). The rebuild runs inside a single Alembic transaction, so SQLite guarantees atomicity — either the new table is fully populated or the original stays untouched. The test validates what successful output must look like: - 100 rows with a mix of ASCII, diacritics, CJK, and whitespace-wrapped names all survive the chain - name / quality_analysis_time values are preserved verbatim - name_lower is backfilled via Python's str.lower() (Unicode-correct, unlike SQLite's ASCII LOWER()) - no _alembic_tmp_journals orphan table is left behind Complements test_journal_migration_squash.py (which covers the simpler idempotency + head-stamp cases). * refactor(jabref): log abbreviation collisions at debug level The jabref downloader loads 14 CSV files in order and silently overwrites on duplicate keys. For abbreviations like "J Org Chem" that appear in multiple vocabularies (general + ACS) the last file loaded wins, with no audit trail. Emit a debug-level log line on each overwriting collision, mentioning the source filename, abbreviation, and the two competing full names. Debug level (not info/warning) because the collisions are expected — the current "last writer wins" behavior is kept, this is purely observability for operators who care to tail the log. * docs(doaj): flag ternary-to-binary seal-field collapse The DOAJ public CSV distinguishes three seal states: "yes", "no", and blank (application never submitted). scoring.py only needs the boolean floor today, so the importer collapses blank and "no" into has_seal=False. 
A future tier that rewards "applied and was denied" differently from "never applied" would need to preserve the raw ternary — add a comment so that future change isn't stalled rediscovering this. No functional change; code path unchanged. * chore(review-feedback): four follow-ups from the #3514 fixup review Addresses the must-fix + two should-fix items surfaced by a 3×10 subagent review pass. Three other flagged items (HTML-safe scaffold, _make_engine tempdir, fake_validate flag threading) are deferred with rationale noted in the planning file. db.py: the `lookup_institution` docstring advertised compact-format keys (n, c, t, h, …) left over from the pre-refactor dict layer. The accessor actually returns full-name keys via `_institution_to_dict` — update the docstring so the caller contract matches reality. test_safe_requests_redirects: the `test_dns_rebinding_case_blocked_on_second_hop` test does not model DNS rebinding; it mocks `validate_url` to return [True, False] for two distinct URLs. That's a per-hop re-evaluation test, not a rebinding one (which would require same hostname with different getaddrinfo results across calls). Rename to `test_second_hop_blocked_when_validator_rejects_redirect_target` and rewrite its docstring + the module docstring so the label stops overstating the coverage. Real rebinding coverage belongs alongside the validator unit tests and is flagged there as a follow-up. test_journal_migrations_encrypted: the test module had no sqlcipher3 guard — on a platform where sqlcipher3 is missing and `LDR_BOOTSTRAP_ALLOW_UNENCRYPTED=true` is set, `DatabaseManager` falls back to plain SQLite and the test silently passes. Add `pytest.importorskip("sqlcipher3", ...)` at module top to skip cleanly when the package is missing, and `assert db_manager.has_encryption` at the top of each test function to fail loudly when sqlcipher3 imports but the manager has turned encryption off for any reason. 
test_journal_rebuild_data_preservation: docstring claimed "every column value intact" but only `name` and `quality_analysis_time` are seeded and checked. Tighten the claim to what the test actually covers without reducing the real value the test adds (diacritic + CJK + padded-whitespace backfill coverage). * docs(journal-quality): predatory policy, release notes, and durability comment (#3516) * feat(journal-quality): clearer log milestones around first-run DB build The "Building X ..." message is too terse — on a fresh install the ~30s download + insert looks like a hung process. Expand the start message to mention the one-time nature + the download size, and include the source count in the completion log so the server log tells operators when the DB is ready to serve scoring. Addresses the UX gap previously considered a blocker: users already see the server log, so a milestone log line is enough (no UI progress event needed). * fix(journal-quality): set Windows readonly attribute after chmod chmod 0o444 is a no-op on Windows — the compiled journal-quality reference DB stays writable on Windows installs, violating the read-only invariant. Combine the POSIX chmod with a best-effort SetFileAttributesW(FILE_ATTRIBUTE_READONLY) on win32. Log a warning if SetFileAttributesW fails; the check-journal-quality-readonly.py pre-commit hook still enforces read-only opens in consumer code. * feat(journal-quality): pre-check free disk space before bulk download The five journal-quality data sources uncompress to ~1 GB of intermediate working set plus the compiled reference DB. On a small-disk machine, a mid-stream failure can leave an orphan .tmp-* file that blocks the next build. Fail fast with a clear "X.X GB available, 2 GB required" message before touching the sentinel or the network. Threshold is exposed as JOURNAL_QUALITY_MIN_FREE_DISK_BYTES in constants.py so ops can tune it if needed. 
OSError from shutil.disk_usage is non-fatal (logged, build proceeds) — don't block a download just because disk stats are unavailable. * security(journal-quality): stop leaking exception text into HTTP path CodeQL alerts 7650 and 7684 flagged that str(exc) from a build_db failure in download_journal_data() flows into the tuple's message string, and from there through to the /api/journal-data/download response. SQLAlchemy errors embed SQL statements and file paths — sanitize at the source by returning only the exception class name. Full traceback remains in logger.exception (server-side only). Add tests/journal_quality/test_downloader_exception_sanitization.py asserting that a simulated build_db error whose message contains stack-trace-shaped substrings never reaches the caller. * feat(safe-requests): add safe_get_with_retries and wire into journal-quality downloads Bulk journal-data downloads currently abort on the first transient network failure: a packet drop or short AWS S3 hiccup forces the user to restart from scratch. Add a safe_get_with_retries wrapper with exponential backoff (1/2/4s, 3 attempts by default), retrying on ConnectionError, Timeout, HTTP 429, and HTTP 5xx. Honors the Retry-After header when present. SSRF ValueErrors and non-429 4xx responses are passed through unchanged. The five journal-quality data sources (OpenAlex, DOAJ, predatory, JabRef, institutions) now import the retry wrapper instead of the bare safe_get. Call sites are unchanged beyond the import alias. * feat(journal-quality): detect OpenAlex field-level schema drift OpenAlex occasionally renames snapshot fields (the Works schema has seen h-index and ref-count migrations in the last year). The existing row-count floor catches a collapsed fetch but cannot tell the difference between "212K journals with h_index correctly populated" and "212K journals all silently None because the field was renamed". 
Sample the first 100 parsed rows after the parse loop and refuse to overwrite the snapshot if every one of them has h_index == None or every one has cited_by_count == None. Raise a new SchemaDriftError so operators can grep for it in logs and the CI release-gate job can fail fast on upstream breakage. * fix(migrations): squash the journal-model churn in 0007 + keep 0008/0010 as stubs The pre-squash chain had 0007 add 17 bibliometric / trust-signal columns + 3 indexes to the per-user journals table, 0008 add a sjr_quartile column + index, and 0010 drop all of 0007/0008's additions except three. On SQLite every batch_alter_table is a full-table rebuild, so every live user pays for TWO back-to-back rebuilds on the journals table within a single release for no net schema gain. New shape: 0007 adds only the columns the final form keeps — name_lower, score_source, quality_model — plus their indexes and the name_lower Python-side backfill (moved from 0009, because a Unicode-correct backfill belongs with the column that needs it). Downgrade drops the three it added. 0008 and 0010 remain as no-op stubs. A user whose alembic_version row reads "0008" or "0010" from a prior upgrade still needs a revision to walk through; deleting the files would strand them. Stubs are cheap, one return statement each, and keep the chain contiguous without forcing anyone to rewrite history. 0009 is simplified to its one remaining unique responsibility (ix_research_resources_research_id); the journals.name_lower work it used to duplicate now lives in the squashed 0007. 
Verified end-to-end against 206 existing migration + schema tests (including the full chain's up/down/up stairway per revision) and four new squash-specific regressions in tests/database/test_journal_migration_squash.py: - chain reaches head 0010 with the 7-column final shape - name_lower backfill handles diacritics (Café → café) - re-running run_migrations is idempotent - squashed 0007 is a no-op on a DB already stamped at 0010 * refactor(journal-quality): lookup_institution returns full-name keys The on-disk JSON snapshot uses one-character keys (n, c, t, h, if, w, cb, r) to save bytes across ~200K institutions. That's fine on-disk but a bad Python API — callers have to memorize the mapping, and a future schema change breaks every caller silently. _institution_to_dict now returns full names (name, country, type, h_index, impact_factor, works_count, cited_by_count, ror_id). The snapshot-reading code in _populate_institutions keeps the compact keys — only the public accessor changes. Grep confirms zero live callers today (only a comment mention in search_engine_openalex.py), so no migration needed. * refactor(journal-quality): extract _openalex_common for shared S3 helpers openalex.py and institutions.py duplicated three symbols: _OPENALEX_S3_BASE, the `s3://openalex/` manifest prefix check, and the s3_to_https translator. djpetti flagged this in PR #3081 review. Move them to data_sources/_openalex_common.py (stdlib-only, no circular imports) and import from both data-source modules. The on-disk compact key format and manifest fetch URLs stay where they are; only the duplicated helpers move. * test(safe-requests): cover redirect-hop SSRF validation + DNS rebinding safe_requests.py has always validated every redirect hop against the SSRF allowlist (lines 208–250), but the existing test suite only exercised the initial request. 
These five new tests drive the redirect loop itself: - redirect target is a private IP → blocked - redirect target is AWS metadata (169.254.169.254) → blocked - redirect loop exceeds 10 hops → raises ValueError("Too many") - DNS-rebinding case (first hop validates, redirect validates false for the same hostname) → blocked on the second hop - a legitimate redirect from one public URL to another is followed * feat(search-utilities): HTML-safe variant of the journal quality tag _format_quality_tag emits plaintext like "[Q1 ★★★★★]" which is fine when the caller renders the containing string as Markdown or plain text. Today every caller does that, so there's no live XSS. But the tag is typically embedded alongside a search-result title that came from an external search engine, and the first HTML-rendered consumer that does {{ title + quality_tag | safe }} or equivalent would leak any tags in the title. Add _format_quality_tag_html(quality, , title) that html.escape's the title (angle brackets, ampersands, quotes) and appends the plaintext tag. Existing callers are unchanged — this is the safe variant any future HTML-rendered caller should reach for. The existing helper gets a docstring warning so reviewers of future PRs know which variant is appropriate. * test(db): migrations 0006-0010 on a SQLCipher-encrypted DB The existing test_encrypted_database_orm.py exercises ORM CRUD over an encrypted DB but never explicitly walks the new journal-quality chain. This test creates a fresh keyed DB via DatabaseManager (which runs the full migration chain as part of create_user_database), inserts a Journal row with every kept column, closes the engine, reopens with the same key, and reads the row back. The second test asserts the final journals column set (id, name, name_lower, quality, score_source, quality_model, quality_analysis_time) is exactly what test_schema_stability expects. 
Guards against SQLCipher key-ordering regressions where a future change to sqlcipher_utils would let batch_alter_table's rebuild path see a non-keyed connection. * test(db): data preservation across journals-table rebuild Adding name_lower + its index in the squashed 0007 triggers a SQLite batch_alter_table rebuild under the hood (ALTER ADD COLUMN is implemented as a full copy). The rebuild runs inside a single Alembic transaction, so SQLite guarantees atomicity — either the new table is fully populated or the original stays untouched. The test validates what successful output must look like: - 100 rows with a mix of ASCII, diacritics, CJK, and whitespace-wrapped names all survive the chain - name / quality_analysis_time values are preserved verbatim - name_lower is backfilled via Python's str.lower() (Unicode-correct, unlike SQLite's ASCII LOWER()) - no _alembic_tmp_journals orphan table is left behind Complements test_journal_migration_squash.py (which covers the simpler idempotency + head-stamp cases). * refactor(jabref): log abbreviation collisions at debug level The jabref downloader loads 14 CSV files in order and silently overwrites on duplicate keys. For abbreviations like "J Org Chem" that appear in multiple vocabularies (general + ACS) the last file loaded wins, with no audit trail. Emit a debug-level log line on each overwriting collision, mentioning the source filename, abbreviation, and the two competing full names. Debug level (not info/warning) because the collisions are expected — the current "last writer wins" behavior is kept, this is purely observability for operators who care to tail the log. * docs(doaj): flag ternary-to-binary seal-field collapse The DOAJ public CSV distinguishes three seal states: "yes", "no", and blank (application never submitted). scoring.py only needs the boolean floor today, so the importer collapses blank and "no" into has_seal=False. 
A future tier that rewards "applied and was denied" differently from "never applied" would need to preserve the raw ternary — add a comment so that future change isn't stalled rediscovering this. No functional change; code path unchanged. * docs(journal-quality): document the predatory-list whitelist override Tier 1's auto-removal has a deliberate rescue clause: a journal flagged by Stop Predatory Journals is kept if it's listed in DOAJ or has h-index > PREDATORY_WHITELIST_HINDEX (default 10). This deliberately lets mainstream publishers who occasionally appear on community predatory lists (Frontiers, MDPI, Sage) through. The behavior has been in the code since the feature shipped, but it was undocumented — users seeing a flagged-but-not-removed journal had no way to tell whether that was a bug or a policy call. Add a "Predatory-List Overrides" section to docs/journal-quality.md explaining the rule, the rationale, and how to tighten or loosen it via PREDATORY_WHITELIST_HINDEX. * docs(release): pending notes for the journal-quality redesign Staging file documenting the changes introduced by #3081 so they can be folded into the next tagged version's release-notes file. Key entries: - Major features: tiered scoring, journal dashboard, quality tags - BREAKING: lists the 16 `journals` columns removed and points custom SQL consumers at the new reference DB accessor - Upgrade cost note (one-time per-user table rebuild, typically <1 s, 2–5 s on very large libraries) - Settings introduced (both opt-in) - Operational improvements carried by the PR A fix-up stack (Windows readonly, disk-space pre-check, download retries) * docs(journal-quality): explain synchronous=OFF durability tradeoff The reference-DB build sets PRAGMA synchronous=OFF during bulk insert. 
That looks scary at a glance because elsewhere in the codebase the same pragma would risk corruption, but here it's correct — the build writes to a unique .tmp-PID-RAND path, and any crash mid-build orphans that temp file while leaving the live DB untouched. The atomic os.replace() at the end of build_db is what provides durability, not synchronous=NORMAL. Add an inline comment so reviewers and grep-forensics readers don't need to reconstruct this from the surrounding code. * fix(journal-quality-docs): six accuracy fixes surfaced by 30-agent review docs/journal-quality.md - h-index quality bands: replace ≥ with strict > in the Quality Scale table and the Tier 2 threshold listing. scoring.py uses strict > at every boundary, so h=150 scores 8 (Strong), not 10 (Elite); the doc was off-by-one at every tier boundary. - Quality Scale "Strong" row: change "h-index 40-149" to "41-150" to match the actual band (`> 75` through `> 150` inclusive-ish). - Data-sources table: DOAJ row `~35K` → `~22K`. The code's three count claims (doaj.py docstring, description, _MIN_DOAJ_JOURNALS floor) all correctly say 22K, which matches the upstream DOAJ size. 35K overstates coverage by ~60%. - Predatory-list override rationale: drop "Frontiers, MDPI, Sage" from the false-positive example. Only Frontiers is actually in the Stop Predatory Journals CSVs this code ingests; MDPI and Sage are not. Neutral phrasing preserves the argument without misattributing flag status to specific publishers. docs/release_notes/pending-journal-quality-redesign.md - Settings section: "both opt-in" was wrong. The per-engine toggles default `true` (opt-out), and three sibling toggles (arxiv, openalex, nasa_ads) ship alongside the one the notes named. Rewrite as "1 opt-in + 4 opt-out" listing all five keys. - First-use download timing: "10-30 s" is under OpenAlex's own 30-60 s floor, and the five sources fetch sequentially in downloader.py. 
Widen to "1-2 minutes" with the OpenAlex-alone baseline called out so operators don't expect 10 s. src/local_deep_research/journal_quality/db.py - Broaden the synchronous=OFF comment's lede to include `journal_mode = OFF`. The atomic-rename invariant actually protects the whole pragma set, not just synchronous; the final "Do NOT copy this pragma set" warning was body-mismatched. * test(journal-quality): update stale assertions to match recent fixes Three tests lagged behind earlier commits on this branch: - test_journal_reputation_coverage: mock chain missed the new quality_model filter added in `55a99a7f2` (Tier 0 LLM-only cache). Both above/below-threshold cases get the extra .filter link. - test_db::test_print_and_electronic_issn_both_survive: ISSNs are stored in canonical no-dash form (normalize_issn) as of 55a99a7f2; assertion updated to match. - test_downloader::test_build_db_failure_returns_false: exception message is no longer surfaced to callers (info-disclosure hardening in `da803376d`); assert on exception class name instead. * fix(journal-quality-ui): correct whitelist copy + h-index band operators (#3525) Two UI-copy drifts surfaced by the review pass on #3516: - Trust-signals bullet for "Predatory" described the flag without mentioning the whitelist carveout, so a user seeing a predatory journal in their results had no way to tell why it survived. Add the DOAJ-or-PREDATORY_WHITELIST_HINDEX rescue clause. - Threshold-2 description had the same gap; match the trust-signal wording. - Threshold-slider descriptions for 7 / 8 / 9 / 10 used `≥` for the h-index bands, but `scoring.py` uses strict `>` (matches the doc fix made in #3516 for `docs/journal-quality.md`). At each boundary value the UI overstated what the threshold keeps — e.g. threshold 10 described h-index ≥ 150 keeps Nature/Science, but a journal with h=150 exactly would score 8, not 10. Pure template/string change; no JS logic touched. 
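The boundary behavior this fix describes can be illustrated with a minimal sketch. Only the value 150 and the strict-versus-inclusive distinction come from the commit text; the function names are hypothetical, and the real `scoring.py` has more bands than the single Elite cutoff shown here.

```python
# Boundary value quoted in the commit text; everything else is illustrative.
ELITE_HINDEX = 150


def is_elite_strict(h_index: int) -> bool:
    """Strict '>' comparison, as scoring.py is described to use."""
    return h_index > ELITE_HINDEX


def is_elite_inclusive(h_index: int) -> bool:
    """The '>=' reading that the old docs and UI copy implied."""
    return h_index >= ELITE_HINDEX


if __name__ == "__main__":
    for h in (149, 150, 151):
        print(h, is_elite_strict(h), is_elite_inclusive(h))
```

The two readings differ only at the boundary value itself, which is why the copy was wrong for a journal with h=150 exactly.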
* fix: Round 6-7 follow-ups — thread safety, resource leak, perf (#3452) * fix: add lock around shared SearXNG engine in journal filter (Round 6) The JournalReputationFilter instance is cached inside the parallel search engine and shared across worker threads. When Tier 4 (LLM analysis) is enabled, two concurrent filter_results calls could both invoke self.__engine.run(query) on the same SearXNG instance, causing the engine's mutable bookkeeping state (_last_results_count, _search_results, rate tracker) to race. Tier 4 is disabled by default and rarely hit, so contention cost is negligible compared to the correctness guarantee. * fix: Round 7 — resource leak + perf hotspots 1. BaseSearchEngine.close() now closes _preview_filters too (journal reputation filter is registered as preview, not content) 2. __clean_journal_name memoized per batch via local dict 3. _resolve_journal_id memoized per batch via journal_id_cache * test: add savepoint isolation and _json_safe integration tests - test_batch_with_failing_source_savepoint_isolation: verifies a 3-source batch persists all 3 when using savepoints - test_json_safe_rejects_non_serializable_source: verifies a source containing a datetime object (non-JSON-safe) is correctly sanitized via _json_safe and the Paper row is persisted without crashing json.dumps() at flush time * refactor(journal-quality): collapse migrations 0006-0010 into one The journal-quality feature has not shipped, so its five-migration history (with two no-op stubs at 0008 and 0010 preserved for mid-chain dev DBs) is debt that protects a user population that doesn't exist. Collapses into a single 0006_journal_quality_system.py that creates the papers/paper_appearances tables, adds the three kept columns and two indexes to journals (with the diacritic-safe name_lower backfill), and adds ix_research_resources_research_id — the net effect of the pre-squash chain. 
Deletes test_journal_migration_squash.py along with its mid-chain regression tests (no longer reachable). All migration test suites pass locally (271 tests across 7 files). Dev databases on the branch stamped at 0006-0010 will need to be reset — delete the file and let the app re-initialize on next start. * fix(search): remove duplicate _preview_filters close loop The close() method iterated _preview_filters twice — once before and once after the _content_filters loop. safe_close() logs a warning on the second invocation against an already-closed resource; keep a single pass. * fix(migrations): use UtcDateTime in 0006 journal quality Migration 0006 used sa.DateTime(timezone=True) on three timestamp columns. Main's new check_datetime_timezone.py hook (commit `bab0f61b6`) rejects that pattern outside tests, so the migration would fail pre-commit on rebase. Switch to UtcDateTime with server_default=utcnow() to match the rest of the codebase. * fix(security): rate-limit journal-data download + CSRF header - Add journal_data_limit (2/hour per authenticated user) in rate_limiter.py - Decorate POST /api/journal-data/download to cap manual rebuilds - Send X-CSRFToken in the dashboard's fetch; Flask-WTF already enforces CSRF at the blueprint level, so without this header the button would start returning 400 * test(arxiv): assert journal_ref is forwarded in previews Parametrize test_paper_in_cache_no_pdf over journal_ref so the result dict's journal_ref key is checked both when absent (None) and when present (a realistic citation string). Guards against accidental removal of the forwarding added in `d88de731d4`. * fix(openalex): detect id rename, journal-only drift sample, surface SchemaDriftError Three related drift-detection gaps: - An ``id``→``source_id`` rename causes every record to be dropped at parse time, hitting the row-count floor with a generic RuntimeError that hides the cause. 
Track raw parse counts and raise a specific SchemaDriftError when parsed_records is healthy but parsed_with_id is zero. The check runs before the row-count floor so it wins. - The ``h_index``/``cited_by_count`` drift sample scanned all source types, which would false-trigger on snapshots skewed to conferences or other types that legitimately lack ``h_index``. Filter the sample to ``type == "journal"`` records only. - ``downloader.py`` collapsed ``SchemaDriftError`` into its class name as part of CodeQL info-disclosure hardening. Drift messages are developer-authored string literals with no SQL/path/stack content, so surface them verbatim while keeping generic exceptions scrubbed. Also updates existing drift assertions to the new "journal sample" phrasing and adds end-to-end tests for the id-rename and conference-only-snapshot paths. * test(metrics): cover journal-quality endpoints + cross-user isolation Adds targeted coverage for the four endpoints the PR introduces, plus an ownership test on the per-research endpoint: - TestApiJournalQuality: auth, per_page clamp to 200, sort-injection pass-through to the DB-layer allowlist. Mocks get_journal_reference_db so the route logic runs without triggering the lazy network-fetch build. - TestApiJournalDataStatus: auth check, dict-shape response. - TestApiJournalDataDownload: auth check, authenticated POST reaches the handler (mocked downloader, no network). - TestApiResearchJournals.test_other_users_research_id_returns_404: registers a second test user in a fresh client and confirms they cannot fetch user A's research_id — the per-user encrypted DB is the ownership boundary. Gracefully skips if multi-user registration is unavailable in the env. * fix(db): validate order param in get_institutions_page Matches the existing defensive guard in get_journals_page. 
The current ternary is safe via ORM (.asc() / .desc() only), but the explicit allowlist prevents future refactors from accidentally interpolating a tainted value into raw SQL. * refactor(db): drop redundant index=True on research_resources.research_id Migration 0006 already creates `ix_research_resources_research_id` on this column. Leaving `index=True` on the model means `create_all()` (e.g. in ad-hoc tests or tooling) would add a second unnamed index on the same column — wasted storage + write cost. * fix(filter): strip zero-width and bidi chars in _sanitize_name Replace the narrow C0/C1-only regex with log_sanitizer.strip_control_chars, which covers C0/C1 + Arabic letter mark + zero-width space/joiner/mark + bidi override (U+202A-E) + isolate (U+2066-9) + digit shape controls + BOM. Tier 4 (LLM) is opt-in and the score is strictly validated, so the real exploit surface is minimal — but a crafted bidi-override in a quoted journal name could still confuse LLM or log rendering. Using the comprehensive, audited pattern eliminates a regex drift point. * fix(engines): forward ISSN from PubMed and OpenAlex previews The journal reputation filter already reads `result.get("issn")` for Tier 2/3 lookups, but neither OpenAlex nor PubMed was forwarding it. - OpenAlex: extract `source.issn_l` (linking ISSN) and add to the preview dict alongside the existing `openalex_source_id`. - PubMed: esummary already extracted `issn` / `essn` into `summaries` (line 766). Forward to the preview (prefer issn, fall back to essn). NASA-ADS is not included — the esummary API we call does not return ISSN (the field list uses bibstem codes instead). Without ISSN, the filter falls back to name-only matching which is slower and less reliable on journal-name variants ("Nat Commun" vs "Nature Communications"). With ISSN the lookups hit the indexed column. 
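A rough sketch of the "prefer issn, fall back to essn" forwarding described for PubMed follows. The helper names and the shape of the `summary` and preview dicts are assumptions for illustration; only the `issn` / `essn` keys and the `result.get("issn")` read on the filter side come from the commit text.

```python
from typing import Optional


def pick_issn(summary: dict) -> Optional[str]:
    """Return the `issn` value if present, else `essn`, else None."""
    # Empty strings are treated as missing so the fallback still applies.
    return summary.get("issn") or summary.get("essn") or None


def build_preview(summary: dict) -> dict:
    """Build a preview dict (hypothetical shape) that forwards the ISSN."""
    preview = {"title": summary.get("title", "")}
    issn = pick_issn(summary)
    if issn:
        # The reputation filter is described as reading result.get("issn").
        preview["issn"] = issn
    return preview
```

Leaving the key out when no ISSN is available keeps the name-only fallback path on the filter side unchanged.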
* fix(filter): propagate settings-read errors in create_default The inner ``except Exception: enabled = True`` wrapped only the settings snapshot read and silently defaulted the filter to enabled if anything went wrong — a corrupted snapshot, a DB lock, an import error — all of which should surface, not be masked. Per CLAUDE.md: no silent fallbacks. Merge the inner catch into the outer one. Any error (settings read or filter init) returns None, and ``logger.exception`` records the real cause so operators can see what broke. Adds a regression test asserting create_default returns None when get_setting_from_snapshot raises. * docs(journal-quality): troubleshooting, DB management, Tier 4 cost - Add Tier 4 Cost & Latency callout (latency ~3-10s per unknown journal, ~300-500 tokens per analysis, cached 365 days by default). - Add Troubleshooting section covering the common questions: low score, missing journal, performance. - Add Database Management section with per-OS DB path, read-only enforcement notes, and force-rebuild instructions. - Rename pending release note to 1.6.0.md (current version 1.5.6; this PR bumps minor because it adds a new dashboard + changes the journals table schema). * test(migrations): dedicated upgrade/downgrade roundtrip for 0006 Migration 0006 consolidates five originally-separate revisions into one atomic change. The existing generic alembic test doesn't exercise the specific objects this migration creates. Covers: - papers table with doi/arxiv_id/pmid UNIQUEs and journal_id FK on ON DELETE SET NULL (preserves paper provenance when journals are removed). - paper_appearances join table with both FKs on ON DELETE CASCADE and resource_id UNIQUE (dedup guard at the schema level). - journals.name_lower backfill — diacritics survive Python str.lower. - upgrade → downgrade → upgrade roundtrip asserts downgrade removes every object upgrade created, and that upgrade idempotently rebuilds. 
The paper_appearances index test checks by column coverage rather than index name: the ORM pre-creates the table via Base.metadata.create_all elsewhere, so the migration's explicit idx_* name isn't what ends up in the DB. That's a separate pre-existing issue, not regressed here. * test(db): regression guard for get_institutions_page order allowlist Exercises the defensive guard added in commit `23b57a054`. A tainted ``order`` string must not crash or leak into SQL; the DB layer treats anything other than "asc"/"desc" as "desc", so the two calls below must return identical institution lists. Mirrors the style of test_invalid_sort_column_defaults_to_quality in TestGetJournalsPage. * fix(journal-quality): stale sentinel recovery + live download progress Two related problems the user hit on a fresh install: 1. Stale `.downloading` sentinel blocked every retry. When the download thread dies mid-way (HTTP timeout, client disconnect, SIGKILL) the `finally` cleanup never runs and the sentinel lingers. The next request got "Download already in progress" forever. Add a stale-age check (20 min > expected 7 min wall-clock) that reclaims the sentinel instead of refusing. 2. The progress UI was fake: jumped to 30% and sat there for ~7 minutes with no indication of what's happening or what source is being fetched. When the download died silently the user saw "Download failed" with zero context. Add a module-level `_download_state` dict updated at every phase transition (per-source start, DB build, success, failure). Expose it via the existing `/metrics/api/journal-data/status` endpoint under a `download_progress` key. The dashboard polls it every 2 s while a download is in flight and renders real text like "[23%] Downloading OpenAlex — source 1 of 5". Also probe the status on page load: if a download started elsewhere (background init, another tab) the dashboard shows the live progress instead of a stale "Not downloaded" banner with a fresh button. 
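The stale-sentinel reclaim in item 1 could look roughly like the sketch below. The 20-minute limit is taken from the commit text; the function name, the use of the file's mtime as the age source, and the `O_EXCL` create are assumptions, not the project's actual implementation.

```python
import os
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 20 * 60  # 20 min, per the commit text


def try_claim_sentinel(sentinel: Path, now: Optional[float] = None) -> bool:
    """Return True if this caller now owns the sentinel file."""
    now = time.time() if now is None else now
    try:
        age = now - sentinel.stat().st_mtime
    except FileNotFoundError:
        age = None  # no sentinel yet
    if age is not None and age <= STALE_AFTER_SECONDS:
        return False  # a download is plausibly still running
    if age is not None:
        # Older than the limit: assume the owner died and reclaim it.
        sentinel.unlink(missing_ok=True)
    try:
        # O_EXCL makes creation atomic, so only one racing caller wins.
        fd = os.open(sentinel, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A caller that gets `True` is responsible for removing the sentinel when its download finishes; the age check only covers the case where that cleanup never ran.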
The download is still a synchronous HTTP POST (closing the tab doesn't cancel the server work), so the CTA text is updated to tell the user they can close the tab and the download continues server-side. * feat(journal-quality): parallel source downloads + per-source progress rows Parallelize the 5 source downloads via ThreadPoolExecutor; restructure the shared download state into per-source entries so the dashboard can render one progress row per source (+ a sixth for the DB build step). Each source streams from a different host (OpenAlex S3, DOAJ, GitHub raw, OpenAlex REST for institutions) so there's no single-host contention; wall-clock is now bound by the slowest source rather than the sum. release-gate.yml already uses this pattern for the integration test. Also fixes a UX bug: the journals-table API returns 503 when the reference DB isn't built yet, which the dashboard rendered as a scary red "Failed to load journal data" box. The install CTA banner above already communicates the state, so we silently ignore the 503. test_openalex_failure now mocks all 5 sources because in parallel mode non-OpenAlex workers still run (just want them to return 0 quickly). * feat(journal-quality): per-partition progress callback for live bars Dashboard feedback: the two OpenAlex sources sat at "running" for 30-60 s each with the bar showing a frozen 50% — no sense of motion even though the server logs "5/39 parts" periodically. - DataSource.fetch gains optional `progress_cb(done, total, detail)`. - openalex.py + institutions.py call it on every partition (not just every 5th like the human-readable log). - One-shot sources (doaj, jabref, predatory) take the kwarg but ignore it — they finish in <10 s so the 0 → 100 snap is fine. - downloader._fetch_one wraps the callback to write a per-source `percent` field in _download_state; the status endpoint carries it to the dashboard. - Frontend row bar uses that percent instead of the 50% placeholder it had for the running state. 
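The callback plumbing in the list above can be sketched as follows. The `progress_cb(done, total, detail)` signature and the per-source `percent` field are from the commit text; `fetch_partitions`, `make_state_writer`, and the shape of the state dict are hypothetical stand-ins for the real fetch and downloader code.

```python
from typing import Callable, Optional

ProgressCb = Callable[[int, int, str], None]


def fetch_partitions(parts: list, progress_cb: Optional[ProgressCb] = None) -> int:
    """Process each partition and report progress after every one."""
    rows = 0
    total = len(parts)
    for done, part in enumerate(parts, start=1):
        rows += len(part)  # stand-in for download + parse of one partition
        if progress_cb is not None:
            progress_cb(done, total, f"{done}/{total} parts")
    return rows


def make_state_writer(state: dict, source: str) -> ProgressCb:
    """Wrap a shared state dict so a source can publish its percent."""

    def _cb(done: int, total: int, detail: str) -> None:
        percent = int(done * 100 / total) if total else 0
        state[source] = {"percent": percent, "detail": detail}

    return _cb
```

Because the callback is optional, a one-shot source can accept the keyword and never call it, which matches the described behavior for the smaller sources.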
11 downloader tests green; no test changes needed (mocks pass through kwargs transparently). * feat(journal-quality): pending marker when ref DB still downloading On a fresh install the search can fire before the reference DB finishes building — every journal then falls through the "no scoring data" branch and gets marked score 3 ("low-confidence unknown"), which is misleading because we don't actually know the journal is unknown; we just haven't loaded the data yet. Introduce a QUALITY_PENDING = "pending" sentinel in search_utilities. filter_results checks `data_manager.available` at the top of the batch; if False, it skips all scoring and tags each result with the sentinel instead. The tag renderer recognizes the sentinel and emits: [journal quality data still downloading — check /metrics/journals and re-run the search once the build finishes] This only fires during the narrow window between "user kicks off install" and "reference DB built" — once the DB exists, normal scoring resumes on every subsequent search. 63 filter + tag tests still green (accept string sentinel alongside int|None). * fix(filter): probe DB file directly instead of .available The ``.available`` property on JournalQualityDB has side effects — it calls ``_ensure_engine()`` which tries to lazy-build the DB and, when a download is in flight, blocks for several minutes waiting for the build to finish. That defeats the pending-marker logic I just added: .available would eventually return True once the build completed, so the fail-soft branch never fired. Check ``journal_quality.db`` file existence directly (a cheap stat) before deciding whether to mark results as pending. If the file isn't on disk yet, we're still in the fresh-install window — skip scoring, return results with the QUALITY_PENDING sentinel. This also avoids the thundering herd of 30+ filter workers each triggering a build attempt via ``.available``.
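The file-existence probe from the commit above can be sketched roughly as follows. The `journal_quality.db` filename and the `QUALITY_PENDING` value are taken from the message; `db_ready`, the `data_dir` parameter, and this simplified `filter_results` signature are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

QUALITY_PENDING = "pending"


def db_ready(data_dir):
    """Cheap stat only: no engine creation, no lazy build, no blocking."""
    return (Path(data_dir) / "journal_quality.db").is_file()


def filter_results(results, data_dir, score):
    """Hypothetical, simplified filter: tag as pending until the DB exists."""
    if not db_ready(data_dir):
        # Fresh-install window: skip scoring and tag every result instead.
        return [{**r, "journal_quality": QUALITY_PENDING} for r in results]
    return [{**r, "journal_quality": score(r)} for r in results]


with tempfile.TemporaryDirectory() as tmp:
    batch = [{"title": "A"}, {"title": "B"}]
    pending = filter_results(batch, tmp, score=lambda r: 7)
    (Path(tmp) / "journal_quality.db").touch()  # simulate a finished build
    scored = filter_results(batch, tmp, score=lambda r: 7)
```

Because the probe never touches the engine, any number of concurrent workers can call it without starting a build.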
* docs(filter): clarify pending-marker copy — download in flight Earlier copy said "check /metrics/journals and re-run once the build finishes", which could be read as "the download hasn't started yet — go trigger it". Reassure the user: the download IS already running in parallel and may even be complete by the time they click through. This avoids the "did my search error out?" reaction. * fix(downloader): 30s cooldown cache on ensure_journal_data Thundering-herd guard. During a search, every search engine's reputation-filter worker (~30 threads) called ensure_journal_data concurrently. On a fresh install (no data files yet) they all raced to create the .downloading sentinel; one won, 29 got rejected and each logged a WARNING. Observed: 30 identical warnings in a single millisecond. Module-level tuple cache: (timestamp, result). Successful calls (data files already present) are still fast and uncached — that's a single stat() and the caller gets the real answer. Only the negative/"download failed or still running" result is cached, for 30 seconds. First caller does the real work; the other 29 within the window get the cached (None, False) and move on. Cache entry naturally self-expires, so subsequent batches re-check. * fix(filter): strip arxiv journal_ref edge cases + respect exclude_non_published in pending mode Two concrete gaps the fresh-install test surfaced: 1. Trailing empty parens. ArXiv journal_refs sometimes end with "()" when the citation year got stripped upstream, e.g. "Physical Review Research ()". Regex-strip whitespace-only trailing parens. 2. Truncated volume/page markers. ArXiv preview cuts citations mid-keyword: "Plasma Physics and Controlled Fusion, vol. 63" → "Plasma Physics and Controlled Fusion, v". Strip trailing ", v" / ", vol" / ", p" / ", pp" / ", no". Also refines the pending-marker fail-soft path: when exclude_non_published is True, results without a journal_ref are still dropped even in pending mode. 
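The two journal_ref clean-ups listed above could look roughly like this. The patterns are a guess at the intent (whitespace-only trailing parens, and a trailing ", v" / ", vol" / ", p" / ", pp" / ", no" fragment); the function name `clean_journal_ref` and the exact regexes are hypothetical, not the project's.

```python
import re

# Trailing parentheses that contain nothing but whitespace, e.g. " ()".
_EMPTY_PARENS = re.compile(r"\s*\(\s*\)\s*$")
# A citation cut off right after the volume/page/number keyword.
_TRUNCATED_MARKER = re.compile(r",\s*(?:vol|pp|no|v|p)\s*$")


def clean_journal_ref(ref):
    """Strip the two trailing artifacts and surrounding whitespace."""
    ref = _EMPTY_PARENS.sub("", ref)
    ref = _TRUNCATED_MARKER.sub("", ref)
    return ref.strip()


cleaned = [
    clean_journal_ref("Physical Review Research ()"),
    clean_journal_ref("Plasma Physics and Controlled Fusion, v"),
    clean_journal_ref("Nature 585 (2020)"),
]
```

Both patterns are anchored at the end of the string, so a complete citation such as "..., vol. 63" is left untouched.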
Only venued results carry the marker. Previously the pending early-return short-circuited the exclude-non-published check and returned all results. 9 new parametrized regex cases guard the two fixes + 3 regressions. * debug(filter): log db_ready probe + pending-tag counts Make the pending-marker path visible in the log. Previous code logged a single generic WARNING without counts, so operators couldn't tell whether the path fired or which results got the marker. - Log on_disk / engine_cached / db_ready values at the probe site. - Log exception stack if the probe raises. - Log tagged / kept / dropped counts at the end of the pending branch. * fix(filter): kick off background fetch when pending path fires The pending-marker copy tells the user "by the time you check /metrics/journals it may already be complete" — but that was a lie. When I replaced the side-effect-ful ``.available`` check with the cheap file-existence probe, I also removed the code path that indirectly triggered ``ensure_journal_data``. Net result: the filter correctly tagged results as pending but never started the download. Users would see "pending" forever unless they manually clicked the Download button on the dashboard. Spawn the download in a daemon thread on first hit of the pending path. A module-level threading.Lock guards the spawn — 30 concurrent filter workers can't each start their own thread (the first one gets through, rest see ``_bg_fetch_thread.is_alive()`` and bow out). The 30-second TTL cache in ``ensure_journal_data`` is a second line of defence. Daemon thread so it doesn't block process exit. * docs(journal-quality): add 5th help step on data storage + refresh Existing 4-step panel explains scoring but says nothing about where the data lives or how to refresh it. User feedback asked for that context. Add a 5th step scoped to admins: - Path on Linux/macOS + Windows. - Explicit note that the data is shared across all users on the server — a forced refresh affects everyone. 
- Refresh recipe: stop server → delete files → restart → next search or dashboard visit re-downloads in the background. - Marks it as "typically an admin task; normal users don't need to refresh" to discourage casual reloads. No refresh button — it would affect all users and mid-search quality scores would disappear, which is multi-tenant hostile. The existing Download button already force-refreshes when needed. * fix(downloader): clear orphan .downloading sentinel on startup If the previous server process got killed mid-download (SIGKILL, crash, restart during a fresh install), the ``finally`` cleanup in download_journal_data never runs and the sentinel file survives on disk. The new process then sees the orphan sentinel on every retry and bows out with "Download already in progress" — but nothing is actually downloading. The user is blocked for up to 20 minutes (the _SENTINEL_STALE_SECS stale-reclaim window). A fresh process cannot own an in-flight download, so any sentinel present at module import time is by definition orphan debris. Unlink it on startup with a clear WARNING log line so operators can see what happened. Hit this during fresh-install testing: restarted the server while the OpenAlex sources were still streaming; the sentinel survived; the next search's background-fetch attempt was stuck waiting for a download that no longer existed. * fix(downloader): PID-based sentinel liveness check on every call Complement to the startup orphan cleanup: even within a single server process lifetime, a download thread can crash mid-flight and leave the sentinel behind. The startup hook only catches cross-restart orphans; this handles same-process ones. Stamp the sentinel with the owner process's PID at creation time. When a new call sees an existing sentinel it: 1. Reads the PID. 2. If the PID is not alive (ProcessLookupError from os.kill(..., 0)) or the sentinel is malformed → reclaim immediately. 3.
If the PID matches our own → don't nuke self; treat as alive (the module-level lock should prevent this, but err safe). 4. Otherwise → still owned, bow out with "already in progress". 5. The 20-minute age-based reclaim remains as a last-resort fallback. Update test_concurrent_download_blocked to stamp the current PID into the simulated sentinel so the liveness check returns "alive" instead of treating an empty sentinel as orphan and falling through to real network calls. * docs(journal-quality): rewrite step-5 help without HTML entities The help_step macro renders its body as plain text, so <code>...</code> tags and the HTML entities for the ampersand, em dash, and apostrophe showed up as literal strings in the UI. Strip the HTML and use plain Unicode characters in place of the entities (a literal &, a literal —, and a straight apostrophe). Inline code becomes plain monospaced-looking text — close enough given the surrounding steps have no inline-code formatting either. * feat(filter): QUALITY_PREPRINT sentinel + explicit per-score tag mapping User feedback: the [Unranked ★] tier never appeared in reports, and arxiv preprints had no tag at all — users couldn't tell whether the quality column was blank because of "no venue" or because "DB failed to load". Two changes: 1. Add QUALITY_PREPRINT = "preprint" sentinel. The filter's _handle_no_venue path now sets result["journal_quality"] to this when Tier 3.5 (institution salvage) doesn't produce a numeric score. The tag renderer emits "[preprint — not in journal catalog]". 2. Rewrite _format_quality_tag with an explicit branch per score (1 through 10) instead of >= ranges. Adjusts: - Score 3 ("no scoring data" fallback) now renders [Unranked ★] instead of [Q4 ★]. Semantically correct: we don't know the venue, we're not claiming it's low-quality. - Score 4 still renders [Unranked ★] (DEFAULT for "in catalog, no h-index signal"). - Out-of-set values fall through to f"[quality={value!r}]" so a broken scoring-logic change surfaces the raw value instead of silently bucketing into Q4.
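For the PID-based sentinel liveness check described earlier in this message, a minimal sketch under POSIX assumptions follows. The `os.kill(pid, 0)` probe and the "malformed sentinel counts as orphan" rule come from the message; the function name `sentinel_owner_alive` is borrowed from a later commit note, but the body, the `PermissionError` branch, and the `pid <= 0` guard are assumptions.

```python
import os
import tempfile
from pathlib import Path


def sentinel_owner_alive(sentinel):
    """Return True only if the sentinel names a PID that still exists."""
    try:
        pid = int(sentinel.read_text(encoding="utf-8").strip())
    except (OSError, ValueError):
        return False  # missing or malformed sentinel is treated as orphan
    if pid <= 0:
        return False  # assumption: 0 and negatives address process groups
    if pid == os.getpid():
        return True  # never reclaim a sentinel this process owns
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence (POSIX only)
    except ProcessLookupError:
        return False  # owner is gone, safe to reclaim
    except PermissionError:
        return True  # assumption: process exists but belongs to another user
    return True


with tempfile.TemporaryDirectory() as tmp:
    sentinel = Path(tmp) / ".downloading"
    sentinel.write_text(str(os.getpid()), encoding="utf-8")
    own_pid_alive = sentinel_owner_alive(sentinel)
    sentinel.write_text("", encoding="utf-8")
    empty_alive = sentinel_owner_alive(sentinel)
    missing_alive = sentinel_owner_alive(Path(tmp) / "missing")
```

On Windows, signal value 0 has a different meaning for `os.kill`, so a portable version would need another liveness probe.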
Adds tests: - score 3 → Unranked (the user-visible change) - QUALITY_PREPRINT → preprint tag - QUALITY_PENDING → existing downloading message - out-of-range values surface raw in [quality=…] - every VALID_QUALITY_SCORES member maps to a real tier tag Also: downgrade() docstring gains a data-loss warning; release notes update the outdated "sources are fetched sequentially" claim to reflect the parallel ThreadPoolExecutor we shipped. * feat(journals): denormalize container_title + journal_quality onto Paper The "Your Research" tab of /metrics/journals was empty for every user in the default config. Root cause: filter's Tier 1-3 scoring (predatory/OpenAlex/DOAJ/institutions, covering >99% of lookups) reads the bundled read-only reference DB and never writes Journal rows to the per-user encrypted DB. Only Tier 4 (LLM, opt-in, default OFF) creates them. With no Journal rows, the dashboard's SELECT COUNT(*) FROM journals hit 0 and returned the empty state. Fix: promote two fields to first-class Paper columns: - container_title (String(500), indexed) — the cleaned name that keyed the filter's successful score. Always populated when the filter scored the journal. Dashboard GROUP BY key. - journal_quality (Integer) — populated ONLY by the Tier 4 LLM path (expensive + non-deterministic → worth freezing). Tier 1-3 scores are deterministic and recomputed live from the ref DB so the dashboard tracks upstream data updates without staleness. The existing Journal table + journal_id FK + LLM cache path are unchanged. Dashboard endpoints (/api/journals/user-research and /api/journals/research/<id>) now group on Paper.container_title, batch-enrich via a new ref-DB helper lookup_sources_batch (one SQL round-trip per page load instead of N per-row lookups), and pick quality via a precedence order: frozen LLM verdict → live ref DB score → NULL if neither. Migration 0006 modified in place — it hasn't shipped yet.
Dev DBs stamped at 0006 need to be reset (as the migration's existing header already notes). * fix(scoring): cap preprint repositories at ACCEPTABLE (5) arXiv (Cornell) and other preprint repositories were being rated Q1 ELITE (10) because derive_quality_score treated every source_type the same — arXiv has h_index=674 + Q1 in OpenAlex, so it hit the elite branch. But repositories aren't peer-reviewed: their h-index reflects aggregate citation accumulation across the thousands of papers they host, not venue rigor. Fix: short-circuit source_type=="repository" to REPOSITORY_QUALITY_DEFAULT (5) right after the predatory check, same pattern as conferences. The filter's existing Tier 3.5 institution salvage can still lift this to 6 when authors are at a strong institution. Q-tier semantics stay meaningful for the 234K real journals in the ref DB. Bumped JOURNAL_DATA_VERSION v3→v4 so existing installs rebuild the ref DB and pick up the corrected scores for the ~6,789 repository entries. * fix(normalizer): strip 'unknown' placeholder from container_title OpenAlex and NASA ADS search engines emit journal="unknown" when the upstream record has no venue indexed. The citation normalizer's waterfall fallthrough picked that up as container_title, which then (a) leaked into Paper.container_title so the dashboard showed a literal "unknown" row grouping across papers from multiple real journals, and (b) matched a real OpenAlex source actually named "unknown" (Q1, h_index=5, score=8) during name-based ref-DB lookup, producing a nonsensical Q1 rating. Fix at both layers: - OpenAlex + NASA ADS engines now emit None for both journal and journal_ref when the underlying venue is missing, matching what journal_ref already did. - Normalizer strips literal "unknown" / empty values from the container waterfall defensively in case any other engine ever emits the same sentinel. Covers "Unknown" / "UNKNOWN" / whitespace-padded variants.
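The defensive half of the normalizer fix can be sketched as a waterfall that skips placeholder values. This is an illustration only: `clean_container_title` and its varargs signature are hypothetical, and the real normalizer's candidate order is not shown in the message.

```python
def clean_container_title(*candidates):
    """Return the first candidate that is a real venue name, else None."""
    for value in candidates:
        if not isinstance(value, str):
            continue  # None and other non-string values never qualify
        stripped = value.strip()
        # Drop empty strings and any casing of the "unknown" placeholder.
        if stripped and stripped.lower() != "unknown":
            return stripped
    return None


picked = clean_container_title("unknown", " UNKNOWN ", "", None, "Nature Medicine")
nothing = clean_container_title("Unknown", "   ")
```

Returning None rather than a sentinel string keeps the placeholder from ever matching a real source by name.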
* fix(journals): tag score source + fail-closed predatory on filter crash Two correctness blockers surfaced by the post-rewire audit. B2 — llm_cached gate was unsound. save_research_sources used `journal_id is not None` as the gate for persisting Paper.journal_quality, but _resolve_journal_id matches by name_lower alone — so any prior LLM- enabled session's Journal cache row made the gate True, and a subsequent Tier-2 score got stamped as if it were an LLM verdict. Violated the "only LLM verdicts are frozen" invariant documented at citation.py:49-54. Fix: __score_journal now returns (score, source_tag) where source_tag identifies which tier produced the value — "openalex", "doaj", "institution", "llm" (Tier 4 live scoring OR cache hit on a prior LLM row), "conference", "low_confidence", or None for predatory. The filter attaches source_tag to each scored result, and save_research_sources gates journal_quality persistence on `source_tag == "llm"` instead of the FK-presence check. S4 — filter top-level except re-admitted predatory. The catch-all at the end of filter_results returned `results` (raw input) instead of `filtered` (predatory-free), so any Tier 1 auto-removed predatory journals would leak back into the output when the filter hit an unexpected exception mid-batch. Fix: return `filtered` instead. Initialize `filtered = []` before the try so the except branch can always reference it even if the crash fires before Pass 1 populates anything. Losing in-flight non-predatory results is preferable to breaking the predatory-removal safety contract. * fix(journals): schema-drift, container_title dedup, stale-version warn B.1 — drop server_default=utcnow() from migration 0006's papers + paper_appearances timestamp columns. 
The Paper model uses Python-side default=utcnow() + onupdate=utcnow() (citation.py:102-105), but 0001_initial_schema.py's create_all() path renders tables from the model (no SQL-level default), while the migration-replay path was getting server_default. Two environments, two schemas. Align on the client-side default. B.2 — pop container_title from citation_fields before building the Paper row so the value lives only in the indexed column, not duplicated into the paper_metadata JSON blob. The CSL-JSON exporter already captures the raw value inside citation_fields["csl_json"] during normalize_citation, so bibliography export is unaffected. B.3 — add stale-data-version warning. JOURNAL_DATA_VERSION bumps (v3→v4 in the repository-cap fix) were silently unnoticed by any code path except the admin dashboard banner: _ensure_engine only checked PRAGMA user_version (schema), not version.json (data). The filter hot path served stale scores until a user visited /metrics/journals. Now _warn_on_stale_data_version fires once per engine lifetime at WARNING level — no auto-rebuild (user consent via the dashboard's Download button remains the explicit refresh), just visibility. B.4 — drop idx_paper_appearances_paper from the migration. The model's index=True on PaperAppearance.paper_id is the single source of truth, matching the existing ResearchResource.research_id pattern at research.py:186-189. C.3 — docstring polish + FP-protection comments so future audits don't re-flag these as bugs. * fix(journals): failed-count log + Journal module + name_lower UNIQUE B.5 — save_research_sources now tracks per-source failure count and emits a summary WARNING at end-of-batch when drops occurred. The broad per-source except is intentional (isolation), but previously `saved_count` couldn't distinguish "all saved" from "some silently dropped". C.1 — move `Journal` out of `logs.py` into its own `database/models/journal.py`. 
Re-export from logs.py keeps the existing `from ...database.models.logs import Journal` compat path used by test_schema_stability.py. C.2 — add UNIQUE on `Journal.name_lower`. Two rows with different- cased `name` values (e.g. "Nature Medicine" vs "NATURE MEDICINE") would both pass the existing `name` UNIQUE check while agreeing on `name_lower`, splitting the LLM cache. Narrow but real because the Tier 3.6 LLM-relabel path can produce different casings. Migration 0006 pre-dedupes case-folded `name` collisions BEFORE the batch_alter_table column add — SQLite enforces the new UNIQUE during the table-copy step of batch_alter, so collision cleanup had to happen first. Keep lowest id per group (first-writer-wins); the cache is reproducible. Migration uses SQLAlchemy Core (reflected Tables + sa.select / sa.update / sa.delete) rather than raw sa.text() strings per project preference. * test(downloader): stamp live PID in sentinel fixtures Two disk-check tests broke when the PID-based sentinel liveness check shipped (commit `f4cfc9d25`) — they ``touch()``'d an empty ``.downloading`` file, which the new ``_sentinel_owner_alive`` correctly treats as orphan (empty read_text().strip() fails int() parse → ValueError → orphan). The downloader then reclaims the sentinel and doesn't short-circuit with "already in progress". Fix: add ``_stamp_live_sentinel`` helper that writes the current PID so the liveness check sees an alive owner and the download refuses as expected by the test's assertion. Pre-existing failure, not from this audit's work — spotted while running the broader regression suite. * test(journals): update fixtures for new signatures + safety contract Fixes 12 CI test failures introduced by this audit's changes: - nasa_ads engine tests (2) — updated to expect ``None`` (not the ``"unknown"`` literal) when no pub/bibstem is available. 
The engine now emits None at both ``journal`` and ``journal_ref``; the old sentinel was leaking through the normalizer's container_title fallback and matching a real OpenAlex source named "unknown". - schema parity test (1) — added explicit ``UniqueConstraint(..., name="uq_journals_name_lower")`` in the Journal model's ``__table_args__`` so ``compare_metadata`` sees the same constraint name the migration creates. Without the explicit name, SQLAlchemy auto-generated a different constraint name and ``test_migrations_produce_schema_matching_models`` reported drift. - coverage + tiers tests (~9) — the filter's ``db_ready`` probe was blocking every scoring-path test in CI (no ``journal_quality.db`` file present). Added an autouse fixture to the filters directory's conftest that patches ``Path.exists/stat`` for that specific file so the probe returns True. Individual tests can still override if they want to exercise the pending path. - 2 tests of the old safety-contract inversion — renamed and updated to expect ``filtered`` (predatory-free) instead of the raw input list on filter crash. The S4 fix in this PR's main commits changed that behavior deliberately to prevent predatory re-admission. Merging `main` into the branch picked up 5 unrelated commits; no conflicts. * fix(journals): log dropped DOAJ Seal +1 bump When the LLM score is 8 and the journal has the DOAJ Seal, the +1 bump lands on 9 — which is not in VALID_QUALITY_SCORES {1,4,5,6,7,8,10} — so the bump is dropped. Previously this was silent, hiding the fact that the Seal had no effect on Strong-tier journals. Add a debug log so operators can see the skip, and a regression test locking the behavior in. * fix(journals): clamp echoed dashboard page to total_pages An attacker could request /api/journals?page=10**9 and the route would echo the unbounded page number in the JSON response, making the UI render nonsense pagination state.
SQLite's OFFSET on the indexed ORDER BY caps work at total rows so there is no DoS, but the UX bug is real. Clamp the echoed page at the route layer (no DB-method signature change) and reuse the already-computed total_pages. * chore(hooks): make mode=ro readonly regex case-insensitive SQLite accepts case-insensitive URI parameter values, so mode=RO, mode=Ro, etc. are all valid read-only opens. The pre-commit hook's regex was case-sensitive and would have missed those forms. Add the IGNORECASE flag and cover the new forms with tests. * a11y(journals): add th scope and sr-only labels on dashboard The journal-quality dashboard tables were missing scope="col" on their <th> cells, so screen readers could not announce column context for each data cell. The filter inputs (search box + tier/source selects) also had no associated <label>, leaving them unnamed for assistive tech. Use the existing .sr-only class from styles.css. * chore(templates): use url_for for journals link in metrics.html The sidebar already routes the Journals nav via url_for(). The metrics.html nav bar was the lone outlier with a hardcoded path, which would silently break if the route prefix ever changed. * docs(release): note dashboard 503 and filter-warmup trick on first launch The Journals dashboard page loads fine on a fresh install, but the /api/journals data endpoint returns 503 until the reference DB finishes building. Document the exact response and the warmup tip: kick off a research request in parallel to spawn the background build thread. * test(journals): tier fallthrough and short-circuit regression tests Two new regression tests close gaps in the tier-pipeline coverage: 1. When every tier (predatory, OpenAlex, DOAJ, institution) misses and Tier 4 LLM scoring is disabled, the low-confidence floor should tag the result with score=3 and source='low_confidence'. Guards the only explicit 'no data at all' output path. 2.
When Tier 2 (OpenAlex) produces a score, later tiers (DOAJ, institution salvage) must not run. Asserts call_count==0 on the downstream lookups so any future refactor that accidentally unconditionally calls them is caught. * fix(journal-quality): merge-readiness polish + pytest scheduler teardown - Docstring: DOAJ Seal → 8 (was stale "→ 6") in advanced_search_system/filters/journal_reputation_filter.py. Constants, scoring.py, docs/journal-quality.md, the dashboard template, and tests all already use 8. Closes the outstanding docstring-accuracy thread. - Dashboard: allow `quartile` as a sort column in journal_quality/db.py `_SORT_COLUMNS` allowlist. The clickable "Quartile" header in templates/pages/journal_quality.html silently fell back to sort-by-quality because the backend rejected the column. `quartile` is indexed (models.py:64) and get_journals_page already applies .nulls_last(). - Docs: docs/journal-quality.md says "Analytics → Journals" to match the actual sidebar section (components/sidebar.html:71); release notes were already correct. - CI: drop phantom `journal_data_downloader.py` whitelist entry from .github/scripts/check-file-writes.sh — file does not exist; real path `journal_quality/downloader.py` is already matched on the same line. - Style: collapse redundant `except (ValueError, Exception)` → `except Exception` in Tier 4 of the filter (`ValueError` is a subclass). - Tests: stop BackgroundJobScheduler before dropping its singleton in the `reset_all_singletons` autouse fixture, so the APScheduler thread does not emit to a closed pytest stderr sink during teardown. Fixes the "ValueError: I/O operation on closed file." failure on "All Pytest Tests + Coverage" that this PR's expanded test count reliably reproduces. * fix(migration): NFKC-normalize name_lower; highest-quality wins dedupe The migration backfill and the filter's cache-write paths previously used bare str.lower() while the reference-DB scoring.normalize_name() uses NFKC+lower+strip. 
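To make the mismatch concrete, here is a small sketch of the NFKC + lower + strip recipe next to a bare `str.lower()`. The recipe is the one the message attributes to `scoring.normalize_name()`; this body is a reconstruction from that description rather than the project's code, and the ligature example is chosen for illustration.

```python
import unicodedata


def normalize_name(name):
    """NFKC-fold, lowercase, then trim: the recipe the message names."""
    return unicodedata.normalize("NFKC", name).lower().strip()


# U+FB01 is the single-character "fi" ligature, a compatibility character.
raw = "  Scienti\ufb01c Reports "
bare_key = raw.lower()                # keeps the ligature and the padding
canonical_key = normalize_name(raw)   # folds the ligature to plain "fi"
keys_differ = bare_key != canonical_key
```

Two writers that disagree on which key to store will miss each other's rows, which is the silent cache miss the commit describes.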
For names with Unicode compatibility characters (e.g. "Physics Letters TM"), these produce different name_lower values, causing silent cache misses and — when a normalized form ever meets a bare form — UNIQUE-constraint violations that would abort the upgrade. Also fixes the dedupe tiebreaker: previously picked lowest-id (first- writer-wins), which can discard a quality=9 LLM verdict in favor of an older quality=5 row. Now sorts by -quality (highest first), then id ASC. Changes: - migration 0006: import unicodedata; NFKC-normalize the dedupe grouping key and backfill expression; select quality column and rewrite dedupe sort to prefer highest-quality row with lowest-id tiebreaker. - filter: import normalize_name from journal_quality.scoring; replace three call sites of name.lower() in the cache-write path. - tests: flip assertion in existing dedupe-collision test (now verifies highest quality wins); add NFKC roundtrip test, NFKC-variant dedupe test, downgrade-preserves-data test, and filter NFKC-import guard. * fix(schema): align paper_id + research_resources indexes across migration and model The model declared index=True on PaperAppearance.paper_id (citation.py:159) but the migration never called op.create_index for it, so alembic-upgrade paths had no index while create_all paths did. Similarly, research_id had the opposite asymmetry: migration created ix_research_resources_research_id but the model avoided index=True with a stale comment, leaving create_all paths without the index. Result: 20+ call sites filtering by research_id ran full-table scans on fresh installs / test fixtures. 
Changes: - migration 0006: add explicit op.create_index for ix_paper_appearances_paper_id with _index_exists idempotency guard - research.py: replace stale comment with __table_args__ that declares Index("ix_research_resources_research_id", "research_id") so both paths produce the same named index - tests: assert named paper_id index exists after migration; add create_all coverage test for research_resources index * fix(dashboard): escape data source fields in renderSourcesBanner The template interpolated s.name, s.url, s.license, s.license_url, and s.dataset_url raw into innerHTML; s.description had only a partial "<" escape. DataSource attribute values come from hardcoded Python string literals today, so this is defense-in-depth rather than an exploitable vuln — but any future DataSource subclass whose fields originate from network or DB input would become a stored XSS vector. Changes: - Add safeHref() helper next to escHtml(): allowlists http(s):, mailto:, and rooted paths. .trim() + ^ anchor reject leading-whitespace javascript:/data: bypasses. Returns '#' on failure (never '#...' — fragment-injection vector). - renderSourcesBanner: wrap text interpolations with escHtml(), URL interpolations with safeHref(). Drop the intermediate `desc` variable and its incomplete .replace(/</g, '&lt;') — escHtml() handles all five dangerous characters. - Add a function-level comment establishing the invariant: every s.* field MUST go through escHtml or safeHref. Also documents why the IntegrityError retry branch in research_sources_service (lines 228-268) is not unit-tested: a mock-based approach hits PendingRollbackError before the retry runs, because SQLAlchemy savepoint rollback does not fully reset session state after a constraint violation. A real concurrency test would need threading infrastructure that does not exist in this suite.
* docs(code): annotate known-deferred issues at their sites Adds KNOWN-DEFERRED comments at each site that the 5-round review flagged as lower-priority, so future reviewers understand the reasoning instead of re-investigating: - metrics_routes.py: unbounded SELECT DISTINCT container_title (reject .limit because it silently undercounts predatory journals); MAX journal_quality aggregation semantics (stability-over-freshness by design, not a stale-score bug); DEBUG log left in during development. - citation.py: doi String(255) length rationale (CrossRef recommends <=200; pathological >2000 chars fails insert rather than corrupts); source_engine retained for future per-engine analytics; resource_id UNIQUE semantics (one resource → one paper, intentional). - journal.py: name index=True redundant with unique=True, deferred; name_lower index=True redundant with UNIQUE constraint, deferred; score_source always "llm" today, retained for future multi-source. - journal_quality/models.py: quartile index=True unused today; Institution.impact_factor always NULL from OpenAlex. - 0006 downgrade: uq_journals_name_lower not explicitly dropped — SQLite batch_alter_table rebuilds the table anyway; Postgres would need drop_constraint, tracked as portability follow-up. - constants.py: invariant that score 9 is intentionally absent from VALID_QUALITY_SCORES, paired with a matching note on the dead branch in search_utilities._format_quality_tag. - sidebar.html: aria-label accessibility TODO; added aria-hidden on the icon so this commit actually improves screen-reader output. - docs/journal-quality.md: 212K vs 280K number reconciliation note. * test(migration): update rebuild data-preservation test for NFKC backfill The migration's Step 3 backfill now uses NFKC + lower + strip (see 0006_journal_quality_system.py and `f6cb349a0`). 
The existing test asserted row.name_lower == seed_name.lower(), which is bare lowercase and left surrounding whitespace intact — assertion held for the old buggy behavior. Add a _expected_name_lower helper that mirrors the migration's backfill expression so the assertion locks in NFKC semantics rather than bare .lower(). This is the same invariant tested in test_migration_0006.py::test_backfill_nfkc_roundtrip at a different granularity (100 mixed-Unicode rows through the full migration chain, not a single row through step-3 alone). * test(openalex): expect None for missing venue in _format_work_preview The "unknown" sentinel is intentionally stripped at the engine boundary so it never reaches the citation normalizer or matches a real OpenAlex source named "unknown" (Q1, h_index=5). Tests were stale — update both to match the documented contract. * fix(nasa-ads): preserve "Last, First" author pairs through to CSL normalizer NASA ADS returns each name as "Last, First". The previous code comma-joined them for display, then citation_normalizer split that string back on commas — turning two authors into four literal singletons. Add a structured authors_csl field at the engine boundary and have normalize_citation prefer it over the display string fallback. * fix(institutions): skip malformed JSON lines instead of aborting fetch Mirrors the openalex.py pattern: a single bad line in any partition must not kill the whole monthly rebuild. Wrap json.loads in try/except (json.JSONDecodeError, ValueError); count + log first 10 malformed lines, suppress further warnings; the existing _MIN_INSTITUTIONS floor still aborts if too many records were lost. * fix(metrics): return 400 for non-integer page/per_page params Previously a query like ``?page=abc`` raised ValueError out of the ``int(...)`` calls, which the broad outer except caught and turned into a generic 500. 
Wrap the conversion in a narrow try/except so client mistakes surface as 400 (Bad Request) with a clear message, and keep the outer 500 path for genuine internal errors. * fix(institutions): NFKC-normalize names in build and lookup paths Canonical name normalization is normalize_name (NFKC + lower + strip) in journal_quality/scoring.py — used for sources, predatory tables, and abbreviations. Institutions diverged: bare .lower().strip() was applied symmetrically on both writer and reader sides, so lookups worked for ASCII but Unicode-equivalent inputs (ligatures, fullwidth, NFKD-decomposed accents) silently missed across the index. Replace the bare normalization at every institution writer/reader site with normalize_name() to match the canonical contract. Snapshot rebuild on next data download will re-normalize stored name_lower values; intermediate lookups remain symmetric. * test(quality): make test_orm_imports_used assert; clarify mock test docstring test_orm_imports_used previously only printed a count and never asserted — a phantom test that could never fail. Add a sanity check that DB-operation patterns still match anything, plus an 80% ratio guard so a regression where files stop using the ORM would surface. Also clarify test_save_research_sources_success: the 1:1 add-count holds only for non-academic URLs (the test inputs). Academic sources trigger a 3:1 add ratio (ResearchResource + Paper + PaperAppearance); that path is integration-tested in test_paper_dedup_integration.py. * fix(ui): use local escape helpers consistently in details.js and journal_quality.html details.js defines escapeHtml/escapeHtmlFallback as a closure at the top of the file, then ignores it 130 lines down by using ``window.escapeHtml ? window.escapeHtml(x) : x`` ternaries. The intent was a fallback when the global helper hasn't loaded — but the local closure already provides that fallback, so the ternary's else-branch silently emits unescaped HTML when window.escapeHtml is missing. 
Switch to the local escapeHtml so escaping is unconditional. journal_quality.html: ``${t.label}`` interpolated into innerHTML without escHtml. Numeric today, but the explicit escHtml(String(...)) contract guards against future API changes that emit a string field under the same name. * chore(ci): align journal-data-integration action pins with rest of repo The new workflow pinned harden-runner@v2.16.0 and setup-pdm@v4.4 while every other workflow in the repo uses v2.17.0 / v4.5. Align both pins so the audit trail across the 50+ workflows stays consistent and the new workflow picks up the same upstream fixes. * fix(quality): narrow LLM exception handling and add predatory min-record floor __llm_clean_journal_name caught bare Exception and logged at DEBUG — silently absorbed every failure including programming errors that deserve a stack trace. Narrow to the recoverable network/parse errors (ConnectionError, TimeoutError, ValueError) and surface them at WARNING so they're visible during triage. Log the exception class name only (not the message) to satisfy the sensitive-logging hook. Predatory data source previously wrote whatever it fetched, even if the upstream returned 0 rows on 2 of the 3 CSVs. That silently disabled predatory filtering for everyone. Add a 100-entry floor that raises before overwriting the on-disk snapshot — the previous good build stays in place when the upstream is partially broken. * feat(papers): promote publication year to indexed first-class column Year is a natural filter/group axis for the journal dashboard — "papers in journal X from 2020-2024" — but living inside the metadata JSON blob meant every such query paid for json_extract on every row and could not use an index. Migration 0006: papers.year INTEGER NULL + idx_papers_year added at table-creation time. No in-place upgrade branch for pre-release installs — keeps the migration simple; a fresh install or clean re-stamp reaches the right schema. 
Model: Paper.year declared alongside the other indexed columns; kept ALSO in paper_metadata JSON so the CSL-JSON blob stays complete and existing JSON readers keep working. Write path: save_research_sources now copies citation_fields["year"] into indexed["year"] (column) while leaving the original in the metadata blob. _merge_identifiers uses the same first-write-wins semantics already applied to doi/arxiv_id/pmid. Dashboard: per-research and user-aggregate journal endpoints now return year_min/year_max per journal (MIN/MAX over Paper.year), and the per-research table gains a "Years" column rendering "2020–2024" or "2023" or "—". * fix(search): run OpenAlex enrichment before preview filters so Tier 2 can use source_id The JournalReputationFilter is registered as a preview filter on every scientific engine (arxiv, pubmed, openalex, nasa_ads, semantic_scholar) and uses result["openalex_source_id"] for Tier 2 journal lookups (filter.py:868). Previously enrich_results_with_source_ids ran AFTER _get_full_content — after the preview filters had already fired with empty source_ids. Tier 2 silently degraded to fragile name matching. Move the enrichment step between _get_previews and the preview filter loop so the field is populated by the time the filter reads it. Non-scientific engines still skip the enrichment entirely. * docs(filter): clarify __clean_journal_name is regex-only, not LLM (djpetti review) The prior docstring read "Uses regex ... followed by JabRef abbreviation expansion ... the expensive Tier 4 LLM result is cached at the DB layer instead" which implied this method coordinates with or includes the LLM path. It does not — __llm_clean_journal_name is a separate salvage step invoked only when bundled tiers miss and enable_llm_scoring is on. Update the docstring to state explicitly: this method is regex-only and returns unexpanded abbreviations / location suffixes unchanged; the LLM path is separate and opt-in. 
* refactor(data-sources): extract shared manifest iteration helper (djpetti review) openalex.py and institutions.py both download OpenAlex S3 snapshots and shared identical code for: - manifest URL allowlist validation - per-partition tmp-file download + cleanup lifecycle - per-line malformed-JSON suppression (first-10 warnings + 1 notice) Centralize in _openalex_common.py via ``validate_manifest_entries`` and ``iter_partitions`` so the two callers stay aligned (and can't drift) on the lifecycle and suppression policies. Each caller still owns its own record-handling logic, progress reporting, and per- source floor checks — those have caller-specific state that doesn't belong in the helper. Adds tests/journal_quality/test_openalex_common.py covering the helper directly (allowlist accept/reject, per-partition yields, tmp cleanup on happy path AND on exception, malformed-line suppression). * docs(quality): record design decisions for predatory threshold and CodeQL-reviewed sites Three places attracted repeat attention during PR #3081 review but landed with "keep as-is" decisions. Drop a comment at each site so future reviewers (human or AI) don't re-derive the same conclusion. 1. PREDATORY_WHITELIST_HINDEX (constants.py): h-index is not an evidence-based predatory signal per mBio 2019 / PMC 2020. Tuning the `>` / 10 boundary changes behavior only at the boundary and has no literature support. Real improvement is more signals (JCR, OASPA), not this constant. 2. _normalize_doi (openalex_enrichment.py): the anchored ``startswith`` pattern is the CodeQL-recommended mitigation for py/incomplete-url-substring-sanitization. A prior bot comment (alert 7635) against an older snapshot is no longer raised; refactoring to bare-first is equivalent for every URL shape OpenAlex actually returns. 3. Journal-download success response (metrics_routes.py): ``message`` is trace-free by construction (downloader.py guarantees class-name-only for exception derivatives). 
CodeQL alerts 7650/7684 cited by a stale bot comment are no longer raised; replacing with a fixed literal would regress the dashboard popup which renders the per-source counts verbatim. * docs(ci): document why journal_quality is in check-file-writes allowlist Adds a block comment above the allowlist regex explaining that each entry writes to disk without encryption by design, what kind of data it writes, and the rule for adding new entries (public data, not user-specific, justification required). * refactor(citation): drop Paper.journal_quality; resolve quality live A frozen per-paper Tier 4 score creates a real staleness footgun: if the LLM re-scores a journal later (new model, manual override, bug fix) the per-paper snapshot goes stale and the only way to fix it is to delete and re-ingest. Resolve current quality live in the dashboard: - Tier 4: batch-look up the user's journals.quality by NFKC-normalized container_title after the container_title GROUP BY aggregation. - Tier 1-3: bundled reference DB (unchanged path). The papers table is brand new in migration 0006, so we remove the column from the migration and model rather than creating it and dropping it later. Inline comments in both files document the deliberate absence so the column isn't re-introduced. User-visible behavior: unchanged. The UI only ever shows a single resolved quality — it never distinguished frozen vs live. Responses still emit "quality" and "score_source" with labels llm/openalex/doaj. Tests: removes three Paper.journal_quality persistence tests in TestMergeIdentifiersJournalColumns (first-write-wins no longer applies to a column that doesn't exist), renames the column-nullable migration test to test_container_title_nullable, and adds a test_papers_has_no_journal_quality_column regression guard. --------- Co-authored-by: Daniel Petti <djpetti@gmail.com>	2026-04-20 23:28:03 +02:00
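The NFKC + lower + strip normalization that this commit applies to institution names (and to the migration backfill) can be sketched as follows. This is a minimal illustration under assumptions: the commit names `normalize_name` in journal_quality/scoring.py, but the body below is reconstructed from the commit text, and `bare_normalize` is a hypothetical name for the older `.lower().strip()` behavior.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Assumed shape of the canonical normalizer: NFKC, lowercase, strip."""
    return unicodedata.normalize("NFKC", name).lower().strip()


def bare_normalize(name: str) -> str:
    """Hypothetical stand-in for the older bare normalization."""
    return name.lower().strip()


# Compatibility forms that NFKC folds but bare lowercasing does not:
ligature = " Scienti\ufb01c Reports "                  # U+FB01 is the "fi" ligature
fullwidth = "\uff2e\uff21\uff34\uff35\uff32\uff25"      # fullwidth "NATURE"
decomposed = "Cafe\u0301"                               # "e" + combining acute accent

print(normalize_name(ligature))
print(normalize_name(fullwidth))
print(normalize_name(decomposed) == normalize_name("Caf\u00e9"))
```

With bare lowercasing, the ligature and fullwidth inputs keep their compatibility code points, which is why lookups keyed on ASCII spellings would miss them.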
LearningCircuit	bab0f61b66	chore(hooks): require UtcDateTime in migrations too (#3523 ) Tighten check-datetime-timezone so the UtcDateTime rule applies to both models and migrations. Supersedes the inverted approach in #3515, which tried to accept sa.DateTime(timezone=True) inside migrations. - Rewrite the AST walker: handle sa.Column / bare Column, positional type arg at any index, bare Column(UtcDateTime) without parens (the hook's own example), and ast.IfExp with both branches inspected independently so a violation in either arm is still flagged. - Anchor the path filter on src/local_deep_research/ to stop false-positives on tests/database/models/ and partial-name matches like database/models_backup/. - Update .pre-commit-config.yaml name/description and the stale CI_CD_INFRASTRUCTURE.md hook table entry. - Add tests/hooks/test_check_datetime_timezone.py with 20 cases: violations (models / migrations / conditional types / batch runs / bare names), allows (UtcDateTime with import, combo import order, empty / syntax-error files), and path-filter boundaries.	2026-04-18 21:47:17 +02:00
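The AST walk described in this commit (matching `sa.Column` and bare `Column`, with the type argument at any positional index) might look roughly like the sketch below. It is a simplified illustration, not the hook's code: `find_datetime_columns` is a hypothetical name, and the `ast.IfExp` handling, import checks, and path filter from the commit are omitted.

```python
import ast


def _simple_name(node: ast.AST):
    """Return `x` for a Name `x`, or `attr` for an Attribute `mod.attr`."""
    if isinstance(node, ast.Name):
        return node.id
    return getattr(node, "attr", None)


def find_datetime_columns(source: str) -> list[int]:
    """Line numbers of Column(...) calls that pass DateTime as a type."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if _simple_name(node.func) != "Column":
            continue
        for arg in node.args:  # the type may sit at any positional index
            # Unwrap `DateTime(timezone=True)` down to its callee.
            target = arg.func if isinstance(arg, ast.Call) else arg
            if _simple_name(target) == "DateTime":
                hits.append(node.lineno)
    return hits


sample = (
    "import sqlalchemy as sa\n"
    "created = sa.Column('created', sa.DateTime(timezone=True))\n"
    "ok = Column(UtcDateTime)\n"
    "bad = Column(DateTime)\n"
)
print(find_datetime_columns(sample))
```

Parsing instead of regex matching is what lets the check see through aliases such as `sa.` and through keyword arguments on the type call.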
LearningCircuit	12160e26e1	chore(lint): add ruff rules for logging, performance, exceptions, and print detection (#3211 ) * chore(lint): add ruff rules for logging, performance, exceptions, and print detection Add wave 2 lint rules: G, PERF, RET, TRY, T20, C4, ERA. All existing violations are suppressed via ignore/per-file-ignores so this config change is merge-safe. Follow-up PRs will fix violations and remove the ignore entries incrementally. * fix(lint): exempt pre-commit hooks from T201 print rule (#3270) Pre-commit hooks are CLI scripts where print is the intended output interface, same as scripts/ and cli/ directories already exempted. * fix(lint): fix all low-count ruff violations instead of suppressing them (#3275) * fix(lint): replace manual dict-building loops with dict comprehensions (PERF403) * fix(lint): replace bare Exception raises with specific built-in types (TRY002) Replace all `raise Exception(...)` in production code with appropriate built-in exception types: RuntimeError for operational/state failures, ValueError for invalid data, and ConnectionError for HTTP errors. * fix(lint): resolve TRY004 and PERF402 ruff violations Use TypeError instead of ValueError for isinstance/issubclass type checks (TRY004), and replace manual for-loop list copies with list.extend() (PERF402). 
* fix(lint): fix all low-count ruff violations instead of suppressing them Fix all violations for 15 ruff rules that had ≤10 occurrences each, rather than suppressing them with ignore directives: - TRY002: raise-vanilla-class → use specific built-in exceptions - TRY004: type-check-without-type-error → use TypeError - C408: unnecessary-collection-call → use dict/list literals - C401: unnecessary-generator-set → use set comprehensions - C416: unnecessary-comprehension → use list()/set() - C414: unnecessary-double-cast-or-process → simplify - PERF403: manual-dict-comprehension → use dict comprehensions - PERF102: incorrect-dict-iterator → use .values()/.keys() - PERF402: manual-list-copy → use list.extend() - RET503/RET506/RET507/RET508: superfluous else after return/raise/continue/break - RET501/RET502: unnecessary/implicit return None Adds per-file-ignores for tests/ and examples/ where these patterns are acceptable (e.g. bare Exception in tests, dict() calls in fixtures). * fix(lint): enforce E722, ERA001, RET505 and fix pre-commit RET503 gap (#3276) Remove three rules from the global ignore list by fixing all violations: E722 (bare except) — 6 violations in tests: Replace `except:` with `except Exception:` to avoid swallowing KeyboardInterrupt and SystemExit. ERA001 (commented-out code) — 25 violations: Delete 18 true positives (dead variables, disabled debug logs, commented-out imports). Add `# noqa: ERA001` to 7 false positives (template instructions, type annotations, documentation comments). RET505 (superfluous else after return) — 413 violations: Auto-fix all occurrences. Also fixes 5 cascading RET506/RET507 violations exposed by the RET505 removals. Pre-commit hooks gap: Add RET503 to `.pre-commit-hooks/*` per-file-ignores alongside T201. 
fix(lint): enforce RET504 and TRY301 — fix all violations (#3279) * fix(lint): enforce RET504 — collapse unnecessary assign-before-return Auto-fix all 46 RET504 violations via ruff unsafe-fixes: collapse `result = expr; return result` into `return expr`. Remove RET504 from global ignore list. Add to tests/examples per-file-ignores where intermediate variables aid test clarity. Also removes TRY301 from global ignore (violations fixed in next commit). * fix(lint): enforce TRY301 — fix raises inside broad try/except blocks Structural fixes for 65 TRY301 violations: Security-critical fixes: - url_validator.py: move 6 validation raises before try block, replace isinstance-based re-raise with specific except clause - path_validator.py: move validation outside try block - env_settings.py: separate parsing (try) from validation (outside) Route/service fixes: - research_routes.py: replace raise-then-catch with direct error return - mcp/server.py: move all 7 tool validations before try blocks - news/api.py: move validation before try, noqa for db-session raises - notifications: move rate limit and URL validation before try blocks - iterative_refinement_strategy.py: move JSON validation after try Added noqa for intentional patterns: re-raise in except handlers, nested function definitions, db-session-dependent checks, rate limit re-raises for base class retry logic. 
* merge: resolve conflicts between wave2 lint branch and main Resolve 14 merge conflicts by always starting from main's version and re-applying lint fixes on top: - mcp_strategy.py, ollama.py, security_settings.py, delete_routes.py: Take main's code, re-apply RET505 (remove else: after return) - mcp/server.py (3 conflicts): Take main's ValidationError handlers and set_settings_context, re-apply TRY301 fixes, fix sensitive data logging - research_routes.py: Take main, fix duplicate block (merge artifact) - settings_routes.py: Take main's default-settings fallback feature - meta_search_engine.py, parallel_search_engine.py: Take main's get_available_engines delegation, delete unreachable code - search_engine_ddg.py, search_engine_google_pse.py: Take main's sanitization, re-apply RET506 (if not elif after raise) - rag_routes.py: Accept main's deletion (route moved to delete_routes) - encryption_check.py: Accept main's deletion (dead code) - test_storage_coverage.py: Remove broken test classes referencing undefined stubs - pre-commit hooks: extend per-file-ignores for ERA001, RET504 * fix: revert ValueError→TypeError changes that break tests and API contracts Revert TRY004 fixes in 3 files where changing ValueError to TypeError would break existing tests and HTTP status code contracts: - card_factory.py: 5 tests assert pytest.raises(ValueError) - base_rater.py: flask_api.py catches ValueError for HTTP 400 responses; TypeError would fall through to HTTP 500 - full_search.py: test asserts pytest.raises(ValueError) Add # noqa: TRY004 to suppress the lint rule on these lines. * fix: move benchmark_data check back inside try block The ValueError for missing benchmark_data must be inside the try/except so the except handler can mark the run as FAILED in the database. Without this, the exception propagates unhandled in a daemon thread, leaving the benchmark run stuck in RUNNING state permanently. 
* chore(lint): remove ERA rule and suppress TRY004 globally Remove ERA (eradicate — commented-out code detection) from ruff select: - 28% false positive rate in our codebase (7 of 25 violations) - No major Python project enables it (Django, FastAPI, Pydantic, Airflow) - Ruff itself doesn't use it; autofix was demoted to manual-only - 172 noqa suppressions provided zero enforcement value Suppress TRY004 (type-check-without-type-error) globally: - Ruff maintainer agreed the autofix "can change functionality" - We already had to revert 3 TypeError changes that broke tests and HTTP 400→500 API contracts - Django, Flask, pandas all use isinstance + ValueError routinely - Pylint has no equivalent rule; near-zero PyPI adoption Remove all 173 # noqa: ERA001 and 49 # noqa: TRY004 comments from the codebase — no longer needed with rules disabled/suppressed. * fix: resolve mypy errors, failing MCP test, and TRY301 noqa - search_engine_factory.py: restore typed intermediate variable to fix mypy no-any-return (RET504 collapse lost the type annotation) - search_engine_pubchem.py: add explicit list[str] type annotation - test_edge_cases.py: fix assertion that expected engine name in sanitized error message - mcp/server.py: add noqa: TRY301 to validation raises inside try blocks (from main's new merge code)	2026-03-29 17:01:23 +02:00
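Two of the structural patterns this commit series enforces can be illustrated with a small sketch: keeping validation raises outside the `try` block (the TRY301 fixes, "separate parsing (try) from validation (outside)") and dropping a superfluous `else` after `return` (RET505). The functions `parse_port` and `classify_port` are hypothetical examples, not code from the repository.

```python
def parse_port(raw: str) -> int:
    """Parse inside the try; validate outside it."""
    try:
        port = int(raw)  # only the parsing step can fail here
    except ValueError:
        raise ValueError(f"not an integer: {raw!r}") from None
    # Validation lives outside the try, so this raise cannot be
    # swallowed by the handler above.
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def classify_port(port: int) -> str:
    """No `else:` branch is needed after a `return`."""
    if port < 1024:
        return "privileged"
    return "unprivileged"


print(parse_port("8080"), classify_port(8080))
```

The later revert in this series is a reminder that such rewrites are not always behavior-neutral: a check that must be handled by a surrounding `except` (for example, to mark a run as FAILED) has to stay inside the `try`.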
LearningCircuit	b28c80466c	refactor: cleanup remaining verified dead code across 5 areas (#3263 ) * refactor: cleanup remaining verified dead code across 5 areas Dead templates, functions, storage ABCs, eslint duplicate, dev scripts. All verified by 40 agents (20 scanning + 20 verification). * revert: keep 3 dev scripts that have active references - regenerate_golden_master.py: called by pre-commit hook .pre-commit-hooks/check-golden-master-settings.py - restart_server.sh: documented in API testing guide, examples, and multiple README files - run_tests.py: referenced in CONTRIBUTING.md testing section Added inline comments noting the references so future cleanup attempts don't remove them without updating dependents. * revert: keep restart_server_debug.sh dev script * revert: keep debug_pytest.py and stop_server.sh dev scripts Small utility scripts that cost nothing to keep and are useful for developers debugging CI failures and managing the dev server. * docs: add do-not-remove comments to all dev scripts Each script now documents why it must be kept: - regenerate_golden_master.py: pre-commit hook dependency - restart_server.sh: documented in API guides and examples - restart_server_debug.sh: companion to restart_server.sh - run_tests.py: referenced in CONTRIBUTING.md - debug_pytest.py: developer utility for CI failure reproduction - stop_server.sh: companion to restart_server.sh	2026-03-28 16:03:21 +01:00
LearningCircuit	9988f70318	refactor: remove fallback LLM (FakeListChatModel) from all providers (#2717 ) * cleanup: remove @pytest.mark.requires_llm decorators and fallback LLM doc references Remove the `@pytest.mark.requires_llm` decorator from all test files since the fallback LLM infrastructure is being removed. Update docs to remove references to `LDR_TESTING_USE_FALLBACK_LLM` and `LDR_USE_FALLBACK_LLM` environment variables from troubleshooting and CI configuration tables. * test: remove fallback LLM references from test files Remove all fallback-related test code: TestGetFallbackModel classes, FakeListChatModel assertions, check_fallback_llm parameters, and LDR_USE_FALLBACK_LLM skipif markers. Replace fallback-returning tests with ValueError-expecting tests for missing API keys and unavailable providers. * cleanup: remove remaining use_fallback_llm references from source and tests Remove use_fallback_llm() imports and calls from db_utils.py and rate_limiting/tracker.py. Clean up test files that referenced check_fallback_llm, get_llm_setting_from_snapshot, and LDR_USE_FALLBACK_LLM env var. * cleanup: remove remaining fallback LLM references from test files Remove all use_fallback_llm mocks, LDR_USE_FALLBACK_LLM env var checks, and related skip logic from test files since the fallback LLM feature has been removed from source code. 
- test_db_utils.py: Remove use_fallback_llm mock patches from 4 tests - test_rate_limiter.py: Replace use_fallback_llm mock with is_ci_environment - test_tracker.py: Replace fallback mode test with CI mode test - test_tracker_quality_stats.py: Remove 8 use_fallback_llm decorators - test_openai_api_key_usage.py: Remove LDR_USE_FALLBACK_LLM skipif - test_llm_provider_integration.py: Remove LDR_USE_FALLBACK_LLM skipif - test_ci_config.py: Remove LDR_USE_FALLBACK_LLM env var setting - test_search_system.py: Remove LDR_USE_FALLBACK_LLM skipif - run_all_tests.py: Remove LDR_USE_FALLBACK_LLM log line - test_env_auto_generation.py: Remove testing.use_fallback_llm mapping - test_lmstudio_provider.py: Fix docstring referencing removed function * refactor: remove fallback LLM from providers, settings, CI, and tests - Remove FakeListChatModel import and get_llm_setting_from_snapshot wrapper - Update all provider imports to use get_setting_from_snapshot directly - Remove LDR_USE_FALLBACK_LLM env var from CI workflows - Remove use_fallback_llm setting and registry function - Remove skip_if_using_fallback_llm fixture from conftest.py - Update tests to expect ValueError instead of fallback model * refactor: remove fallback model from llm_config and thread_settings - Remove get_fallback_model() and all call sites in get_llm() - Replace fallback returns with descriptive ValueError raises - Remove LDR_USE_FALLBACK_LLM env check block from get_llm() - Remove check_fallback_llm parameter from get_setting_from_snapshot - Remove get_llm_setting_from_snapshot convenience wrapper - Add ValueError re-raise in Ollama model-not-found path - Regenerate golden master with ensure_ascii=False for proper Unicode * fix: restore requires_llm skip mechanism and fix CI test failures Three fixes for CI regressions from fallback LLM removal: 1. Restore @pytest.mark.requires_llm decorator and skip fixture (skip_if_no_real_llm) that checks LDR_TESTING_WITH_MOCKS env var. 
Re-add decorators to 17+ tests across 9 files that need real LLMs. 2. Fix type coercion in test_openai_api_key_usage.py by converting fixture from dict format to simplified raw-value format, bypassing get_typed_setting_value string coercion. 3. Fix golden master format mismatch by adding ensure_ascii=False to test serialization to match regeneration script. Narrow pre-commit hook trigger to only defaults/.json files. fix: remove remaining fallback LLM references from coverage tests - Delete TestGetFallbackModel class from test_llm_config_coverage.py (5 tests that imported removed get_fallback_model) - Update test_llm_config_missing_coverage.py: 6 tests that expected FakeListChatModel fallback now expect ValueError/exception raises - Remove use_fallback_llm mocks from test_rate_limiting_tracker_coverage.py (delete 4 fallback-specific tests, fix 9 tests) - Remove use_fallback_llm mocks from rate_limiting/test_tracker_coverage.py (fix _make_tracker helper and 25 tests) - Add @pytest.mark.requires_llm to test_analyze_documents_minimal - Merge upstream main to pick up new coverage test files * fix: remove dead LDR_USE_FALLBACK_LLM env var from accessibility tests CI This env var was added to the accessibility test server but has no effect since the fallback LLM code was removed. * fix: align pre-commit hook description and error listing with defaults-only trigger The hook file pattern was narrowed to defaults/ only, but the description and error-listing code still referenced config/. Remove dead config/ path from the file listing and update messaging to match. * fix: update test_llm_config_deep_coverage.py for fallback LLM removal File was added on main after branch diverged. Remove TestGetLlmFallbackEnvVar class (tests removed functionality) and update test_provider_lowercased to expect ValueError instead of fallback model. 
* fix: improve "none" provider error message and fix stale CI-mode test - Add explicit handler for provider="none" with user-friendly message instead of misleading "this is a bug" error - Fix test_load_estimates_skipped_in_ci_mode: _load_estimates no longer checks is_ci_environment, test now correctly verifies deferred loading behavior in non-programmatic mode - Update 4 test assertions to match new "none" provider error message	2026-03-20 13:24:59 +01:00
LearningCircuit	b524bd9a45	fix: debug logging now visible on stderr when LDR_APP_DEBUG=true (#2761) * fix: debug logging now visible on stderr when LDR_APP_DEBUG=true config_logger() had stderr hardcoded to INFO level regardless of the debug flag: only diagnose= was toggled, not the log level itself. DEBUG entries went to the DB sink but never to the console, making LDR_APP_DEBUG ineffective for local debugging via log files. Also adds restart_server_debug.sh for convenient debug-mode startup with LDR_APP_DEBUG=true and LDR_LOG_SETTINGS=summary. * fix: log warning when debug mode is active Emits a WARNING-level message on startup when LDR_APP_DEBUG=true so it's immediately visible in the logs that sensitive data may be logged.	2026-03-15 17:24:31 +01:00
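The fix above amounts to deriving the stderr level from the debug flag instead of hardcoding INFO. The sketch below shows the idea with the standard library `logging` module, which is a swapped-in stand-in: the project's `config_logger()` is not reproduced here, and `make_stderr_handler` is a hypothetical name.

```python
import logging
import sys


def make_stderr_handler(debug: bool) -> logging.Handler:
    """Build a stderr handler whose level follows the debug flag."""
    handler = logging.StreamHandler(sys.stderr)
    # The bug described above was equivalent to always using INFO here.
    handler.setLevel(logging.DEBUG if debug else logging.INFO)
    return handler


logger = logging.getLogger("ldr-sketch")
logger.setLevel(logging.DEBUG)  # let the handler decide what is shown
logger.addHandler(make_stderr_handler(debug=True))
logger.warning("debug mode is active; sensitive data may be logged")
logger.debug("now visible on stderr")
```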
LearningCircuit	09f306f0c1	fix: set HOME=/home/ldruser in entrypoint before dropping to non-root (#2520) setpriv changes UID/GID but does not update HOME. Without this, HOME stays as /root/ and platformdirs resolves data paths to /root/.local/share/local-deep-research/ which ldruser cannot write to. This causes PermissionError on startup when LDR_DATA_DIR is not explicitly set (e.g. in the docker-multiarch-test workflow). The Dockerfile already uses this pattern during build (line 246) but the entrypoint was missing it.	2026-03-02 19:48:02 +01:00
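Why a stale HOME leads to an unwritable data path can be shown with a small sketch. It assumes an XDG-style lookup on Linux (XDG_DATA_HOME if set, otherwise $HOME/.local/share), which is a simplification of what platformdirs does; `default_data_dir` is a hypothetical helper, not a platformdirs function.

```python
import posixpath
from pathlib import PurePosixPath


def default_data_dir(env: dict) -> PurePosixPath:
    """Simplified XDG-style data directory resolution for Linux."""
    base = env.get("XDG_DATA_HOME") or posixpath.join(
        env["HOME"], ".local", "share"
    )
    return PurePosixPath(base) / "local-deep-research"


# After dropping privileges without updating HOME, the path still
# points into root's home directory:
print(default_data_dir({"HOME": "/root"}))
# With HOME set to the unprivileged user's home, it becomes writable:
print(default_data_dir({"HOME": "/home/ldruser"}))
```

Setting LDR_DATA_DIR explicitly sidesteps this lookup, which matches the commit's note that the failure only appears when that variable is unset.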
LearningCircuit	04a55f106f	security: replace gosu with setpriv and suppress 8 unfixable CVEs (#2501) Replace gosu (Go binary) with setpriv (util-linux, already in base image) for privilege dropping in the container entrypoint. This eliminates 7 Go stdlib CVEs (CVE-2025-4674, CVE-2025-61732, CVE-2025-61731, CVE-2025-47907, CVE-2025-61729, CVE-2025-58187, CVE-2025-58188) by removing the only Go binary from the image. For the remaining 8 CVEs that are unfixable in Debian Trixie (libtiff6, coreutils, libc6, Chrome DevTools), add documented suppressions to both .grype.yaml (new) and .trivyignore with review date 2026-09-01. Also updates the base image digest to pick up latest security patches, and bumps Playwright from 1.57.0 to 1.58.0 (matching pyproject.toml) with the corresponding chromium-1208 revision.	2026-03-01 23:37:26 +01:00
LearningCircuit	d93adba1cd	feat: add one-command golden master regeneration script (#2475) Adds scripts/dev/regenerate_golden_master.py that regenerates the golden master settings snapshot in a single command, replacing the previous 3-step process (delete → pytest → stage). Updates the pre-commit hook message to reference the new script.	2026-02-28 15:45:04 +01:00
LearningCircuit	951a97f375	fix(docker): add diagnostic error message when gosu fails in LXC (#2373) If gosu can't switch users (e.g. LXC blocks CAP_SETUID/CAP_SETGID), print a clear error message with actionable fix instructions instead of gosu's cryptic "operation not permitted" error.	2026-02-22 22:26:15 +01:00
LearningCircuit	20fedc67b1	docs: add config docs generator script (#2134 ) * docs: add config docs generator script Add scripts/generate_config_docs.py that auto-generates docs/CONFIGURATION.md from default settings JSON files and env_definitions/ modules. Supports both database-managed settings and pre-database env-only settings. Extracted from PR #1393. Co-authored-by: daryltucker <daryltucker@users.noreply.github.com> * docs: improve config docs generator with auto-discovery, --check mode, and CI - Auto-discover env_definitions modules instead of hardcoding filenames - Extract additional AST fields: required, min/max_value, allowed_values, deprecated_env_var - Expand env-only settings table with Type, Required, Constraints, Deprecated Alias columns - Add --check mode (exit 1 when docs are stale) for CI validation - Add inline gitleaks:allow on key extraction line - Generate initial docs/CONFIGURATION.md covering all 18 JSON files and 5 env_definitions modules - Add check-config-docs.yml PR workflow (zero deps, stdlib only) - Add docs regeneration step to version_check.yml - Allowlist docs/CONFIGURATION.md in .gitleaks.toml (references env var names, not actual secrets) - Add comprehensive tests (27 tests: unit, integration, check mode, error handling) * docs: add CONFIGURATION.md references to README, env_configuration, and developing guides * docs: regenerate CONFIGURATION.md after merge with main Picks up db_config.cipher_memory_security default change (OFF -> ON). --------- Co-authored-by: daryltucker <daryltucker@users.noreply.github.com>	2026-02-22 21:06:27 +01:00
LearningCircuit	07ff140c16	security: Docker hardening and session/debug setting tightening Docker hardening: - Add no-new-privileges and cap_drop ALL to main LDR service - Add no-new-privileges to ollama service - Mount local_collections volumes as read-only (:ro) - Validate model name in ollama_entrypoint.sh to prevent injection - Add security warning to elasticsearch example about disabled xpack Application settings: - Make app.debug non-editable via UI to prevent enabling debug mode in production (can still be set via environment variable) - Reduce remember-me max from 90 to 30 days and default from 30 to 7 days to limit session persistence window	2026-02-04 19:42:30 +01:00
LearningCircuit	f848e8b0c2	ci: Add MCP server tests workflow (#1506 ) * ci: Add MCP server tests workflow Add dedicated CI workflow for testing the MCP (Model Context Protocol) server implementation. This workflow: - Runs on changes to MCP-related files - Verifies MCP module loads correctly - Tests discovery tools (list_strategies, list_search_engines, get_configuration) - Runs full MCP unit test suite with mocks - Tests MCP strategy (ReAct pattern) implementation - Verifies server startup behavior The tests use mocks to avoid requiring an LLM backend, making them fast and reliable in CI environments. Prepares CI infrastructure for PR #1366 (MCP server feature). * refactor: Move MCP smoke tests to external script Address review feedback from djpetti: - Extract MCP module loading test to scripts/mcp_smoke_test.sh - Extract MCP server startup test to the same script - Update workflow to call the external script - Add script path to workflow triggers * ci: skip MCP tests when server module not implemented The MCP server module (src/local_deep_research/mcp/server.py) is in a separate feature branch. This change makes the MCP test workflow skip gracefully when the module doesn't exist, with a clear notice. Tests will automatically run once the MCP feature branch is merged.	2026-01-31 23:12:11 +00:00
LearningCircuit	7be59b4dc0	refactor: Address PR review feedback (#1570) 1. Move inline DB init script to external file (scripts/ci/init_test_database.py) for better maintainability per djpetti's suggestion. 2. Fail fast in CI when pre-created user login fails instead of falling back to slow registration. This makes debugging easier - if the CI user doesn't work, something is wrong with workflow setup and should be fixed there. Per djpetti's suggestion about developer experience. The external script is now shared between critical-ui-tests.yml and extended-ui-tests.yml, reducing duplication. Co-authored-by: Daniel Petti <djpetti@gmail.com>	2026-01-18 16:14:24 -05:00
LearningCircuit	804cd923fd	fix: Complete RAG Docker cache path fix and bump version to 1.3.24 Fixes remaining RAG cache path issues not addressed in #1563: 1. library_rag_service.py: Changed from Path.home() / ".cache" to get_cache_directory() / "rag_indices" to respect LDR_DATA_DIR 2. docker-compose.yml: Fixed volume mount from /root/.cache/... to /data/cache/rag_indices (app runs as ldruser, not root) 3. ldr_entrypoint.sh: Added rag_indices directory creation with proper permissions The original fix (#1563) addressed search_engine_local.py but missed library_rag_service.py which still used hardcoded Path.home()/.cache. Fixes issue reported on Discord where RAG indexing failed with: - PermissionError: [Errno 13] Permission denied: '.cache'	2026-01-03 15:14:55 +01:00
LearningCircuit	acc5b585f0	Merge pull request #1191 from LearningCircuit/feat/e2e-research-test feat: add E2E research test via PR label trigger	2025-12-04 09:06:14 +01:00
LearningCircuit	57cda9babb	fix: remove debug output that was corrupting JSON response	2025-12-03 11:51:12 +01:00
LearningCircuit	cf6d9f6b2c	Merge dev into sync-main-to-dev Resolved version conflict - keeping dev version 1.3.0	2025-12-03 10:59:06 +01:00
LearningCircuit	1fe134588d	debug: add logging and explicitly pass search_tool parameter	2025-12-03 10:58:57 +01:00
LearningCircuit	c43941dda0	fix: use correct Serper API key settings path in E2E script The script was setting the API key at 'search.serper.api_key' instead of the correct path 'search.engine.web.serper.api_key'. This caused the search engine factory to fail finding the key, falling back to other engines instead of using Google search via Serper.	2025-12-03 00:24:41 +01:00
tombii	08f8a733a2	Fix matplotlib cache directory permissions Matplotlib requires a writable cache directory at ~/.config/matplotlib. The entrypoint now creates this directory with proper ownership for ldruser before starting the application. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-02 14:20:27 +01:00
tombii	76247ff048	Fix Docker volume permissions error for /data directory Fixes PermissionError when container tries to create /data/logs and other subdirectories. Docker named volumes are created with root ownership, but the application runs as ldruser (UID 1000). Changes: - Add entrypoint script (ldr_entrypoint.sh) to handle volume setup - Install gosu for safe privilege dropping - Create required subdirectories with correct ownership - Use 700 permissions for security (owner-only access) - Remove USER directive (entrypoint handles user switching) The entrypoint runs as root to fix permissions, then drops to ldruser before starting the application. This is the standard Docker pattern for handling volume permissions with non-root containers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-02 14:09:12 +01:00
LearningCircuit	e2ab7d0f20	fix: use formatted_findings from API and proper source format - Use formatted_findings when available (already includes sources) - Remove custom extract_sources function - Fix source URL extraction (API uses 'link' not 'url')	2025-12-01 22:25:24 +01:00
LearningCircuit	b9680e58c9	feat: add argparse, static mode, and search sources to E2E test - Add argparse for easier local testing (per djpetti's review) - Add --mode static for regression testing with fixed query - Include search sources in JSON output and PR comments - Support both ldr_research and ldr_research_static labels - Static query: 'What is Local Deep Research and how does it work?'	2025-12-01 21:46:07 +01:00
LearningCircuit	d171eb6b1d	fix: use valid OpenRouter model (google/gemini-2.0-flash-001) as default	2025-11-30 22:21:40 +01:00
LearningCircuit	b6ca3cec46	fix: pass API keys through settings snapshot	2025-11-30 02:01:02 +01:00
LearningCircuit	d28f269d10	fix: set llm.provider in settings to use OpenRouter instead of Ollama	2025-11-30 01:54:19 +01:00
LearningCircuit	d4327184a4	feat: add E2E research test via PR label trigger Add a reusable script and GitHub Actions workflow that tests the complete LDR pipeline (OpenRouter + Serper) by researching PR diffs. - scripts/ldr-diff-research.py: Standalone script that reads diff from stdin and outputs JSON with research results. Can be tested locally. - .github/workflows/e2e-research-test.yml: Workflow triggered by 'ldr_research' label that runs the script and posts results as a PR comment. Configurable via environment variables: - LDR_PROVIDER: LLM provider (default: openrouter) - LDR_SEARCH_TOOL: Search tool (default: serper) - LDR_MODEL: Model name (optional) - LDR_ITERATIONS: Research iterations (default: 1) Required secrets: OPENROUTER_API_KEY, SERPER_API_KEY	2025-11-30 01:29:15 +01:00
LearningCircuit	0d26c46c8a	Merge dev into sync-main-to-dev - resolve conflicts Resolved conflicts: - .gitleaks.toml: Combined regex patterns from both branches, added path allowlists - pyproject.toml: Kept updated versions from dev + added hypothesis from main - __version__.py: Keep 1.3.0 from dev - news.js: Removed duplicate toggleExpanded function (already exists at line 1291) - pdm.lock: Regenerated with pdm lock	2025-11-29 19:36:36 +01:00
LearningCircuit	309b2a619e	Fix shellcheck warnings in all shell scripts - Quote variables to prevent word splitting (SC2086) - Use 'read -r' to prevent backslash mangling (SC2162) - Use 'cd ... || exit' for safe directory changes (SC2164) - Use '-n' instead of '! -z' for string checks (SC2236) - Use pgrep instead of ps | grep (SC2009) - Check exit codes directly instead of using $? (SC2181) - Declare and assign separately for exports (SC2155) - Fix unused loop variables with underscore prefix (SC2034) - Remove stray markdown backticks from ollama_entrypoint.sh	2025-11-27 19:18:10 +01:00
LearningCircuit	cff33086ec	fix: resolve CI test failures (actionlint and trivy-scan) - Restore missing scripts/ollama_entrypoint.sh required by Dockerfile - Update actions/setup-python from deprecated v4 to v5 in workflow files - Fix security issue: move untrusted github.head_ref to environment variable - Fix shellcheck warnings: quote variables and use block redirects These changes address pre-commit actionlint failures and trivy-scan Docker build errors.	2025-11-11 22:25:27 +01:00
LearningCircuit	ed0212ba53	Delete scripts/dev/kill_servers.py	2025-11-02 00:16:39 +01:00
LearningCircuit	ceb1526de6	Delete scripts/ollama_entrypoint.sh	2025-11-02 00:16:12 +01:00
LearningCircuit	6738ecf86c	Delete scripts/test_unified_indexing.py	2025-11-02 00:15:45 +01:00
LearningCircuit	5c46491e9c	Delete scripts/create_unified_library_tables.py	2025-11-02 00:15:19 +01:00
LearningCircuit	a0025bde7c	Delete scripts/create_integrity_tables.py	2025-11-02 00:15:06 +01:00
LearningCircuit	590e0112c2	unified rag and library collection this will be more maintainable	2025-10-28 01:37:48 +01:00
github-actions[bot]	f8b1477042	Merge remote-tracking branch 'origin/dev' into sync-main-to-dev	2025-09-22 16:34:29 +00:00
LearningCircuit	f26bb4545a	Merge pull request #849 from LearningCircuit/LearningCircuit-patch-1 Delete scripts/check_research_db.py	2025-09-22 18:34:16 +02:00
github-actions[bot]	de4354ccfe	Merge remote-tracking branch 'origin/dev' into sync-main-to-dev	2025-09-22 16:22:14 +00:00
LearningCircuit	e6e21d4407	Delete scripts/check_benchmark_db.py Not sure if this is really useful	2025-09-22 01:18:58 +02:00
LearningCircuit	7087689690	Delete scripts/check_research_db.py	2025-09-22 01:15:59 +02:00
LearningCircuit	d0b900f8e3	Delete scripts/check_metrics.py	2025-09-21 15:55:44 +02:00

63 Commits