mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
a5cd3c58d73c7b0d8d11beaa28617323296ae0c5
163 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
cc98aa95f5 | docs: add interpretation guide to BENCHMARKING.md (#3723) | ||
|
|
1126e3747d |
chore: auto-bump version to 1.6.4 (#3682)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
2f7056a52c |
feat(notifications): default-off + env-only master switch for SSRF rebinding risk (#3675)
* docs(security): document DNS-rebinding TOCTOU window in notification SSRF
The notification URL validator (PR #3092 / #3311) resolves hostnames once at validation time and checks resolved IPs against private ranges, but Apprise re-resolves at send time -- a DNS-rebinding attacker can serve a public IP at validation and a private IP at send. Apprise exposes no DNS/Session hook to close this in code without fragile monkey-patching of its plugin internals.
Given LDR's threat model (single-tenant local app, @login_required on settings routes, per-user encrypted SQLCipher DBs), the residual risk is acceptable as long as it's visible. This change makes it visible:
- Updated the inline comment in NotificationURLValidator._is_private_ip to describe the TOCTOU window and recommend plugin schemes (discord://, slack://, ntfy://, etc.) over raw http(s):// webhooks.
- Added a parallel comment in ssrf_validator.validate_url, since safe_requests has the same pattern.
- Added a "Notification Webhook SSRF" subsection to SECURITY.md with the rebinding window, the rationale for not closing it in code, the threat-model factors that make it acceptable, and operator-side mitigations (prefer plugin schemes, restrict egress).
No behavior change.
* feat(notifications): default-off + env-only master switch (LDR_NOTIFICATIONS_ENABLED)
Outbound notifications via Apprise carry a known DNS-rebinding TOCTOU window: the URL validator resolves once at config time, but Apprise re-resolves at send time, and a logged-in user with a controllable domain can rebind to internal services on the LDR server (e.g. 127.0.0.1:<internal-port>) or the local network. The window cannot be closed in code without fragile monkey-patching of Apprise's plugin internals (HTTPS-only, doesn't follow redirects).
Since LDR is multi-user (per-user SQLCipher DBs behind @login_required), the right default is to keep outbound notifications off until the operator explicitly opts in -- a server-level decision, not something a logged-in user can flip via the settings API.
Changes:
- Add notifications.enabled env-only setting (default False), registered alongside notifications.allow_private_ips in env_definitions/security.py. Auto-mapped to LDR_NOTIFICATIONS_ENABLED.
- NotificationManager reads the env at __init__ and gates send_notification before any other check; force=True bypasses per-user toggles only, never the operator switch.
- NotificationService now takes enabled=False; test_service refuses with a clear error pointing at LDR_NOTIFICATIONS_ENABLED. The settings route /api/notifications/test-url passes the env-read value through.
- Refresh inline TOCTOU comment in NotificationURLValidator._is_private_ip to reflect the new gate, and add a parallel comment near getaddrinfo in ssrf_validator.py for cross-cutting consistency (same TOCTOU pattern).
- Rewrite the SECURITY.md "Notification Webhook SSRF" subsection: lead with "disabled by default", explain how to enable, document the residual risk operators are accepting when they flip the switch.
- Tests:
  - tests/notifications/conftest.py autouse-enables the gate so existing tests exercising the inner logic still work.
  - TestMasterSwitchEnvGate covers the gate behavior explicitly: env unset => send_notification returns False (even with force=True), test_service returns a disabled error.
  - TestNotificationManager in test_notification_coverage.py gets a class-scoped autouse fixture for the same reason.
  - Existing NotificationService(...) calls in tests pass enabled=True so their inner-logic assertions keep working.
This is a behavior change. Existing users with notifications working will need to set LDR_NOTIFICATIONS_ENABLED=true on upgrade.
* fix(notifications): rename env gate to allow_outbound + clearer logs + docs
Two issues with the previous commit:
1. Key collision. The env gate was named notifications.enabled, which is already a (currently dormant) per-user DB setting in default_settings.json. Renaming the env-only setting to notifications.allow_outbound (env: LDR_NOTIFICATIONS_ALLOW_OUTBOUND) keeps the two layers distinct. Symmetric with the existing notifications.allow_private_ips env-only setting.
2. Log levels. The gate-closed paths logged at DEBUG, which is invisible under default log configuration. An operator wondering why notifications aren't firing wouldn't see the actionable signal. Upgrade to WARNING with messages that explicitly name the env var and point at SECURITY.md.
Also:
- Regenerate docs/CONFIGURATION.md (auto-generated from env definitions + default_settings.json) so LDR_NOTIFICATIONS_ALLOW_OUTBOUND appears in the env-only table at line 52.
- Add a "Server-Side Opt-In Required" section at the top of docs/NOTIFICATIONS.md, including the symptoms an operator would see when the gate is closed (so debugging "why isn't this working?" is a one-step lookup).
- Rename NotificationService kwarg enabled -> outbound_allowed and the manager's self._notifications_enabled -> self._outbound_allowed for internal consistency with the new setting name.
- Update tests + conftest accordingly.
507 tests pass, pre-commit clean.
* fix(notifications): defense-in-depth gate in service.send() + module-scope test fixture
Two non-blocker recommendations from code review on PR #3675:
1. service.send() did not enforce the outbound_allowed gate itself -- the manager always wraps it, but a future direct caller could bypass. Add the same WARNING-level guard at the top of send() that test_service already has, so the security boundary lives at the service layer (one place) instead of relying on call-chain discipline.
2. Promote tests/web/services/test_notification_coverage.py's autouse gate-opening fixture from class-scope (TestNotificationManager only) to module-scope, so any future test class added to the file picks it up automatically. Drop the now-redundant class-scoped duplicate.
Tests:
- TestSendOutboundGate in tests/notifications/test_service.py covers the new gate: outbound_allowed=False => send() returns False without touching Apprise (.notify and .add must not be called); gate open => the existing send path runs.
- _make_service helper in test_service_extra_coverage.py now sets outbound_allowed=True so the SSRF/Apprise-failure tests exercise the inner logic, not the gate.
509 passed, 1 skipped, pre-commit clean. |
||
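A minimal sketch of the env-only gate described in the commit above, for readers who want the shape of it. Only the variable name LDR_NOTIFICATIONS_ALLOW_OUTBOUND and the outbound_allowed keyword come from the commit message; the class, the transport callable, and the set of accepted truthy strings are assumptions made for illustration and are not the project's actual code.

```python
import logging
import os

logger = logging.getLogger(__name__)

ENV_VAR = "LDR_NOTIFICATIONS_ALLOW_OUTBOUND"  # name taken from the commit message


def outbound_allowed(environ=None) -> bool:
    """Return True only when the operator explicitly opted in (default off).

    The accepted truthy spellings are an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "").strip().lower() in {"1", "true", "yes", "on"}


class SketchNotificationService:
    """Hypothetical stand-in, not the real NotificationService."""

    def __init__(self, transport, outbound_allowed: bool = False):
        self._transport = transport  # callable(title, body) -> bool
        self._outbound_allowed = outbound_allowed

    def send(self, title: str, body: str) -> bool:
        # The gate sits at the top of send() so a direct caller cannot bypass it.
        if not self._outbound_allowed:
            logger.warning(
                "Outbound notifications are disabled; set %s=true to enable "
                "(see SECURITY.md).",
                ENV_VAR,
            )
            return False
        return bool(self._transport(title, body))
```

With the variable unset, send() returns False and the transport is never called, which is the property the TestSendOutboundGate tests are described as checking.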
|
|
08969f5ad2 |
config: disable general.enable_fact_checking by default (#3672)
* config: disable general.enable_fact_checking by default
Flip the default for general.enable_fact_checking from true to false. When enabled, the citation handler does an extra LLM call per analyze_followup that re-analyzes sources for consistency and injects the result into the synthesis prompt. For agentic strategies (LangGraph in particular) this is largely duplicative (the agent already cross-references sources during its tool loop) and adds cost, latency, and prompt bloat without clear quality gains. Users who rely on the extra validation pass can re-enable it via the setting. See #3671 for the discussion and trade-offs.
Refs #3671
* config: sync docs and edge-case test to new fact-check default
- Regenerate docs/CONFIGURATION.md so the documented default for general.enable_fact_checking reflects the new value (false).
- Update test_fact_checking_enabled_by_default in test_citation_handler_edge_cases.py: with no settings_snapshot, the handler now sees fact-checking as disabled, so analyze_followup invokes the LLM only once. Renamed to test_fact_checking_disabled_by_default and flipped the call-count assertion accordingly.
Refs #3671 |
||
|
|
c45785dc63 |
chore: auto-bump version to 1.6.2 (#3639)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
d3570c355a |
refactor: remove dead benchmark and citation functions (#3187)
* refactor: remove dead benchmark and citation functions
* cleanup: drop orphan cli.py stub, orphaned tests, stale docs
Follow-up to #3187 addressing djpetti's review and the failing All Pytest Tests + Coverage check.
- Delete benchmarks/cli.py entirely. The file was already shadowed by the benchmarks/cli/ package (same import path), so the deprecation stub was unreachable dead code.
- Remove test classes that imported now-deleted functions: check_system_resources, plot_parameter_importance, plot_quality_vs_speed, CitationFormatter._to_superscript. This is what the pytest lane was failing on.
- Update docs/cli-tools.md and benchmarks/metrics/README.md to drop references to the removed CLI module and plot helpers. |
||
|
|
ac23d3f847 |
docs(websocket): document auth requirement for WS handshake (#3658)
Two small additions reflecting PR #3127:
- troubleshooting.md: add an "Authentication / expired session" bullet to the WebSocket Progress-Updates-Not-Showing section so users know the handshake can be rejected after session expiry or server restart.
- env_configuration.md: clarify in the CORS / WebSocket Security section that WebSocket connections require an authenticated session in addition to passing the CORS check. |
||
|
|
503d244562 |
chore: auto-bump version to 1.6.0 (#3399)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
3b1d6c6b2f |
feat: redesign journal quality system with data-driven scoring and predatory auto-removal (#3081)
* feat: redesign journal quality system with data-driven scoring and predatory auto-removal
Replace the expensive LLM-based journal scoring (SearXNG + AdvancedSearchSystem
per journal) with a tiered data-driven approach:
Tier 0: DB cache (instant, from previous runs)
Tier 1: Predatory check — auto-removes results from blacklisted journals/publishers
Tier 2: OpenAlex snapshot — h-index + DOAJ from ~217K sources (downloaded at runtime)
Tier 3: DOAJ check — quality floor for open access journals (downloaded at runtime)
Tier 4: LLM analysis — SearXNG fallback (now optional, not required)
Bundled data:
- Stop Predatory Journals: 6K predatory publishers/journals (MIT license)
Downloadable data (CC0, loaded if present):
- OpenAlex sources snapshot: 217K journals/conferences with h-index, impact factor
- DOAJ journals: 22K+ journals with DOAJ Seal status
Key changes:
- Extended Journal DB model with bibliometric fields (h-index, impact factor,
DOAJ, predatory status, provenance tracking) + Alembic migration
- JournalReputationFilter now uses tiered scoring with journal dedup
- SearXNG no longer required — filter works with bundled data alone
- Predatory journals auto-removed (with whitelist override for false positives)
- Added journal filter to Semantic Scholar (was the only scientific engine without it)
- OpenAlex results now include source_id and source_type for direct lookups
- Fixed score parsing (regex instead of strict int()), prompt truncation,
fail-fast on SearXNG failures, lru_cache on name cleaning
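The tiered approach above amounts to trying cheap lookups first and stopping at the first tier that produces a score. The sketch below shows that control flow only; the function name, the tier labels, and the toy lookup tables are hypothetical, and the score of 10 is a placeholder rather than a value taken from the project (the score of 1 for predatory journals is mentioned later in this log).

```python
from typing import Callable, Optional

# Each tier takes a journal name and returns a score, or None to fall through.
Tier = Callable[[str], Optional[int]]


def score_journal(
    name: str, tiers: list[tuple[str, Tier]]
) -> tuple[Optional[int], Optional[str]]:
    """Return (score, tier_label) from the first tier that knows the journal."""
    for label, lookup in tiers:
        score = lookup(name)
        if score is not None:
            return score, label
    return None, None  # no tier matched; the caller decides what to do


# Toy stand-ins for the real data sources (all values are placeholders).
cache: dict[str, int] = {}
predatory = {"fake journal of everything"}
openalex = {"nature": 10}

tiers: list[tuple[str, Tier]] = [
    ("tier0_cache", lambda n: cache.get(n)),
    ("tier1_predatory", lambda n: 1 if n in predatory else None),
    ("tier2_openalex", lambda n: openalex.get(n)),
]
```

Ordering the list from cheapest to most expensive is what lets an optional LLM tier sit at the end without being called for journals the earlier tiers already cover.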
* fix: address code review findings from Round 1
- Remove dead __check_result method, update tests to use filter_results
- Fix predatory substring matching (min length guard prevents false positives)
- Add name parameter to is_whitelisted for journals without ISSN
- Fix migration: server_default for Booleans, correct index creation logic
- Improve safety net logging in filter_results
* fix: forward journal quality fields through _get_full_content (Round 2 review)
OpenAlex _get_full_content was constructing a new result dict without
forwarding journal_ref, openalex_source_id, and source_type from the
preview. This effectively disabled journal quality filtering for all
OpenAlex results since the content filters run after full content
retrieval and couldn't find the journal_ref key.
* fix: address Round 3 review findings — bugs, thread safety, tests
Critical bug fixes:
- Add missing quality_model column to migration 0005
- Fix dedup to use richest metadata (two-pass approach)
- Predatory cache entries no longer expire via normal TTL
Performance:
- Build indexed sets for predatory data at load time (O(1) exact match)
- Add threading.Lock for singleton and lazy property loading
Data quality:
- Deduplicate predatory.json (removed 21 dupes)
Test coverage (38 new tests):
- JournalDataManager: derive_quality_score, is_predatory, is_whitelisted,
lookup_openalex, lookup_doaj, _expand_openalex_record, singleton
* fix: address all review findings — critical bugs, security, performance
Critical bugs: NASA ADS journal_ref, empty string guard, regex name
cleaning with LLM fallback, DOAJ field overwrite protection, predatory
cache TTL re-evaluation.
Security: prompt injection sanitization, log injection prevention,
Unicode NFKC normalization for predatory lookups.
Important bugs: predatory publish-after-indexes race fix, Tier 0 DB
error handling.
Performance: regex-based name cleaning eliminates ~5 LLM calls/batch.
* fix: .text() → .content for LangChain, improve regex name cleaning
Critical runtime fix:
- LangChain AIMessage has .content attribute, not .text() method.
Both LLM calls in the filter (name cleaning and Tier 4 scoring)
would crash with AttributeError at runtime. Fixed both occurrences
and updated all test mocks.
Regex improvements:
- Add bare trailing citation number stripping (", 95, 146802")
- Add volume(issue) pattern stripping ("141(5)")
- Fix month regex: require at least 1 digit after month name and
add word boundaries (prevents "May" in journal names being stripped)
- Only skip LLM when regex result has no residual numerics — complex
citation strings like "Phys. Rev. Lett. 95, 146802 (2005)" correctly
fall through to LLM instead of returning partially-cleaned name
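To make the regex rules above concrete, here is a small sketch of a cleaner that applies the three described strippings and reports whether residual digits remain (in which case the commit says the string should still go to the LLM). The patterns and the function name are this sketch's own guesses at the described behavior, not the project's actual expressions.

```python
import re

# "141(5)" style volume(issue) fragments.
_VOLUME_ISSUE = re.compile(r"\s*\b\d+\(\d+\)")
# A month name must be a whole word AND be followed by at least one digit,
# so "Mayo Clinic Proceedings" is left alone.
_MONTH_DATE = re.compile(
    r"\s*\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|"
    r"July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?)\.?\s+\d{1,4}\b",
    re.IGNORECASE,
)
# Bare trailing citation numbers such as ", 95, 146802".
_TRAILING_NUMBERS = re.compile(r"(?:\s*,\s*\d+)+\s*$")


def regex_clean(raw: str) -> tuple[str, bool]:
    """Return (cleaned_name, needs_llm)."""
    cleaned = _VOLUME_ISSUE.sub("", raw)
    cleaned = _MONTH_DATE.sub("", cleaned)
    cleaned = _TRAILING_NUMBERS.sub("", cleaned)
    cleaned = cleaned.strip(" ,;")
    # Any digit left over means the regex pass was only partial.
    needs_llm = bool(re.search(r"\d", cleaned))
    return cleaned, needs_llm
```

A string like "Phys. Rev. Lett. 95, 146802 (2005)" keeps its digits under these patterns, so needs_llm comes back True and the partially cleaned name is not trusted.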
* feat: add journal quality dashboard at /metrics/journals
Dashboard with summary stats, quality distribution chart, score source
doughnut, sortable/filterable journal table with pagination, quality
badges, trust signal icons, empty state, help panel, mobile responsive.
API: GET /metrics/api/journals — all journals + summary in one call.
* fix: XSS prevention, missing API fields, sort null handling in dashboard
Security:
- Add escHtml() helper for HTML entity escaping in all innerHTML
injections (journal names, publishers, predatory_source, source badges)
- Prevents XSS via crafted journal names containing HTML/JS
API:
- Add works_count and cited_by_count to journal API response
(bibliometric fields useful for dashboard display)
UX:
- Fix sort comparison with null values: nulls pushed to end consistently
instead of unpredictable placement from mixed Infinity/string comparison
* fix: dashboard null-quality filter, avg h-index N/A, core label
- Fix null-quality journals appearing in predatory tier filter
(quality || 0 coerced null to 0, which passed predatory check)
- Fix avg h-index showing "0" when no journals have h-index data
(API now returns null, frontend shows "—")
- Rename "Scopus Indexed" to "Core Indexed" (OpenAlex is_core
is CWTS core status, not Scopus indexing)
* feat: SQLite reference DB for dashboard with server-side pagination
Replace client-side 212K journal array with a shared read-only SQLite
database built from bundled JSON on first access. Near-zero RAM usage.
* perf: split summary from pagination queries in journal dashboard
Summary stats + chart data (3 SQL queries, ~130ms) are now fetched
only on initial page load via include_summary=true param. Subsequent
pagination, sorting, and filter changes only fetch the journal page
(1 query, ~7ms), making navigation feel instant.
* fix: expose Chart.js globally, split summary from pagination queries
- Add window.Chart = Chart in app.js so inline scripts can use Chart.js
(was imported but never exposed on window — caused ReferenceError)
- Split summary from pagination: include_summary=true only on initial
load, page/filter/sort skip the 3 extra SQL queries
- NOTE: run `npm run build` to rebuild the Vite bundle
* fix: guard Chart.js usage and defer initial load for module script timing
The Vite bundle loads as type="module" (deferred), but the inline
script in journal_quality.html runs immediately. Chart is not yet on
window when the script executes, causing ReferenceError that kills
the entire script block including the data loading call.
Fix: guard Chart usage with typeof checks, defer loadJournalPage
to window.onload so module scripts have finished executing.
* feat: upgrade journal filter logs from debug to info level
Users can now see the tiered scoring process in their logs:
- Tier 0: cache hit with score
- Tier 1: predatory detection + whitelist override
- Tier 2: OpenAlex match with h-index
- Tier 3: DOAJ match with seal status
- Tier 4: LLM analysis result
- Summary: passed/below-threshold/predatory breakdown
* fix: add 'the' prefix fallback for journal name lookups, add lookup logs
Many OpenAlex journals start with 'The ' (e.g., 'The Astrophysical
Journal Letters') but ArXiv journal_ref omits it. Now tries with/without
'the ' prefix when exact match fails — fixes ~5K potential Tier 2 misses
that would unnecessarily fall through to expensive Tier 4 LLM analysis.
Applied to both JournalDataManager (in-memory) and JournalReferenceDB
(SQLite). Added debug-level logs for lookup hits/misses.
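The prefix fallback can be pictured as a second dictionary probe with 'the ' added or removed. This is a sketch under the assumption that names are indexed in lowercase; the function name and the index shape are hypothetical.

```python
from typing import Any, Optional


def lookup_with_the_fallback(name: str, index: dict[str, Any]) -> Optional[Any]:
    """Exact match first, then retry with the leading 'the ' toggled."""
    key = name.strip().lower()
    if key in index:
        return index[key]
    # Toggle the article: drop it if present, add it if absent.
    alt = key[4:] if key.startswith("the ") else "the " + key
    return index.get(alt)
```

For example, with an index keyed on "the astrophysical journal letters", a lookup for "Astrophysical Journal Letters" misses on the exact probe and hits on the second one.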
* feat: quality tags in sources, sidebar menu, documentation
- Attach journal quality score to each result in filter_results
- Display quality tags in research output source lists:
[Q1 ★★★★★] for elite, [Q2 ★★★] for moderate, etc.
- Add "Journals" item to sidebar under Analytics section
- Create docs/journal-quality.md with full system documentation
* fix: restore docstrings, increase DOAJ Seal score, fix truncated file
Address djpetti's review comments:
- Restore full Args/Returns docstrings on __init__, create_default,
__db_session, __make_search_system, __clean_journal_name,
__analyze_journal_reputation, __save_journal_to_db
- Remove "unlike the previous version" reference from create_default
- Add clarifying comment on regex vs LLM name cleaning tradeoff
- Increase DOAJ Seal score from 6 to 7 (2-point spread vs 1-point)
- Fix file truncation from disk-full error (line 763)
* refactor: move build logic into journal_reference_db module
Eliminate sys.path hack, make build logic importable. Script is now
a thin CLI wrapper. derive_quality_score imported from data_manager
(canonical copy) instead of duplicating.
* fix: review findings — docs, sidebar, dashboard, test gaps
Address final review round findings:
- Fix DOAJ Seal score in docs (6→7)
- Sidebar: use url_for() instead of hardcoded URL
- Template: set active_page='journal-quality' for sidebar highlight
- Rename stat-scopus to stat-seal with label "DOAJ Seal" (was mislabeled)
- Always use window.onload for initial load (readyState fast path unsafe)
- Add tests for _format_quality_tag (6 tests, all 5 tier branches + None)
- Add tests for "the" prefix fallback in lookup_source (2 tests)
* feat: add CORE conference rankings (795 CS conferences)
Bundle CORE Rankings (ICORE2026) for automatic conference scoring:
A*→9, A→7, B→5, C→4. Acronym + proceedings prefix matching.
Eliminates Tier 4 LLM calls for major CS conferences.
* feat: add data source attribution to journal quality dashboard
Credit the open academic data projects that make the dashboard possible:
OpenAlex (CC0), DOAJ (CC0), CORE Rankings, Stop Predatory Journals (MIT).
Displayed as an attribution section at the bottom of the page.
* fix: remove CORE conference data (no open license)
CORE Rankings are copyrighted (c) 2013 Computing Research & Education
with no published open license. Redistribution in an MIT project is
not permitted without explicit permission.
Removed core_conferences.json from bundled data. The build function
_load_core_conferences gracefully returns {} when the file is absent.
Conference matching still works via OpenAlex data + proceedings prefix
stripping.
Verified remaining data licenses:
- OpenAlex: CC0 Public Domain (confirmed)
- DOAJ metadata: CC0 (confirmed on doaj.org)
- Stop Predatory Journals: MIT License (confirmed in GitHub LICENSE)
* docs: add data source attribution to README, docs, code, and dashboard
Credit open academic data projects at multiple touchpoints:
- README.md: Journal Quality feature links to data sources
- docs/journal-quality.md: expanded attribution table with websites
- data/__init__.py: license details per bundled file
- journal_reference_db.py: data sources in module docstring
- Dashboard: attribution section with links (already added)
All bundled data verified: OpenAlex (CC0), DOAJ metadata (CC0),
Stop Predatory Journals (MIT).
* fix: DOAJ Seal score consistency across all tiers
Tier 2 (OpenAlex) now cross-references DOAJ for Seal status via
dm.has_doaj_seal(issn). Tier 3 now calls derive_quality_score
instead of hardcoding score=6. All tiers consistently score
DOAJ Seal at 7. Fixed docs inconsistency.
* feat: add CitationMetadata model for structured academic metadata
New citation_metadata table stores bibliographic data on academic
research sources using CSL-JSON vocabulary. 1:1 with ResearchResource.
- CitationMetadata model: doi, arxiv_id, pmid, authors, year,
volume, issue, pages, container_title, journal_id FK, csl_json
- Migration 0006: create table + indexes
- citation_normalizer.py: engine-specific → CSL-JSON normalization
- extract_links: preserve citation fields (was dropping 90% of data)
- research_sources_service: create CitationMetadata for academic sources
- Quality never stored — derived via journal_id at query time
* refactor: simplify Journal table to only cache Tier 4 LLM results
Tiers 1-3 use bundled data (instant, no caching needed). Only Tier 4
(LLM) results cached in DB. Wire up journal_id FK on CitationMetadata.
* feat: auto-download journal data from GitHub Releases
Replace bundled data files with on-demand download:
- journal_data_downloader.py: fetch from GitHub Releases on first use
- Data in user dir (not package dir, read-only in pip installs)
- Dashboard shows download banner when data missing
- API: GET/POST /metrics/api/journal-data/{status,download}
- predatory.json (307KB) stays bundled, large files never in git
* refactor: fetch journal data from APIs instead of GitHub Releases
Fetch directly from OpenAlex and DOAJ public APIs. No redistribution
concerns — data fetched fresh from CC0 sources (~3 min first run).
* fix: review findings — h_index=0 edge case, dead code, missing field
- derive_quality_score: h_index=0 no longer bypasses DOAJ Seal score
(0 means newly indexed, not low quality)
- citation_normalizer: remove dead arxiv check in detect_engine
- extract_links: add source_engine to preserved fields
- paths.py: fix stale docstring (GitHub Releases → APIs)
* fix: DB race condition and journal name normalization (Round 3 review)
- Wrap __save_journal_to_db commit in try/except to handle concurrent
inserts gracefully (rollback + warning) instead of incorrectly
incrementing the SearXNG failure counter
- Add geographic qualifier stripping to regex cleaner: "(London)",
"(New York)", "(US)" etc. are now stripped deterministically,
preventing duplicate scoring of the same journal under variant names
* fix: DB race condition and journal name normalization (Round 3 review)
- S2 close() now calls super().close() to properly clean up the
JournalReputationFilter (SearXNG engine + LLM). Before this fix,
adding content_filters to S2 created a resource leak since S2's
close() override didn't delegate to BaseSearchEngine.close().
* fix: DB race condition and journal name normalization (Round 3 review)
- Fix predatory substring matching: check both directions for renamed
publisher variants while keeping >= 10 char guard
- DB cache read: logger.exception for stack trace preservation
- Model Boolean columns: add server_default=sa_false()
- Migration downgrade: drop indexes before columns
* fix: correct url_to_quality type annotation after merge (Round 4 review)
Type was `dict[str, dict]` but values are `int` scores from the journal
quality filter. Changed to `dict[str, int]`.
* fix: CI failures — sensitive logging and file write allowlist
- journal_data_downloader: use logger.exception() instead of f-string
with exception variable (sensitive-logging check)
- Add journal_data_downloader.py to file-write security check allowlist
(writes public CC0/MIT journal metadata, not user data)
* fix: skip journal reference DB tests when DB not built (CI timeout fix)
The test fixture was calling db.available which triggers _get_conn()
which auto-downloads 200K+ sources from OpenAlex API. In CI this caused
60s timeouts on 26 tests. Now checks db_path.exists() directly.
* fix: renumber migration 0005 → 0007 to resolve multiple-heads conflict
Main already has 0005_add_resource_document_id and 0006_add_citation_metadata.
Our migration was also numbered 0005, causing Alembic to reject login with
"multiple heads" error. Renumbered to 0007 with down_revision=0006.
* fix: align test mock chains with real Tier 0 DB query pattern
Tests were mocking .filter_by().first() but real code does
.filter_by().filter(score_source=="llm").first(). Fixed mock chains
to match. Also fixed docs typo: reanalysis_period default 265 → 365.
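The mismatch described above is easy to reproduce with unittest.mock: a MagicMock chain only returns the configured object when every call in the chain matches what the code under test actually does. The reader function and its arguments below are hypothetical stand-ins for the real query.

```python
from unittest.mock import MagicMock


def read_cached_row(session, name):
    """Hypothetical reader that mirrors the .filter_by().filter().first() shape."""
    return (
        session.query("Journal")
        .filter_by(name=name)
        .filter("score_source == 'llm'")
        .first()
    )


cached = object()

# Mock chain that matches the real call pattern.
good_session = MagicMock()
(
    good_session.query.return_value
    .filter_by.return_value
    .filter.return_value
    .first.return_value
) = cached

# Stale mock chain that skips the extra .filter() call.
stale_session = MagicMock()
stale_session.query.return_value.filter_by.return_value.first.return_value = cached
```

With the stale chain the reader gets back an auto-created MagicMock instead of the configured row, so assertions can pass or fail for the wrong reason.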
* fix: journal dashboard showing "not installed" when reference DB exists
get_journal_data_status() only checked for raw JSON source files, not
the compiled journal_reference.db. If the DB existed without source
JSONs (e.g., after cleanup), the dashboard refused to load.
* feat: add DOI-based venue identification and conference detection
Adds a pre-enrichment layer that resolves paper DOIs to OpenAlex source
IDs via batch lookup (up to 50 DOIs per HTTP request). This gives the
journal quality filter a reliable ID-based lookup path instead of
fragile name matching.
Changes:
- New: openalex_enrichment.py — batch DOI → source_id resolution
- Integration hook in search_engine_base.py for scientific engines
- Conference detection heuristic as fallback for papers without DOI
- Year stripping in OpenAlex lookup: "NeurIPS 2023" → "NeurIPS"
- NASA ADS now extracts DOI to result dict
- Fix stale AdvancedSearchSystem mocks in tests
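The batching step can be sketched independently of any HTTP client. The helper below only splits a DOI list into groups of at most 50, the limit named in the commit message; the function name is hypothetical and the request itself is deliberately left out.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each yielded batch would then be sent as one lookup request, so 120 DOIs cost three requests instead of 120.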
* fix: handle missing thread context in preview filter phase
The journal filter runs as a preview_filter (before LLM relevance) for
instant data lookups. But DB operations (Tier 0 cache, save) require
thread context which isn't available in the preview phase.
Fix: __db_session() returns None when no context available. Callers
skip DB operations gracefully — data-only tiers (1-3) still work.
* feat: disable Tier 4 LLM journal scoring by default (too slow)
* feat: institution scoring tier + DataSource refactor
- New DataSource ABC + registry under utilities/data_sources/ unifying
openalex, doaj, jabref, predatory, and institutions sources
- Add InstitutionSource (OpenAlex Institutions, ~123K records) for
affiliation-based scoring of preprints
- Add Tier 3.5 (institution lookup) to journal_reputation_filter
for the no-journal_ref salvage path and as a max() lift for
preprint repositories with weak Tier-2 scores
- Extract author affiliations in OpenAlex search engine
- Wire JournalReputationFilter into PubMed engine and fix journal_ref
field aliasing
- Tighten regex cleaner for journal_ref (year/month/volume debris)
- Delete bundled src/local_deep_research/data/ — all sources now
fetched at runtime with shared auto_download policy
- Dashboard banner shows all academic data sources with license + status
* refactor: consolidate journal-quality system into one package with SQLAlchemy
- New package src/local_deep_research/journal_quality/ groups all
journal-related modules (downloader, db, models, scoring, data_sources)
- Single source of truth: gz files compile into one journal_quality.db
via build_db(); JournalDataManager dict-based loader is deleted
- SQLAlchemy 2.0 ORM throughout (models.py + db.py); filter call sites
unchanged thanks to dict-shaped lookup return values
- Read-only enforcement at three layers: SQLite mode=ro&immutable=1,
POSIX chmod 0o444 after build, and a pre-commit hook that bans
cross-module writable opens of journal_quality.db
- Downloader rebuilds the DB synchronously after each successful fetch
- New tables: predatory_journals/_publishers/_hijacked, institutions,
abbreviations
- Tests migrated to tests/journal_quality/; 207 tests pass
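Two of the three read-only layers above can be illustrated with the standard library alone. The sketch assumes a plain sqlite3 connection rather than the SQLAlchemy engine the commit describes, and the helper names are hypothetical; the URI parameters mode=ro and immutable=1 and the 0o444 permission come from the commit message.

```python
import os
import sqlite3
from pathlib import Path


def seal_after_build(db_path: str) -> None:
    """Drop write permission once the build step has finished."""
    os.chmod(db_path, 0o444)


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open the database through a file: URI with mode=ro&immutable=1."""
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro&immutable=1"
    return sqlite3.connect(uri, uri=True)
```

A connection opened this way can run SELECT statements, while any INSERT or UPDATE raises sqlite3.OperationalError, independent of the file permissions.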
* fix: P0/P1 bugs from journal-quality code review
- P0: flag hijacked journals as predatory in _populate_sources
(loaded into predatory_hijacked but never checked against sources)
- P0: insert DOAJ-only journals (~8K rows) via second pass over
doaj_data; previously only OpenAlex venues entered the DB
- P0: replace `mod._ref_db = None` with `reset_db()` in metrics
rebuild route (the singleton attr is `_db`, not `_ref_db`)
- P0: change JournalQualityDB._lock to RLock to prevent first-run
deadlock (_ensure_engine → build_db → reset_db re-acquires lock)
- P1: dedup sources on (name_lower, issn) so print + electronic
ISSN variants both survive; drop unique=True on Source.name_lower
- tests: cover hijacked, DOAJ-only, and dual-ISSN cases
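The RLock change is worth a short illustration, since the deadlock only appears when the same thread re-enters the lock. The class below is a hypothetical reduction of the _ensure_engine -> build_db -> reset_db chain named above, not the real JournalQualityDB.

```python
import threading


class LazyDB:
    """Toy singleton-style holder with a re-entrant lock."""

    def __init__(self) -> None:
        # With threading.Lock here, ensure_engine() would hang on first use,
        # because reset() tries to take the lock the same thread already holds.
        self._lock = threading.RLock()
        self._engine = None
        self.builds = 0

    def reset(self) -> None:
        with self._lock:
            self._engine = None

    def _build(self) -> None:
        self.builds += 1
        self.reset()  # re-acquires the lock held by ensure_engine()

    def ensure_engine(self) -> str:
        with self._lock:
            if self._engine is None:
                self._build()
                self._engine = "engine"  # placeholder for a real engine object
            return self._engine
```

An RLock tracks its owning thread and a recursion count, so nested acquisition by the owner succeeds while other threads still wait.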
* fix: resolve CI failures on journal-quality refactor
- pre-commit: add missing .pre-commit-hooks/check-journal-quality-readonly.py
to git (file existed locally but was never committed, so CI couldn't
exec it)
- file-writes scan: extend allowlist to cover the new
journal_quality/downloader.py and journal_quality/data_sources/*.py
paths (the old `journal_data_downloader.py` entry no longer matches
after the package move)
- mypy: fix 12 errors in journal_quality/db.py
- explicit list[] annotation on `wheres`
- dict comprehension on Row sequence in get_source_distribution
- wrap loader returns in dict() so SQLAlchemy stub Any-types resolve
- type: ignore[arg-type] on bulk_insert_mappings (known stub gap;
SQLAlchemy 2.x types accept type[T] at runtime but stubs say Mapper)
- CodeQL py/incomplete-url-substring-sanitization: anchor doi.org URL
parsing on scheme prefixes instead of substring `in` check
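The CodeQL finding above concerns checks of the form `"doi.org" in url`, which also match URLs that merely contain that text. A sketch of the anchored alternative follows; the prefix list, the function name, and the acceptance of bare "10." identifiers are assumptions of this example.

```python
from typing import Optional

_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def extract_doi(value: str) -> Optional[str]:
    """Return the DOI only when the string starts with a known DOI URL prefix."""
    candidate = value.strip()
    lowered = candidate.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            return candidate[len(prefix):] or None
    # Assumption: a bare identifier starting with "10." is already a DOI.
    if lowered.startswith("10."):
        return candidate
    return None
```

Because each prefix ends with a slash, a host such as doi.org.evil.example does not match, and neither does a URL that carries doi.org in its query string.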
* refactor: address djpetti review comments on journal quality system
Tier 4 LLM scoring is now opt-in via the new
search.journal_reputation.enable_llm_scoring setting (default off) instead
of being unreachable behind a hardcoded flag. The redundant in-process
lru_cache on the LLM analyzer is gone - Tier 0 (DB cache) already covers
repeat lookups, and keeping the cache only masked DB write failures.
Trailing-year stripping for conference names ("NeurIPS 2023" -> "NeurIPS")
moves into __regex_clean_journal_name where it belongs, replacing the
post-hoc retry block in __score_journal.
DOAJ Seal score bumped 7 -> 8 to reflect the certification meaning more
faithfully (top ~10% of DOAJ journals, curated against best OA practices).
The h-index >= 7 tier mapping is unchanged so no test fixtures break.
Adds /api/journals/research/<id> + a "View Journals" button on the research
details page so users can see the journals encountered in a single research
session, not just the cross-research aggregate. Joins through
CitationMetadata -> ResearchResource without schema changes.
Adds quartile (Q1-Q4) as a display-only signal on Source rows, derived at
build time from cited_by_count percentile within each source_type. Quality
scoring is unchanged - h-index remains the canonical bibliometric.
Magic numbers in scoring.py / db.py extracted into a Journal Quality
Scoring Thresholds section in constants.py. Institution scoring is now
consolidated to scoring.py::institution_score_from_h_index, fixing an
unreachable branch in db.py::score_from_affiliations along the way.
Misc:
- OPENALEX_ENRICHMENT_API_TIMEOUT lifted into constants.py (was hardcoded 15)
- Deleted scripts/build_journal_reference_db.py - auto-build on first
access plus the dashboard rebuild button cover all use cases
* perf(journal-quality): switch data sources to bulk dumps + release-gate test
Replace paginated REST API fetches with public bulk snapshots:
- OpenAlex Sources: S3 manifest + parts (~280K, ~270s vs 5-10min)
- OpenAlex Institutions: S3 manifest + parts (~120K, ~156s vs 5-10min)
- DOAJ: single CSV dump (~22K, ~2s)
Bulk paths are the OpenAlex/DOAJ-recommended way to pull the full
dataset and eliminate hundreds of rate-limited requests on every
"Download Data" click. Compact output formats are preserved so the
build pipeline and runtime accessors are unchanged.
Add a release-gate integration test + dedicated workflow that
downloads all 5 sources in parallel, builds the reference DB end
to end, and scores a real journal. Catches upstream schema breaks
(renamed fields, restructured dumps) before we cut a release.
* test(journal-quality): exercise dashboard query methods in release gate
* docs(journal-quality): credit upstream data providers on dashboard
* docs(journal-quality): add 'How It Works' tab explaining tiered scoring
* fix(journal-quality): score unknown journals as 3, log institution names
- Lower truly-unknown journals (no OpenAlex/DOAJ/Tier 3.5 hit) from
pass-through to score 3 so the default threshold (4) actually filters
them. Distinct from predatory (1) — these are merely unknown.
- Fix AttributeError in OpenAlex search engine when work has DOI key
with explicit None value: use `work.get('doi') or work_id` instead
of `work.get('doi', work_id)`. Was dropping ~14% of results per
search before they reached the filter.
- Include matching institution names in Tier 3.5 log lines so the
affiliation salvage path is debuggable.
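The `dict.get` pitfall behind the DOI fix can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical `work` payload whose field values are invented for illustration; it is not the project's actual search-engine code.

```python
# Hypothetical payload: the "doi" key is present but explicitly null.
work = {"id": "W123", "doi": None}
work_id = work["id"]

# The default is only used when the key is missing, so this yields None.
with_default = work.get("doi", work_id)

# `or` also falls back when the stored value is None (or otherwise falsy).
with_or = work.get("doi") or work_id

print(with_default)
print(with_or)
```

The `or` form also replaces other falsy values such as an empty string, which is usually what is wanted for identifiers but would be wrong for settings where `False` or `0` is meaningful.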
* refactor(journal-quality): demote per-journal scoring logs to DEBUG, log institutions on score-3
* fix(openalex): handle None values for display_name, id, source.id
OpenAlex routinely returns these keys with explicit null values, which
bypassed the dict.get default and crashed downstream string operations
(slicing, split). Same antipattern as the 'doi' fix in
|
||
|
|
b516e5fe34 |
refactor: delete 6 dead files + 17 test files in advanced_search_system (#3184)
Verified via codebase-wide grep (zero production imports for each).
Source files deleted:
- query_generation/adaptive_query_generator.py - orphaned query generator
- source_management/diversity_manager.py - orphaned diversity system
- search_optimization/cross_constraint_manager.py - orphaned clustering
- constraint_checking/intelligent_constraint_relaxer.py - orphaned relaxer
- evidence/requirements.py - exported but never used
- answer_decoding/browsecomp_answer_decoder.py - exported but never instantiated
Also deleted 17 corresponding test files and updated __init__.py exports. |
||
|
|
f4fad9196c |
refactor: delete dead entity_aware_source_strategy + clean stale conftest (#3205)
* refactor: delete dead entity_aware_source_strategy + clean stale conftest entries Verified: EntityAwareSourceStrategy has zero production usage - not in search_system_factory.py, not in strategies/__init__.py, not imported by any other strategy. Only referenced in source_based_strategy.py docstring comments. Also cleaned 4 stale entries from tests/strategies/conftest.py STRATEGY_IMPORTS list for strategies already deleted or being deleted. * docs(notes): rewrite pr-3205 notes — reference git, don't duplicate Notes are commentary on the code that lives in git, not a mirror of it. Drop the verbatim prompt blocks and the NER code snippet; keep a short prose summary per novel idea plus a pointer to PR #3205 for the pre-deletion code. Net effect: LOC down, density up. Someone who wants the exact EntityAwareQuestionGenerator prompts can `git show 032b22232^:src/...` or read the PR diff. |
||
|
|
3c66fa0ec3 |
feat: add strategy-deletion documentation hook (#3529)
* feat: add strategy-deletion documentation hook Any commit that deletes a .py file under src/local_deep_research/advanced_search_system/strategies/ now requires adding or updating a .md file under docs/strategies/deleted/ in the same commit. This preserves novel prompts, heuristics, and thresholds before they disappear from the living tree. The hook exempts __init__.py and base_strategy.py (infra, not strategies) and reads git diff --cached directly so it catches deletions (pre-commit's default file list omits them). docs/strategies/deleted/README.md explains the convention and includes a file template. Existing deleted strategies aren't retroactively flagged — the hook is forward-looking. * refactor(hook): broaden scope to entire advanced_search_system tree The hook now triggers on deletions of any .py file under src/local_deep_research/advanced_search_system/ — not just under strategies/. Question generators, constraint checkers, filters, candidate explorers, and other components under that tree also carry novel ideas worth documenting before deletion. Exempt list narrowed to __init__.py aggregators only. base_*.py files are NOT exempt: deleting a base class is a significant refactor that deserves a notes file. README updated to reflect the broader scope. * refactor(hook): handle rename-out-of-scope, case-insensitive exempt, inline template Three improvements based on AI code review on PR #3529: 1. Rename-out-of-scope is now treated as a deletion. `git mv` on a strategy/question-generator out of src/local_deep_research/advanced_search_system/ removes it from the tracked module even though the file survives elsewhere; the hook now catches that case. Renames *within* the scope are legitimate refactors and continue to pass. Copies (C status) are also handled cleanly — a copy leaves the original, so it doesn't count as a deletion. 2. 
The exempt check now lowercases filenames, defending against case-insensitive filesystems (macOS, Windows) where __Init__.py would otherwise slip past.
3. The blocking error message now prints the full notes-file template and a short checklist of what "Novel ideas preserved here" should contain. Previously the hook said "see README"; now the developer can copy the skeleton directly from the terminal output and start filling it in without context-switching.
No behaviour change for the common path (simple deletion of a .py file in scope without a notes file still blocks with the same exit code).
* docs(hook): notes should reference git, not duplicate it
Rewrite the README and the inline error-message template so the notes convention is clearly "commentary on the code in git", not a mirror of it.
Before: authors were asked to paste verbatim prompts, numeric constants, hardcoded lists, and heuristic recipes into the notes file. That re-hosts what git already stores permanently and makes the notes files long and tedious.
After: each novelty bullet is 1-2 sentences explaining what the component did that was different from the successor, why the difference was interesting, and whether it was validated. Readers who want the exact prompt follow the deletion PR link or `git show <sha>:<path>`.
The hook template and error message both explicitly warn against pasting code blocks. The README is rewritten around the "reference, don't duplicate" principle with a worked example of the intended shape. |
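The deletion rules described in this commit (plain deletions, renames out of scope, copies, and the case-insensitive exempt check) can be sketched as a small classifier. This is an illustration only: it assumes the staged changes arrive in the tab-separated format of `git diff --cached --name-status`, and the names `in_scope` and `deletions_needing_notes` are hypothetical, not taken from the actual hook.

```python
from pathlib import PurePosixPath

SCOPE = "src/local_deep_research/advanced_search_system/"
EXEMPT = {"__init__.py"}  # compared against lowercased file names


def in_scope(path: str) -> bool:
    """True for non-exempt .py files under the watched tree."""
    name = PurePosixPath(path).name.lower()
    return path.startswith(SCOPE) and path.endswith(".py") and name not in EXEMPT


def deletions_needing_notes(name_status: str) -> list[str]:
    """Return in-scope paths that count as deleted and so need a notes file."""
    flagged = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if status == "D" and in_scope(paths[0]):
            flagged.append(paths[0])
        elif (
            status.startswith("R")
            and in_scope(paths[0])
            and not paths[1].startswith(SCOPE)
        ):
            # A rename out of the tree removes the module from scope.
            flagged.append(paths[0])
        # "C" (copy) leaves the original in place, so it is never a deletion.
    return flagged
```

Taking the diff output as a string rather than calling `git` directly keeps the rules testable without a repository.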
||
|
|
bab0f61b66 |
chore(hooks): require UtcDateTime in migrations too (#3523)
Tighten check-datetime-timezone so the UtcDateTime rule applies to both models and migrations. Supersedes the inverted approach in #3515, which tried to accept sa.DateTime(timezone=True) inside migrations.
- Rewrite the AST walker: handle sa.Column / bare Column, positional type arg at any index, bare Column(UtcDateTime) without parens (the hook's own example), and ast.IfExp with both branches inspected independently so a violation in either arm is still flagged.
- Anchor the path filter on src/local_deep_research/ to stop false-positives on tests/database/models/ and partial-name matches like database/models_backup/.
- Update .pre-commit-config.yaml name/description and the stale CI_CD_INFRASTRUCTURE.md hook table entry.
- Add tests/hooks/test_check_datetime_timezone.py with 20 cases: violations (models / migrations / conditional types / batch runs / bare names), allows (UtcDateTime with import, combo import order, empty / syntax-error files), and path-filter boundaries. |
||
|
|
285eb07fb7 |
fix(journal-reputation): sync stale threshold default 4 → 2 (#3524)
Two sites still document / read the legacy default of `4` even though the authoritative default in `src/local_deep_research/defaults/default_settings.json` has been `2` since the journal-quality redesign (PR #3081 family) lowered it.
- `docs/CONFIGURATION.md:534`: table cell documented default `4`; corrected to `2` and added the "drops predatory (score 1) only" note already used in `docs/journal-quality.md` and the JSON description.
- `advanced_search_system/filters/journal_reputation_filter.py:72`: `get_setting_from_snapshot("search.journal_reputation.threshold", 4, ...)`. The fallback is effectively unreachable in production (settings are seeded from `default_settings.json` on first-run), but the mismatch was misleading to readers and would silently change filter behavior for any caller that bypasses the snapshot. |
||
|
|
ce0fdf2fdd |
chore(python): bump supported floor from 3.11 to 3.12 (#3518)
## Root cause of PR #3480 failure The weekly PDM update bot (`update-dependencies.yml`) ran on Python 3.x (latest, currently 3.13/3.14) while the project declared `requires-python = ">=3.11,<3.15"`. PDM's resolver evaluates candidates against the interpreter it's running on, not the project's `requires-python` floor. That let the bot pick packages that recently dropped 3.11 support: - arxiv 3.0.0 (requires >=3.10, breaks on 3.11 install attempt) - rich 15.0.0 (requires >=3.9.0 per new metadata) - virtualenv 21.2.4 (dropped 3.11) - importlib-resources 7.1.0 (requires >=3.10) The resulting `pdm.lock` was valid on 3.13 but would fail to install on 3.11/3.12, so a downstream `pdm lock --check` caught the mismatch and the bot PR needed a manual `pdm lock` follow-up commit. A prior attempt (PR #3507) tried to patch this with `pdm lock --refresh` in the bot — that only rewrites hashes; it can't un-pick packages that violate the floor. The real fix is to align the resolver's interpreter with the `requires-python` floor. ## What this PR does 1. **Raises the floor to 3.12** in `pyproject.toml` (`requires-python`, `[tool.mypy] python_version`). Python 3.11 goes EOL Oct 2027 and ecosystem packages are already dropping it; 3.12 has the largest PyPI install share (~30%) and upstream support through Oct 2028. 2. **Pins the bot runner to '3.12'** (was `3.x`) — resolver now runs at the floor, guaranteeing chosen versions install across the whole supported range. 3. **Bumps all other CI workflows from 3.11 → 3.12** so they stay at or above the new floor (17 workflows). 4. **Regenerates `pdm.lock`** under Python 3.12 — this naturally drops pins of packages whose new versions require >3.11. Net: 1003 lines removed (no more 3.11 wheel entries). 5. **Updates docs**: `docs/developing.md` prereq, `docs/SQLCIPHER_INSTALL.md` Dockerfile snippet. ## Breaking change Users on Python 3.11 can no longer `pip install local-deep-research`. 
Python 3.11 users should upgrade to 3.12+ before taking future releases. ## Replaces Closes #3507 (the `pdm lock --refresh` band-aid). |
||
|
|
d18887df24 |
fix(auth): atomic post-login settings + regression test, supersedes #3487 (#3502)
* fix(auth): atomic settings reload + app.version update on login Previously, the post-login settings-version-mismatch path committed twice: once after load_from_defaults_file() wrote ~498 default setting rows, and again after update_db_version() wrote the app.version marker. app.version is NOT in default_settings.json — it is only ever written by update_db_version(). Any failure between the two commits (crash, lock timeout, engine dispose mid-transaction) left app.version unwritten, so db_version_matches_package() kept returning False and every subsequent login re-ran the 498-row bulk insert. This is the "sticky loop" that made container restarts ineffective for the reported login-hang-after-idle symptom. Changes: 1. SettingsManager.update_db_version now accepts commit=True (default, backward-compatible). Passing commit=False stages the version row in the session but does not commit, so the caller can combine it with other writes into one atomic transaction. 2. _perform_post_login_tasks step 1 now uses that flag to run load_from_defaults_file + update_db_version in a single session.commit() at the end. Either both persist or neither does — no more partial state. Test plan: - Existing test_update_db_version tests still pass (default commit=True preserves the old behaviour). - New test_update_db_version_commit_false verifies that passing commit=False stages the row but does not call session.commit(). Part of the login-hang series. Independent of the other PRs. * test(auth): lock in post-login atomicity + dispose-survival invariants Follow-on to the atomic settings reload in the previous commit. Three load-bearing properties are now guarded by regression tests and in-code invariants: 1. Mid-write failure rolls back to a clean pre-write state — the next login retries fresh instead of entering the sticky loop that PR #3487 tried to prevent with a speculative dispose skip guard. 2. Happy-path atomic block restores both defaults and `app.version` together. 3. 
`engine.dispose()` does NOT break a thread holding a checked-out connection — SA 2.0's documented contract (`QueuePool.dispose` drains only idle entries, `Engine.dispose` calls `pool.recreate()`). 20-iteration stress test against a real SQLCipher+WAL engine. Also: - Strengthened the comment on the post-login atomic block (`routes.py`) as an explicit ATOMICITY INVARIANT: splitting into two commits regresses to the sticky loop. - Documented the caller contract for `load_from_defaults_file` and `update_db_version` (`settings/manager.py`): pass `commit=False` and own the terminal commit yourself. - Rewrote the dispose-loop comment in `connection_cleanup.py` to record the SA 2.0 safety argument, so nobody re-adds a `checkedout() > 0` skip guard without a real reproducer (see PR #3487 discussion). - Added ADR-0004 addendum summarising the PR #3487 investigation and pointing at the regression guard. No change to `connection_cleanup.py` logic — dispose remains unconditional. Supersedes PR #3487. |
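The single-commit invariant from this change can be illustrated without the project's own classes. The sketch below swaps in the standard-library `sqlite3` module for the SQLAlchemy session, and `reload_settings` is a hypothetical stand-in for the post-login block; the table layout is an assumption made for the example.

```python
import sqlite3


def reload_settings(conn, defaults, version, fail_midway=False):
    """Stage the defaults and the version marker, then commit exactly once."""
    try:
        conn.executemany(
            "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
            defaults.items(),
        )
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
            ("app.version", version),
        )
        conn.commit()  # single terminal commit: both persist or neither does
    except Exception:
        conn.rollback()  # leave a clean pre-write state for the next attempt
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.commit()
```

With two separate commits, a failure at the simulated crash point would leave the defaults written but no `app.version` row, which is the partial state the commit message describes as the sticky loop.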
||
|
|
bc3680d21c |
docs: update pool-sizing comments, FD calculations, and create ADR-0004 (#3477)
Follow-up to the NullPool removal in
|
||
|
|
37a87297c3 |
docs: fix stale pool-size comments and NullPool references after #3441 (#3462)
PR #3441 removed per-thread NullPool engines and changed pool_size from 10→20 / max_overflow from 20→40, but several comments and docs still referenced the old values and removed infrastructure.
- Update pool_size/max_overflow numbers in encrypted_db.py comments
- Remove dead ADR-0004 path reference (file never existed)
- Remove redundant has_per_database_salt() warning that fired on every cache-hit call (open_user_database already covers cache miss)
- Fix NullPool reference in processor_v2.py comment
- Fix stale thread_engines metric doc in connection_cleanup.py
- Fix stale dead-thread engine sweep comment in connection_cleanup.py
- Update architecture.md flowchart, FD budget table math (21→41, 81→121), and key files table roles
- Update troubleshooting.md sweep description |
||
|
|
8e11dcf729 |
refactor(db): remove per-thread NullPool engines to fix FD leak (#3441)
Previously DatabaseManager kept a dedicated per-(username, thread_id) NullPool engine in `_thread_engines` for background-thread metric writes, alongside the per-user QueuePool engine in `connections`. Orphaned entries leaked SQLCipher+WAL file handles (3 FDs per active connection) when @thread_cleanup did not fire, eventually exhausting the 1024 FD soft limit and causing werkzeug's per-request selector to fail on every request.
Route metric writes through the shared per-user QueuePool engine, which is already created with check_same_thread=False and is safe to use from background threads. FD usage is now bounded by pool_size + max_overflow per user instead of scaling with background thread count.
Also:
- Bump pool_size=20, max_overflow=40, add pool_timeout=10 to absorb concurrent research + HTTP + metric writers against the shared pool.
- Add pool_checked_out observability to the periodic Resource monitor.
- Delete ~200 lines of thread-engine bookkeeping: cleanup_thread_engines, cleanup_dead_thread_engines, maybe_sweep_dead_engines, cleanup_all_thread_engines, _sweep_lock, _last_sweep_time, _thread_engine_lock, _thread_engines.
- Force QueuePool on the SQLCipher integration-test fixture so concurrent-write tests exercise real pooling (not StaticPool).
- Update docs/architecture.md and web/database/README.md.
Known follow-up: parallel_constrained_strategy.py uses max_workers=100 which could spike pool pressure under worst-case load; sessions are short-lived so sustained contention is unlikely, and pool_timeout=10 will surface it as errors rather than deadlock.
1996 passed, 8 skipped across tests/database and tests/web/auth. |
||
|
|
061cd83dd4 |
feat: add is_lexical flag to auto-enable LLM relevance filtering for keyword-based engines (#3403)
* feat: add needs_reranking flag to auto-enable LLM relevance filtering for keyword-based engines Engines with poor native relevance ranking (arXiv, PubMed, Wikipedia, GitHub, Mojeek, etc.) now auto-enable LLM-based result filtering via a new `needs_reranking` class attribute. This fixes the priority bug where the global `skip_relevance_filter=True` incorrectly overrode auto-detection for engines that genuinely need filtering. Priority is now: per-engine setting > needs_reranking > global skip. The global skip only affects unclassified engines. Closes #2297 * fix: address 7 code-review issues on needs_reranking branch 1. Rename needs_reranking → needs_llm_relevance_filter for consistency with enable_llm_relevance_filter and skip_relevance_filter naming 2. Fix Paperless dead code: replace non-existent _apply_content_filters with proper _filter_for_relevance() call in custom run() override 3. Fix misleading skip_relevance_filter description to accurately reflect checkbox behavior and keyword engine exceptions 4. Delete 4 vacuously-true inline tests that duplicated factory logic instead of calling the real factory (coverage tests already exist) 5. Add needs_llm_relevance_filter to EXTENDING.md and OVERVIEW.md 6. Clarify is_generic comment: generic does not imply good ranking 7. Upgrade no-LLM log from debug to warning when filtering was requested but no LLM is available (with should_filter guard) * fix: remove Paperless fallback that overrode valid empty LLM filter results Replace the fallback that restored all previews when the LLM filter returned empty with an info log. The base class _filter_for_relevance() already handles errors internally (returns previews[:5] on exception or JSON parse failure). An empty result means the LLM legitimately found nothing relevant — trust it, don't override it. * refactor: rename needs_llm_relevance_filter → is_lexical The flag describes what the engine IS (lexical/keyword-based search) rather than what it needs. 
This is a general classification that can drive multiple behaviors beyond just the relevance filter — e.g. query optimization strategies, result deduplication, or UI hints. Matches the existing is_* naming pattern (is_scientific, is_generic). * Revert "refactor: rename needs_llm_relevance_filter → is_lexical" This reverts commit |
||
|
|
83f632e069 |
fix: treat empty environment variables as unset to fix provider selection (#3362)
* fix: treat empty environment variables as unset to fix provider selection When deploying via Docker/Unraid templates, all environment variables are created even when left blank (e.g. LDR_LLM_ANTHROPIC_API_KEY=""). The check_env_setting() function previously treated these empty strings as valid overrides, which caused provider settings to be blanked out and prevented proper provider selection on fresh installs. Empty env vars are now treated as unset, allowing database defaults to take effect normally. Fixes #3339 * fix(tests): update test to match empty env var behavior Update test_env_override_empty_string to assert that empty environment variables are treated as unset (returning DB value) rather than overriding with empty string. This aligns with the fix for #3339. * docs: add ecosystem context for empty env var handling decision Document that treating empty environment variables as unset is standard practice across major projects (botocore, viper, Turborepo, Go stdlib, Docker Compose) with references to the PR discussion. * feat: add warning log for empty env vars, fix references, add tests and docs - Log warning when empty env vars are detected (helps users diagnose Unraid/Docker template issues) - Replace misleading viper/Docker Compose references with CPython official docs and Pallets/Click PR #2223 - Add unit tests: empty string returns None, warning is logged, provider/model/multiple keys handled - Add integration tests: empty string with no DB value, checkbox, number settings - Document empty env var behavior in unraid.md, docker-compose-guide.md, and env_configuration.md * docs: recommend DISABLED instead of Web UI for blocking settings Users can set env vars to a non-empty invalid value like "DISABLED" to explicitly block a key, which is simpler than navigating the UI. |
||
|
|
0ad4529b7e |
chore: auto-bump version to 1.5.6 (#3364)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
7131d82596 |
chore: auto-bump version to 1.5.3 (#3345)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
69bd0c67de |
chore: auto-bump version to 1.5.2 (#3333)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
b608714698 |
chore: auto-bump version to 1.5.1 (#3320)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
1cd3f18250 |
chore: auto-bump version to 1.5.0 (#3071)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
75467eee13 |
docs: ADR-0003 reject universal raise-without-from enforcement (#3266)
* docs: add ADR-0003 rejecting universal raise-without-from enforcement Document the decision to reject PR #3225's check-raise-without-from hook. Enforcing raise...from e everywhere conflicts with the codebase's existing PII protection strategy: check-sensitive-logging and fix-exception-logging hooks prevent exception chain leakage, and several files intentionally break chains when wrapping user-facing exceptions. * docs: add exception handling policy to pre-commit hooks directory Quick-reference guide for developers working near the hooks, covering when to use raise...from e, from None, or omit chaining entirely. Cross-references ADR-0003. * docs: add ADR-0003 references to exception logging hooks Add inline notes to check-sensitive-logging.py and fix-exception-logging.py explaining why raise...from e is not enforced universally — exception chains would re-expose the PII these hooks are designed to strip. * docs: remove redundant EXCEPTION_POLICY.md The inline docstring comments in check-sensitive-logging.py and fix-exception-logging.py plus ADR-0003 already cover this. No other hooks have companion markdown files in .pre-commit-hooks/. |
||
|
|
81a5498e77 |
docs: add ADR-0002 documenting pre-commit hook review decisions (#3251)
* docs: add ADR-0002 documenting pre-commit hook review decisions Document the batch review of 9 pre-commit hook PRs (#3218-#3231): 4 accepted, 5 rejected with specific technical rationale. Add rejected-hooks comment block to .pre-commit-config.yaml linking to the ADR, preventing re-proposals without new information. * docs: add Related PRs table to ADR-0002 for quick navigation * fix(docs): address review findings in ADR-0002 - Fix hook count: 43 (not 38), increases to 45 (not 41) - Remove redundant date from title - Move Related PRs to end of document (after Consequences) - Fix principles: narrow file-scoped and non-duplicative rules, reframe AI-aware as identifier false-positive concern, add self-contained principle - Make codespell#196 a proper hyperlink - Backtick-quote raise...from as code |
||
|
|
819fafe8c2 |
feat: add automatic database backup system (#3006)
* feat: add automatic database backup system with review fixes Adds encrypted database backups triggered on login, based on PR #2565 with critical fixes from code review applied. New backup module: - BackupService: encrypted backups via sqlcipher_export(), atomic rename, per-user locking, disk space validation, backup verification - BackupScheduler: singleton with ThreadPoolExecutor (max 2 workers), non-blocking background backup, atexit shutdown - Configurable via settings: backup.enabled, backup.max_count (3), backup.max_age_days (7) Review fixes applied (not in original PR): - Add PRAGMA busy_timeout = 10000 to prevent instant failure on concurrent writer lock contention - Use settings defaults (or 3/7) instead of raising ValueError when backup settings are missing (djpetti's review feedback) - Integrate into _perform_post_login_tasks background thread pattern - Add stale .tmp file cleanup in _cleanup_old_backups - Fix stat() TOCTOU in cleanup loop with FileNotFoundError handling - Enforce directory permissions with os.chmod after mkdir - Use safe_close() instead of bare .close() in finally blocks - Fix .gitignore to not ignore backup source code Includes 94 tests (4523 lines) and security documentation. * fix: update key derivation API and add crash recovery tests - Replace _get_key_from_password (private, old 1-arg API) with get_key_from_password (public, with db_path for per-DB salt) to match current main's key derivation interface - Add 3 end-to-end crash recovery tests using real SQLCipher: 1. Full round-trip: backup, delete original, open backup, verify all rows and integrity_check pass 2. Wrong password rejection: backup can't be decrypted with wrong key 3. 
Encryption verification: backup file has no plaintext SQLite header - Tests skip when SQLCipher is not installed (CI Docker image has it) * feat: purge old-key backups on password change + 9 new tests Security fix: after a password change, old backups remain encrypted with the old (potentially compromised) password. Per NIST SP 800-57, OWASP A02, and patterns from VeraCrypt/Bitwarden/Signal, old backups should be purged and replaced with a fresh backup using the new key. Changes: - Add BackupService.purge_and_refresh() method that deletes all existing backups and creates a fresh one with the current password - Integrate into change_password route (auth/routes.py) - Add empty-file check to _verify_backup (0-byte files were passing) - Add gitleaks allowlist entry for auth/routes.py New tests (9): - TestPasswordChangeBackupSecurity (3 real SQLCipher tests) - TestBackupCorruptionDetection (3 real SQLCipher tests) - TestBackupRetentionEnforcement (3 mocked tests) * test: rewrite crash recovery test with correct SQLCipher connection API Fixes from 6-agent verification round: - Use create_sqlcipher_connection() instead of manual connect+key+pragmas - Wrap wrong-password checks in pytest.raises around connection factory - Add @pytest.mark.timeout(120) for CI stability - Add encryption header check for fresh backup after purge_and_refresh - Use inline patches, fix docstring step count * test: add 15 more backup system tests New test classes: - TestBackupDiskSpaceAndAtomicity (3): missing source DB, atomic rename pattern, size_bytes accuracy - TestBackupFilePermissionsExtended (1): backup file 0o600 mode - TestPurgeAndRefreshEdgeCases (6): no existing backups, multiple old backups, .tmp cleanup, list ordering, get_latest edge cases - TestBackupServiceInitValidation (3): boundary values for max_backups, max_age_days, empty username * feat: reduce backup defaults and add pre-migration backup - Change max_backups default from 3 to 2 and max_age_days from 7 to 2 to reduce disk 
usage for databases with large PDF BLOBs while keeping a safety net against corruption overwriting the only backup. - Add synchronous pre-migration backup in open_user_database() that triggers before Alembic migrations run. Only fires when needs_migration() returns True (version upgrades), not on every login. Backup failure is logged as error but does not block migration. * fix: use get_setting default parameter for backup.enabled The expression `sm.get_setting("backup.enabled") or True` always evaluates to True (False or True == True), making it impossible for users to disable backups. Use the get_setting default parameter instead, which is the established pattern throughout the codebase. * fix: address review findings from 6-round 30-agent review Critical fixes: - Fix _verify_backup() salt mismatch: pass db_path=self.db_path to set_sqlcipher_key so backup verification uses the correct per-database salt instead of the legacy salt. Without this, all v2 database backups fail verification and are silently deleted. - Fix purge_and_refresh() race condition: hold per-user lock for the entire purge+create operation to prevent a concurrent backup from writing an old-key backup between purge and fresh backup creation. - Fix DETACH not in finally: wrap DETACH DATABASE in its own finally block so the attached backup file is always released even if sqlcipher_export() raises. Remove no-op conn.commit() after DETACH. Important fixes: - Fix _cleanup_old_backups/list_backups/get_latest_backup TOCTOU: use safe_mtime helper that catches FileNotFoundError in sort key lambda. - Fix list_backups timezone: use tz=UTC consistent with codebase. - Fix get_backup_scheduler() thread safety: remove redundant module- level singleton; rely on thread-safe __new__. - Fix docs: replace VACUUM INTO with sqlcipher_export() throughout. - Fix test_no_raw_sql.py: add backup_service.py to skip list. - Fix test readonly dir: skip when running as root in Docker. 
* fix: address djpetti review + add 6 high-value backup tests Review feedback (djpetti): - Restore max_age_days default to 7 (2 days was too aggressive — a weekend gap would delete all backups) - Replace `or 2`/`or 7` fallbacks with `get_setting(key, default)` which is the established codebase pattern (30+ uses) - Keep max_backups=2 for disk space savings New integration tests (real SQLCipher, in test_backup_crash_recovery_ci.py): - test_backup_preserves_all_schema_objects: compare sqlite_master - test_backup_passes_foreign_key_check: PRAGMA foreign_key_check - test_restored_backup_accepts_new_writes: INSERT/UPDATE + durability New unit tests (mocked, in test_backup_service.py): - test_backup_created_when_migration_needed - test_no_backup_when_no_migration_needed - test_migration_proceeds_when_backup_raises * feat: limit backups to one per calendar day to prevent corruption propagation A corrupted database that overwrites all backups via rapid login cycles is the primary risk for a 2-backup rotation. Now create_backup() skips if a backup with today's date prefix already exists in the backup dir. Exceptions that always create a backup regardless: - Pre-migration backups (force=True) — schema changes are the highest risk moment and must always have a safety net - purge_and_refresh() on password change — calls _create_backup_impl() directly, bypassing the daily check (security requirement) * fix: sort daily backup glob + wrap DETACH in try/except - Use max(existing_today, key=lambda p: p.name) instead of existing_today[0] for the daily backup limit check, since glob() returns results in arbitrary filesystem order. - Wrap DETACH DATABASE in try/except inside the finally block to prevent masking the original sqlcipher_export exception if DETACH also fails. 
* fix: check purge_and_refresh result instead of logging unconditional success
The return value of svc.purge_and_refresh() was discarded, so a failed
fresh backup after password change logged "Backups refreshed" falsely.
Now checks result.success and logs error if backup creation failed,
making it visible that the user has zero backups after purge.
* test: add daily backup limit tests + add missing warning log
New tests (TestDailyBackupLimit, 3 tests):
- test_skips_when_backup_exists_for_today: verify create_backup skips
when a backup with today's date already exists
- test_force_bypasses_daily_limit: verify force=True enters
_create_backup_impl even when today's backup exists
- test_proceeds_normally_for_different_day: verify yesterday's backup
doesn't trigger the daily skip
Also: add logger.warning for failed .tmp file deletion in
purge_and_refresh (was silently swallowed with bare except pass).
* docs: add disk space warning and disable instructions to backup settings
Update backup.enabled description to mention disk usage and how to
disable. Update docs with clearer disk space guidance noting that
backups can be disabled via settings if space is limited.
* fix: reduce default max_backups from 2 to 1
Encrypted backups cannot be compressed (AES-256 has maximum entropy),
so each backup equals the full database size. With large databases
containing PDFs (100s of MB), keeping 2 backups doubles disk usage.
The daily backup limit already prevents the corruption-overwrite
scenario that was the original justification for 2 backups. Users who
want extra safety can increase max_backups in settings.
* feat: add backup status warnings to research page
Add two dismissable warnings to the existing warning system:
- "Database Backups Disabled" when backup.enabled is False
- "No Backups Found" when enabled but none exist yet
Uses the existing warning_checks infrastructure (yellow alert boxes on
the research page). Backup check uses a lightweight filesystem glob; no
password or encryption needed. Removes flash-based approach from login
(research page doesn't render flash messages).
|
||
|
|
0ea808fb04 |
chore: auto-bump version to 1.4.0 (#2714)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
86898fb071 |
feat: give MCP agent control over sub-research iterations and search engine (#3067)
* feat: give MCP agent control over sub-research iterations and search engine
The MCP agent can now configure sub-research tools with optional parameters:
- **iterations** (1-15): Control how many search rounds the sub-strategy
runs. Use fewer (2-3) for quick checks, more (10-15) for exhaustive
research. Previously hardcoded to 8 for focused-iteration.
- **search_engine**: Override which search engine the sub-research uses.
For example, the agent can now run `focused_research` with
`search_engine: "pubmed"` to do deep iterative research specifically
against PubMed, or `"arxiv"` for scientific papers. Previously,
sub-research always used the primary configured engine (usually SearXNG),
losing access to specialized databases.
This lets the agent make smarter delegation decisions, e.g.:
ACTION: focused_research
ARGUMENTS: {"query": "mRNA cancer vaccine clinical trials",
"iterations": 5, "search_engine": "pubmed"}
The system prompt is updated to teach the agent about these options.
Override engines are properly cleaned up after use.
* refine: prefer more iterations over breadth, raise limit to 25
Updated system prompt and tool descriptions to guide the agent toward
using more iterations (10-20+) rather than broader queries. Raised the
iteration clamp from 15 to 25 to support exhaustive deep dives.
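A clamp of this kind might look like the sketch below. The constant and function names are hypothetical, and falling back to the default of 8 on missing or invalid input is an assumption for illustration; only the 1-25 range and the former hardcoded value of 8 come from the messages above.

```python
MIN_ITERATIONS = 1
MAX_ITERATIONS = 25  # raised from 15 in the commit above
DEFAULT_ITERATIONS = 8  # the previously hardcoded focused-iteration value


def clamp_iterations(requested):
    """Coerce an agent-supplied iterations argument into the allowed range.

    Assumption: missing or non-numeric input falls back to the default.
    """
    if requested is None:
        return DEFAULT_ITERATIONS
    try:
        value = int(requested)
    except (TypeError, ValueError, OverflowError):
        return DEFAULT_ITERATIONS
    return max(MIN_ITERATIONS, min(MAX_ITERATIONS, value))
```

Clamping rather than rejecting out-of-range values keeps a tool call usable even when the agent asks for more rounds than the limit allows.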
* docs: note web_search gives precise control over search queries
Highlight in the system prompt that web_search is also useful for
crafting specific search queries with exact phrases, date ranges, or
site filters — giving the agent more direct control over what reaches
the search engine.
* refactor: simplify MCP tools — web_search default, only focused_research for deep work
Removed source_research and quick_research tools from the MCP agent.
These were legacy strategies that added confusion without clear value:
- source_research (source-based) overlapped with focused_research
- quick_research (standard) was an old strategy with no clear advantage
The agent now has a cleaner toolset:
- web_search: primary tool for most queries, gives precise control
- focused_research: deep iterative research with configurable iterations
and search_engine override (e.g. pubmed, arxiv)
- search_[engine]: quick single queries against specific databases
- download_content: fetch full article text
System prompt rewritten to emphasize web_search as the default action
and focused_research for complex topics needing depth.
* prompt: recommend specialized engines for domain-specific questions
Added explicit guidance in the system prompt mapping question domains
to the best engine: medical → pubmed, scientific → arxiv, background →
wikipedia, news → wikinews. This helps the agent pick the right tool
instead of defaulting to web_search for everything.
* fix: increase observation limits 10x so agent sees full sub-research output
OBSERVATION_MAX_LENGTH: 5000 → 50000
HISTORY_OBSERVATION_MAX_LENGTH: 1000 → 10000
The previous limits severely truncated focused_research output — the
agent would run 8+ iterations of deep PubMed research but only see
1000 chars of the result when synthesizing the final answer. Now the
agent retains much more context from sub-research for better synthesis.
* feat: enable semantic_scholar and openalex for auto search by default
Added use_in_auto_search=true to both settings files so the MCP agent
can use them as specialized search tools. Updated system prompt to
mention them alongside arxiv and pubmed.
* fix(test): update tool count after removing source_research and quick_research
* prompt: guide agent to do quick searches first before focused_research
* prompt: recommend search_arxiv as quick first step for scientific topics
* fix: address all 8 review rounds — prompt, tests, docs, limits, progress
- Rewrote system prompt: one clear decision table, no contradictions
- Fixed docstring (1-15 → 1-25), progress message shows actual engine
- Added comments explaining MCP is for large-context LLMs (32k+)
- Added 10 tests for iterations/search_engine/clamping/cleanup
- Updated docs/mcp-server.md for removed tools
|
||
|
|
40b885e2f2 |
feat: add semantic search over research history (#1475)
* feat: add semantic search over research history
Add a Research History collection that indexes all research reports
and sources for semantic search:
- Add ResearchDocumentLink model to track research → document mappings
- Add research_report and research_source source types
- Create ResearchHistoryIndexer service for on-demand indexing
- Add API endpoints for collection status, indexing, and search
- Add semantic search UI panel on history page with progress tracking
- Add "Save to Collection" feature on results page
The indexer creates Document entries from research reports (markdown)
and sources with content, links them to the Research History collection,
and triggers FAISS indexing for semantic search.
Users can also save individual research to custom collections via
the new "Save to Collection" button on the results page.
* fix: add research_document_links to expected tables
Document new table added for semantic search over research history.
* fix: address review findings for research history semantic search
- Fix exception detail leaks in 5 error handlers (CodeQL flagged)
- Fix N+1 query in search (60 queries → 1 batch join query)
- Fix type matching bug: use ResearchDocumentLink.link_type instead
of broken UUID string matching on source_type_id
- Fix memory issue: use count() + yield_per(50) instead of .all()
- Extract get_rag_service() to rag_service_factory.py to fix circular
import (service → routes)
- Add LinkType enum for ResearchDocumentLink.link_type column
- Replace magic strings with constants in indexer
- Wrap history_search.js in IIFE, use shared window.escapeHtml
- Remove duplicate escapeHtml and unused isSafeUrl from save_to_collection.js
- Add 9 tests (5 indexer service + 4 route tests)
* fix: address pre-commit failures and review findings for semantic search
Blocking fixes:
- Remove unused `to_bool` import (pre-commit failure)
- Remove redundant `{e}` from logger.exception() calls (pre-commit)
- Add CSRF token to POST fetch in history_search.js
- Escape API error messages in innerHTML (XSS)
- Guard division by zero in streaming progress when total==0
- Fix getResearchIdFromUrl() to use URLBuilder.extractResearchIdFromPattern
instead of URLSearchParams (route is /results/<uuid>, not ?id=)
- Fix JS field name mismatch: documents_created→documents_added,
documents_linked→sources_indexed to match API response
Important fixes:
- Use LinkType.SOURCE enum instead of string literal "source"
- Use standard CSRF pattern (window.api.getCsrfToken) in save_to_collection
- Add warning log in streaming method for failed indexing (matches non-streaming)
- Remove dead `seen_research_ids` set (populated but never used for filtering)
- Fix similarity score formula: use distance directly for IndexFlatIP (cosine),
1/(1+distance) only for L2
* fix: address code review findings for research history semantic search
Bug fixes:
- Use LinkType.SOURCE enum instead of "source" string literal (indexer:552)
- Use LinkType.REPORT enum instead of "report" string literal (rag_routes)
- Strip exception text from error messages to prevent info leakage (indexer:262)
- Fix distance-to-similarity formula for cosine/IP metric (rag_routes)
- Pass db_password through rag_service_factory to LibraryRAGService
- Add session password retrieval to rag_routes get_rag_service wrapper
Security (XSS):
- Escape data.error in innerHTML (history_search.js:236)
- Replace onclick injection with addEventListener pattern (save_to_collection.js)
- Remove window.saveToCollection global exposure
Hygiene:
- Export ResearchDocumentLink and LinkType from models/__init__.py
- Validate limit param as int with bounds [1, 50] (rag_routes)
- Remove duplicate sqlcipher_utils.py entry in .gitleaks.toml
- Remove dead code: seen_research_ids set (populated but never read)
- Fix redundant exception vars in logger.exception() calls
* fix: address round 2 code review findings for semantic search
- Add None guard for chunk_text in search results (crash fix)
- Remove redundant rag_service.db_password override in index_collection
- Add @require_json_body to 3 new POST endpoints (CSRF mitigation)
- Guard research.query against None/empty at all title derivation sites
- Export RagDocumentStatus from database models __init__.py
* fix: address review findings rounds 7-9 for semantic search
- Fix URL pattern: use URLBuilder.resultsPage() instead of query param (404 fix)
- Fix XSS: replace onclick inline handler with event delegation in save_to_collection
- Fix similarity scoring: use (distance+1)/2 mapping for cosine [-1,1] to [0,1]
- Escape dateStr fallback in history_search for defense-in-depth
- Use LinkType.REPORT enum instead of string literal
- Add @require_json_body to add_research_to_collection and search_research_history
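The two score mappings named in these review rounds can be written out as a small sketch. The function names are hypothetical and the clipping of out-of-range inputs is an added assumption; only the formulas `(distance+1)/2` and `1/(1+distance)` come from the messages above.

```python
def cosine_to_unit_interval(score: float) -> float:
    """Map a cosine / inner-product score in [-1, 1] onto [0, 1]."""
    # Clip first so floating-point drift cannot leave the target range.
    clipped = max(-1.0, min(1.0, score))
    return (clipped + 1.0) / 2.0


def l2_to_similarity(distance: float) -> float:
    """Map a non-negative L2 distance onto (0, 1]; closer scores higher."""
    return 1.0 / (1.0 + max(0.0, distance))
```

The two metrics need different mappings because a larger inner product means a closer match, while a larger L2 distance means a worse one.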
* fix: address round 3 code review findings for semantic search
- Add read-only get_collection() to ResearchHistoryIndexer so search
route doesn't create collections/indexes as a side effect
- Fix force=True being silently ignored when docs already in collection
by populating doc_ids_to_index from existing_doc_ids
- Fix escapeHtml fallback treating 0/false as empty string (match
xss-protection.js null/undefined check + String() coercion)
- Escape research_created_at in catch branch of date formatting
* fix: address review findings rounds 10-11 for semantic search
- Remove @require_json_body from index_single_research (body is optional)
- Remove duplicate LinkType import in search_research_history
- Add query length limit (10000 chars) before embedding call
* fix: address 9 blocking review findings for semantic search PR
Database:
- Revert Document.resource_id FK from CASCADE back to SET NULL (prevents
destructive deletion of documents when a ResearchResource is removed)
- Restore foreign_keys="[Document.resource_id]" on Document.resource
relationship (required for SQLAlchemy disambiguation with bidirectional FKs)
Backend (research_history_indexer.py):
- Replace yield_per(50) with materialized ID queries in both
index_all_research() and index_all_research_streaming() to avoid
SQLite/SQLCipher locking from nested session writes on an active cursor
- Propagate force parameter to _index_documents_to_rag() so force=True
actually triggers re-indexing in FAISS
- Replace deprecated session.query(Model).get() with session.get()
(required for SQLAlchemy 2.0 compatibility)
- Check for existing Document by hash before inserting to avoid
IntegrityError on unique document_hash constraint (reuses document
when same content is re-indexed)
- Use len(content.encode("utf-8")) for accurate byte-size file_size
Frontend:
- Escape result.similarity and result.research_id in innerHTML to
prevent XSS in history_search.js
- Add response.ok checks before response.json() on all 4 fetch calls
across history_search.js and save_to_collection.js
* fix: address round 2 review findings for semantic search PR
Critical:
- Normalize query vector (L2) before FAISS search to match indexed
vectors — langchain's FAISS wrapper normalizes during indexing but
the route bypassed the wrapper, producing wrong similarity rankings
High:
- Guard against IntegrityError when hash-reused Document already has a
ResearchDocumentLink (unique=True on document_id) or DocumentCollection
entry — check for existing rows before inserting
- Remove user-supplied collection_id from error message to prevent
information disclosure
Medium:
- Add ResearchDocumentLink cleanup to delete_document_completely() since
SQLite FK cascades are not enforced (no PRAGMA foreign_keys = ON)
- Move get_source_type_id() calls inside 'if document is None' block to
avoid unnecessary DB queries when reusing existing documents
- Use explicit content.encode("utf-8") consistently for hash computation
Low:
- Escape dateStr in innerHTML (history_search.js)
- Escape data.documents_added/sources_indexed in innerHTML
(save_to_collection.js)
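The normalization fix under "Critical" can be illustrated with plain Python lists. This is a sketch of the general technique (L2 normalization so that an inner product equals cosine similarity), not the route's actual code; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, the score an inner-product index computes."""
    return sum(x * y for x, y in zip(a, b))


# An indexed vector that is already unit length, and a raw query.
indexed = [1.0, 0.0]
raw_query = [3.0, 4.0]

raw_score = inner_product(raw_query, indexed)  # outside [-1, 1]
cosine_score = inner_product(l2_normalize(raw_query), indexed)
```

When the stored vectors are unit length but the query is not, the raw score is scaled by the query's norm, so it no longer reads as a cosine similarity.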
* fix: reuse similarity_search_with_score instead of manual FAISS search
Replace ~80 lines of manual FAISS search (embed_query, L2 normalization,
index.search, docstore ID mapping, custom similarity formula) with
langchain's similarity_search_with_score(), matching the pattern already
used in search_engine_collection.py. Also:
- Remove dead `or {}` after @require_json_body decorator
- Fix DOMContentLoaded race in history_search.js (readyState check)
- Remove duplicate urls.js load in history.html (already in base.html)
- Add 3 tests: hash collision reuse, force=True propagation, index_all
* refactor: replace custom search route with generic collection search endpoint
Replace the 170-line search_research_history() route with a generic
search_collection() endpoint that delegates to CollectionSearchEngine.
This reuses the same search code the research pipeline uses instead of
reimplementing FAISS search in route code.
- POST /api/collections/<id>/search replaces POST /api/research-history/search
- Research metadata enrichment extracted to _enrich_with_research_metadata()
- UI: unified search bar with text/semantic mode toggle (brain icon)
- Semantic search panel slimmed to indexing controls only
- history_search.js exports semanticSearchHistory() instead of managing its own UI
* fix: address 4 integration review findings for collection search refactor
- F1: Add .ldr-btn-outline.active CSS rule so semantic toggle shows visual feedback
- F2: Fall back to source_id when document_id missing in collection search metadata
- F3: Guard window.semanticSearchHistory call with helpful loading message
- F4: Return needsIndexing sentinel when no collection indexed, show guidance UX
* refactor: remove ResearchDocumentLink model — use existing Document columns
Document already has research_id, resource_id, and source_type_id columns
that fully track which research produced which document and whether it's
a report or source. ResearchDocumentLink was a redundant junction table
duplicating these relationships.
- Remove ResearchDocumentLink model and LinkType enum from library.py
- Remove from model exports, cascade_helper, and schema stability test
- Rewrite indexer to query Document.research_id/resource_id directly
- Rewrite _enrich_with_research_metadata to join Document→SourceType
- Extract _ensure_in_collection helper to reduce duplication
- Update tests to assert on Document fields instead of link table
* docs: clarify hash-dedup constraint in _create_document_from_report
document_hash has unique=True, so identical content must share a
Document row. Add comment explaining research_id points to the
first creator in the hash-collision case.
* refactor: restore research_report/research_source as source type categories
These are seed data rows in the existing source_types table, not schema
changes. They give research reports and sources their own category via
Document.source_type_id, which is required (nullable=False).
* feat: hybrid search mode as default on history page
Replace the toggle button with a Bootstrap dropdown showing three modes:
- Hybrid (default): instant text filter + semantic results appended below
- Text Only: title/query filter, no API call
- AI Only: semantic search only
Hybrid mode shows text matches immediately, then appends deduplicated
semantic results after a 500ms debounce. Race conditions are guarded
by a hybridSearchId counter. Not-indexed state silently falls back to
text-only with no error.
* feat: tiered ranked results for hybrid search mode
Replace the two-section layout (text above, semantic below) with a
three-tier ranked list:
- Tier 1: items matching both title and content, sorted by similarity
- Tier 2: text-only matches in recency order
- Tier 3: semantic-only matches below a divider, sorted by similarity
Tier 1 cards show an AI match badge with similarity % and a 2-line
snippet preview. Tier 3 items attempt to find the full history record
for full action buttons, falling back to a simplified View-only card.
Remove renderHybridSemanticSection from history_search.js (replaced by
buildTieredResults + renderMergedResults in history.js). Split debounce
into separate input and semantic timers.
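The three-tier ordering described above could be expressed as the following sketch. The real `buildTieredResults` lives in the frontend JavaScript; this Python version, its data shapes (`id` keys, a score dictionary), and its return type are assumptions chosen only to show the merge rule.

```python
from typing import Dict, List, Tuple


def build_tiered_results(
    text_matches: List[dict], semantic_scores: Dict[str, float]
) -> Tuple[List[dict], List[dict], List[Tuple[str, float]]]:
    """Split results into the three tiers described in the commit message.

    text_matches: history items already in recency order, each with an "id".
    semantic_scores: research id -> similarity from the semantic search.
    """
    text_ids = {item["id"] for item in text_matches}

    # Tier 1: matched by both text and semantics, best similarity first.
    tier1 = sorted(
        (item for item in text_matches if item["id"] in semantic_scores),
        key=lambda item: semantic_scores[item["id"]],
        reverse=True,
    )
    # Tier 2: text-only matches keep their original recency order.
    tier2 = [item for item in text_matches if item["id"] not in semantic_scores]
    # Tier 3: semantic-only matches, best similarity first.
    tier3 = sorted(
        ((rid, score) for rid, score in semantic_scores.items() if rid not in text_ids),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return tier1, tier2, tier3
```

Because each id lands in exactly one tier, the merged list is deduplicated by construction.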
* feat: render markdown in snippet previews via marked + DOMPurify
Snippets from semantic search often contain markdown (bold, code,
emphasis). Use marked.parseInline() + DOMPurify.sanitize() to render
them as rich inline HTML instead of escaped plaintext. Falls back to
escapeHtml when libraries aren't loaded.
* refactor: extract semantic search into research_library/search/ subpackage
Backend:
- Create research_library/search/ with routes/ and services/ subdirs
- Move research_history_indexer.py to search/services/ (re-export stub
at old location for backward compatibility)
- Extract 6 research history + collection search routes from rag_routes.py
(3169→2795 lines) into search/routes/search_routes.py with own blueprint
- Register search_bp in app_factory.py
Frontend:
- Create shared semantic_search.js exposing window.SemanticSearch with:
renderSnippet, buildTieredResults, createSemanticResultCard, isSafeExternalUrl
- Create semantic-search.css with all semantic search styles (moved from
history-icons.css + consolidated inline styles)
- history.js and history_search.js now use shared module instead of
private implementations
Tests:
- Move test files to tests/research_library/search/ parallel structure
- Fix mock paths for rag_service_factory refactor across 3 test files
* fix: guard against non-array API responses and mode-change race in hybrid search
Three bugs found during 8-round architectural review:
1. (CRITICAL) semanticResults = results || [] passes non-array objects
(like error responses) to buildTieredResults, causing TypeError.
Fix: use Array.isArray() check.
2. (WARNING) Switching search mode while semantic results are in-flight
causes hybrid results to overwrite text-only render. Fix: check
searchMode === 'hybrid' in the async callback.
3. (WARNING) Semantic results with null research_id produce String(null)
= "null", collapsing to one map entry and creating broken navigation
links. Fix: skip results with falsy research_id.
* refactor: extract search mode strings into shared LDR_CONSTANTS
Add js/config/constants.js loaded globally via base.html, following
the existing urls.js pattern. Replace magic strings 'hybrid', 'text',
'semantic' in history.js with LDR_CONSTANTS.SEARCH_MODE constants.
* refactor: rename Research History collection display name to History
Users see "History" in the menu, so the collection should match.
Add RESEARCH_HISTORY_COLLECTION_NAME and description as constants
in constants.py rather than hardcoding strings.
* docs: add semantic search architecture guide with mermaid diagrams
Add docs/architecture/SEMANTIC_SEARCH.md covering:
- Indexing pipeline (ResearchHistory → Documents → FAISS)
- Search pipeline (Hybrid/Text/AI-Only modes)
- Three-tier merge algorithm
- File structure (backend + frontend)
- API routes reference
- Reusing on other pages guide
Also: populate search/services/__init__.py with ResearchHistoryIndexer
export, and add cross-references in OVERVIEW.md and DATABASE_SCHEMA.md.
* tests: add route tests for 3 untested research-history endpoints
- TestGetResearchHistoryCollectionRoute: happy path (200 + fields) and
exception (500) for GET /library/api/research-history/collection
- TestIndexResearchHistorySSERoute: happy path SSE (200 text/event-stream,
correct data: lines) for GET /library/api/research-history/index
- TestIndexSingleResearchRoute: happy path (200), error status (400), and
exception (500) for POST /library/api/research-history/index/<id>
- Tighten test_index_research_creates_documents assertion from >= 1 to == 2
(1 report + 1 source) in test_research_history_indexer.py
* fix: resolve critical bugs and pattern violations in semantic search feature
- Fix CSS never loading: change {% block styles %} to {% block extra_head %}
- Fix undefined var(--primary): replace with var(--primary-color) in 6 locations
- Fix LDR_CONSTANTS block-scoped: use window.LDR_CONSTANTS for global access
- Fix broken /library/api/collections/list URL: use correct /library/api/collections
- Centralize 5 hardcoded API URLs into URLS.LIBRARY_API constants
- Replace 6 console.error calls with SafeLogger.error
- Replace hardcoded "completed" strings with ResearchStatus.COMPLETED enum
- Add threading.Lock concurrency guard inside SSE generate() for bulk indexing
- Eliminate nested get_source_type_id sessions: inline SourceType queries (4 calls)
- Add RAG indexing failure logging and rag_warning in result dict
- Add ARIA accessibility: role=button, tabindex, aria-expanded, keyboard handler
- Replace inline onclick handlers with addEventListener
- Remove dead try/catch around EventSource constructor
- Add beforeunload cleanup for active EventSource
- Apply ruff format to llm_utils.py and settings_routes.py (pre-existing CI fix)
* fix: correct test patch targets after indexer refactoring
- Route tests: patch ResearchHistoryIndexer at definition module
(lazy imports inside function bodies aren't patchable at the route module)
- Service tests: remove all get_source_type_id patches (function was replaced
with inline SourceType queries; DB fixtures already seed the source types)
* fix: SSE lock timeout, extract inline styles, correct DATABASE_SCHEMA docs
- Add 10-minute wall-clock timeout to SSE bulk indexing to prevent DoS
via indefinite lock hold (checked each iteration in generate())
- Extract structural inline styles from history.html into CSS classes:
ldr-semantic-panel-header, ldr-indexing-status, ldr-progress-track,
ldr-progress-bar, ldr-search-row
- Fix DATABASE_SCHEMA.md: SourceType is a normalized table (not an enum)
with values: research_download, user_upload, manual_entry,
research_report, research_source
* fix: resolve test failures and edge cases from review
- Fix test_library_init.py: update source type count from 3 to 5, mock
ensure_research_history_collection in all initialize_library_for_user tests
- Initialize source_count = 0 before branch to prevent UnboundLocalError
- Filter out empty-string report_content (report_content != "") to prevent
entries from being stuck as permanently pending
- Sanitize SSE data.percent with Number() clamped to [0,100] to prevent
CSS injection via style.width
* fix: address review findings — generic errors, RAG cleanup, tests
- Remove research_id from error message (match codebase pattern)
- Wrap get_rag_service() in context manager to release resources
- Clarify force parameter docstring behavior
- Document threading.Lock process-local limitation
- Add 4 tests: SSE lock contention, add-to-collection success/404,
RAG service cleanup verification
* fix: SQLite 999 var limit, partial RAG recovery, source_type_id warnings
- Fix #3: Replace materialized set + notin_() with subquery in
get_indexing_status() to avoid SQLite SQLITE_MAX_VARIABLE_NUMBER crash
when >999 research entries are indexed; use .count() instead of len(set)
- Fix #7: In index_research() existing_docs branch, query full
DocumentCollection objects to inspect the indexed flag; queue docs
that are in the collection but have indexed=False for re-indexing;
update early-return guard to also check doc_ids_to_index
- Fix #14: Add logger.warning() in _create_document_from_report() and
_create_document_from_source() when the SourceType row is not found
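The subquery approach from Fix #3 can be demonstrated with the standard-library `sqlite3` module. The table and column names below are hypothetical stand-ins, not the project's schema; the point is that a subquery binds no parameters, so the query does not grow with the number of indexed rows.

```python
import sqlite3


def build_demo_db(total: int, indexed: int) -> sqlite3.Connection:
    """Create an in-memory database with hypothetical demo tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE research (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, research_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO research (id) VALUES (?)",
        ((i,) for i in range(1, total + 1)),
    )
    conn.executemany(
        "INSERT INTO documents (research_id) VALUES (?)",
        ((i,) for i in range(1, indexed + 1)),
    )
    conn.commit()
    return conn


def count_unindexed(conn: sqlite3.Connection) -> int:
    """Count research rows without a document, using a subquery.

    No ids are passed as bound parameters, so the statement stays the
    same size however many rows are already indexed.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM research "
        "WHERE id NOT IN ("
        "  SELECT research_id FROM documents WHERE research_id IS NOT NULL"
        ")"
    ).fetchone()
    return row[0]
```

For example, `count_unindexed(build_demo_db(total=1500, indexed=1200))` runs one fixed-size statement, whereas materializing 1200 ids into a `NOT IN (?, ?, ...)` list would exceed a 999-variable limit.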
* fix: per-user lock, SSE safety, race guards, data contracts, and UX bugs
- Per-user index lock instead of global lock (#1)
- Use response.call_on_close for lock release instead of generator finally (#2)
- Add semantic search race guard with semanticSearchId counter (#11)
- Remove hybrid loading indicator on mode-changed early return (#5)
- Close EventSource before overwrite in triggerIndexing (#6)
- Always set research metadata fields in _enrich_with_research_metadata (#10)
- Wrap button text in <span> for mobile CSS (#12)
- Guard SafeLogger usage in constants.js (#16)
- Update tests for per-user lock and partial RAG indexing recovery
* fix: wrap DB session in try/except, guard renderSemanticResults, fix modal leak and transition
- Move try/except to wrap get_user_db_session in add_research_to_collection
so DB errors return JSON instead of raw 500 HTML
- Add typeof guard for window.renderSemanticResults alongside existing
semanticSearchHistory check in semantic search mode
- Use bootstrap.Modal.getOrCreateInstance instead of new bootstrap.Modal
to prevent duplicate instance creation on repeated clicks
- Use double-rAF for settling transition so browser paints the 0.6 opacity
frame before removing the class
* fix: prevent IntegrityError on missing SourceType, fix Library button class, reset progress color
- Return None from _create_document_from_report/_create_document_from_source
when SourceType rows are missing instead of proceeding with
source_type_id=None which violates the NOT NULL constraint
- Fix Library button CSS class: btn-outline → ldr-btn-outline to match
all other action buttons
- Reset progressText.style.color on new indexing attempt so error red
doesn't persist into subsequent runs
* test: add partial RAG retry and ensure_research_history_collection tests
- TestIndexResearch: add test_index_research_retries_unindexed_documents
verifying that a second index_research call with index_to_rag=True
returns "success" (not "skipped") and calls _index_documents_to_rag
when DocumentCollection.indexed is False after the first pass
- TestEnsureResearchHistoryCollection: new class with three tests
covering create-when-missing, return-existing-id, and exception
re-raise paths of ensure_research_history_collection
* test: add 404 and enrichment-default-fields cases to TestSearchCollectionRoute
- test_collection_not_found_404: mocks get_user_db_session as a context
manager whose filter_by().first() returns None, asserts 404 with
success=False and "not found" in error
- test_enrich_default_fields_when_document_not_in_db: mocks two successive
get_user_db_session calls (collection lookup + enrichment join) and
CollectionSearchEngine; when the join returns no rows the enrichment
branch falls into the else path and sets type='source', research_id=None,
research_title='', etc.
* refactor: use handle_api_error, namespace HistorySearch globals, URL and bootstrap guards
- Replace 4 inline try/except error handlers with handle_api_error()
from research_library.utils for consistent error response format
- Namespace 5 window.* globals from history_search.js under
window.HistorySearch to avoid polluting the global scope
- Replace hardcoded /api/delete/ URL with URLBuilder.deleteResearch()
- Add typeof bootstrap guard in save_to_collection.js modal creation
* docs: expand Research History collection description
Clarify that indexing enables AI-powered semantic search and that
the collection is used by the History page search in AI/Hybrid mode.
* fix: remove hardcoded setting fallbacks from rag_service_factory
- Remove inline default values from get_setting() calls — the settings
system loads defaults from JSON config files automatically
- Replace silent fallback on invalid JSON text_separators with ValueError
- Replace `or` fallbacks on collection fields with proper `is not None`
checks to avoid swallowing legitimate 0 or empty values
- Fix test mock to use properly escaped JSON string for text_separators
- Update test_invalid_json_text_separators to expect ValueError
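The difference between an `or` fallback and an `is not None` check is easy to see in isolation. The helper names below are hypothetical and exist only to contrast the two patterns.

```python
def with_or_fallback(value, default):
    """Falls back whenever value is falsy: None, 0, "", False, []."""
    return value or default


def with_none_check(value, default):
    """Falls back only when value is actually missing (None)."""
    return value if value is not None else default


# A legitimately configured 0 is replaced by the `or` pattern...
swallowed = with_or_fallback(0, 500)
# ...but preserved by the explicit None check.
preserved = with_none_check(0, 500)
```

The same reasoning applies to boolean settings: `value or True` can never yield `False`, which is why an explicit default parameter or a `None` check is the safer pattern.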
* fix: use DocumentCollection join for research history counts
Replace get_indexing_status() call in get_research_history_collection
route with inline DB queries that count via DocumentCollection join,
matching how the collection page counts. This fixes the mismatch where
the History page showed "1/24 indexed" while the collection page showed
"25 indexed" — the old logic counted by source_type_id which missed
documents added through the collection page directly.
* feat: add convert_all_research() and POST /convert-all route
Adds ResearchHistoryIndexer.convert_all_research(force) which converts
all completed research entries into library Documents within a single DB
session, avoiding the nested-session issues on SQLite that arise when
calling index_research() (which opens its own session) in a loop.
Also adds POST /library/api/research-history/convert-all that delegates
to the new method, accepting an optional `force` JSON field.
Tests cover happy path, already-converted skipping, force re-conversion,
missing SourceType early-return, and source document creation.
* feat: auto-convert research to documents on completion
Thread user_password through cleanup_research_resources →
notify_research_completed so the auto-conversion hook can open
the user's encrypted database.
After research completes, automatically create Document rows in
the History collection (index_to_rag=False — documents only, no
FAISS indexing). Users trigger FAISS via "Index All" on the History
page or the collection page's index button.
* refactor: remove dead code from ResearchHistoryIndexer
Delete get_indexing_status() (replaced by inline queries in the route),
delete index_all_research() (non-streaming, never called in production),
remove rag_indexed/rag_warning from index_research() return dict (no
consumer reads them), drop the unused Callable import, and remove the
corresponding TestIndexAllResearch and TestGetIndexingStatus test
classes.
Also clean up add_research_to_collection route: drop the index_to_rag
pass-through and its docstring entry since the method default (True)
is sufficient.
* feat: chain convert-all before SSE indexing on History page
The "Index All" button now first POSTs to /convert-all to ensure
any unconverted research entries are turned into Documents, then
proceeds with the existing SSE stream for FAISS indexing. The
convert step is fast (~1-2s) and non-fatal — if it fails, FAISS
indexing proceeds anyway since the SSE stream also handles
conversion internally.
* fix: pass user_password on early-termination path, fix skipped counter
- Extract user_password from kwargs before the termination check so
cleanup_research_resources gets it on all paths (not just normal completion)
- Fix convert_all_research() skipped counter: count total eligible entries
before filtering so skipped = total_eligible - candidates when force=False
* fix: History page shows same document counts as collection page
- Switch from indexed_research/total_research to indexed_documents/
total_documents so both pages show the same numbers for the same
collection
- Auto-backfill unconverted research entries on History page visit
(idempotent — skips already-converted entries)
- Update label from "research indexed" to "documents indexed"
* refactor: replace SSE EventSource with POST+poll in history_search.js
Switch the History page indexing from a long-lived SSE stream to the
same POST /index/start → poll /index/status pattern already used by
collection_details.js. Removes activeEventSource, adds
indexingPollInterval, startPolling(), and checkAndResumeIndexing()
matching the collection page's field names and 2-second interval.
Also removes the now-unused RESEARCH_HISTORY_INDEX URL constant.
* refactor: remove custom SSE indexing infrastructure from backend
- Remove `index_research_history` SSE endpoint and its per-user lock
infrastructure (`_user_index_locks`, `_user_index_locks_guard`,
`_get_user_lock`) from search_routes.py
- Remove `import time`, `import threading`, `stream_with_context`, `Response`,
and `json` imports that were only needed by the SSE endpoint
- Remove the `indexer.convert_all_research()` auto-call from
`get_research_history_collection` (now read-only)
- Remove `index_all_research_streaming()` and `_index_documents_to_rag()`
methods from ResearchHistoryIndexer; FAISS is handled by the collection's
background worker
- Remove `index_to_rag` parameter from `index_research()` and its
`if index_to_rag` block; update processor_v2.py call site accordingly
* test: remove tests for deleted SSE endpoint and _index_documents_to_rag
- Delete TestIndexResearchHistorySSERoute (index_research_history SSE
endpoint, _get_user_lock, and index_all_research_streaming are gone)
- Delete TestRAGServiceCleanup (_index_documents_to_rag is removed)
- Remove all _index_documents_to_rag patch.object calls and index_to_rag
keyword arguments from remaining TestIndexResearch / TestForcePropagation
/ TestHashCollisionReuse tests
- Strip mock_rag assertions that referenced the removed RAG call
* fix: three frontend robustness fixes in history_search.js
- Add cachedCollectionId null guard in triggerIndexing() with user-facing error message
- Wrap startResp.json() in try/catch to handle non-JSON server responses
- Add pollErrorCount to stop polling after 5 consecutive network errors
* refactor: remove source indexing from History collection, extract auto_convert_research
- Remove source document creation (MIN_SOURCE_CONTENT_LENGTH, SOURCE_TYPE_SOURCE,
_create_document_from_source, ResearchResource queries) from ResearchHistoryIndexer;
History collection now only indexes report documents.
- Add module-level auto_convert_research() function to research_history_indexer.py
with built-in exception handling, replacing the inline try/except in processor_v2.py.
- Update re-export stub and __init__.py to expose auto_convert_research.
- Allow db_password=user_password pattern and file paths in .gitleaks.toml.
* test: update tests for report-only History collection (no sources)
- Fix test_index_research_creates_documents: expect 1 doc (report only)
- Replace test_converts_sources_with_sufficient_content with
test_converts_report_only_no_sources (asserts only report created)
* fix: round 2 review — import path, rag defaults, double commit, JS bugs
- Fix wrong relative import in processor_v2 (.. → ...) that broke
auto-conversion entirely
- Restore rag_service_factory fallbacks for local_search_* settings
that have no JSON defaults (prevents int(None) crash on fresh install)
- Remove double commit in index_research existing-docs branch
- Fix N+1 SourceType query in convert_all_research loop
- Change inner JOIN to outerjoin on SourceType in search enrichment
- Add auto-conversion on GET /research-history/collection endpoint
- Reset pollErrorCount before new polling session
- Set isIndexing before await to prevent double-click race
- Use URLS config for index/start and index/status endpoints
- Add searchInput null guard in handleSearchInput
- Remove stale hybrid-loading-indicator before appending new one
* fix: round 3 review — null guards, auto-convert test coverage
- Add null guards for indexed-count/total-count DOM elements
- Add test that GET endpoint calls convert_all_research
- Add test that convert_all_research failure doesn't cause 500
* fix: round 4 review — stale sources_indexed, DOMPurify config, docs
- Remove stale sources_indexed reference from save_to_collection success msg
- Remove loadCollections() call from error handler (hides error message)
- Add restrictive DOMPurify config to renderSnippet (limit allowed tags)
- Update SEMANTIC_SEARCH.md API table to match actual routes
* fix: round 5 — ID type mismatch, misleading force docstring
- Use String(h.id) for consistent ID comparisons (dataset values are
always strings, API IDs may be numeric)
- Fix index_research docstring: force does NOT trigger FAISS indexing
* fix: polling response.ok guard, stale collection cache, hybrid UI cleanup
- Check response.ok before .json() in polling/resume to prevent infinite
loop on HTTP errors (history_search.js)
- Clear cachedCollectionId on 404 so next search shows "needs indexing"
instead of permanent failure (history_search.js)
- Remove stale hybrid-loading-indicator on mode switch (history.js)
- Re-trigger handleSearchInput after delete to preserve hybrid/semantic
state instead of losing Tier 1 badges and Tier 3 results (history.js)
- Restore original button HTML on save error instead of showing stuck
"Saving..." spinner (save_to_collection.js)
* fix: remove dead index_single_research endpoint, orphaned stub, escape document_count
- Remove unused index_single_research route (add-to-collection covers
the same use case via collection infrastructure)
- Delete orphaned re-export stub at research_library/services/ (no
consumers, module lives in search/services/)
- Escape document_count in innerHTML for consistency with all other
fields in the same template
- Update docs and tests to match
* fix: session rollback on flush error, SourceType check, status guard, IntegrityError handling
- Add session.rollback() in convert_all_research per-entry except block
to clear PendingRollbackError before next iteration
- Return error (not silent success) when _create_document_from_report
returns None due to missing SourceType
- Guard index_research against non-COMPLETED research to prevent
indexing partial content
- Handle IntegrityError on commit in index_research for concurrent
auto-convert + manual indexing race condition
* refactor: simplify index_research by reusing _create_document_from_report
The 130-line index_research method duplicated existence checks, hash
dedup, and DocumentCollection linking that _create_document_from_report
already handles internally. Collapsed to ~35 lines that validate the
research, delegate to the helper, and commit.
- Remove redundant "existing docs" branch (60+ lines)
- Remove unused force parameter (no frontend caller)
- Remove dead get_collection() method (no callers)
- Update tests: remove TestForcePropagation, update idempotency assertion
* refactor: deduplicate _get_rag_service_for_thread by reusing rag_service_factory
_get_rag_service_for_thread duplicated ~95% of the settings resolution
logic from rag_service_factory.get_rag_service (default settings loading,
JSON parsing, collection settings lookup). Replace with a thin wrapper
that delegates to the factory and propagates db_password to sub-managers
via the property setter for thread-safe access.
Reduces 140 lines → 28 lines. Settings changes now only need to be made
in one place (rag_service_factory).
* fix: batch rollback bug in convert_all_research, guard nullable .value
- convert_all_research: commit per-entry instead of batching 100, so a
single failure only rolls back that entry (not the whole batch)
- rag_service_factory: guard collection.embedding_model_type.value for
nullable column to prevent AttributeError
- docs: fix endpoint count (4, not 5) in SEMANTIC_SEARCH.md
* fix: response.ok check, URL encoding, doc diagram path
- history_search.js: add response.ok check before parsing JSON in
triggerIndexing (consistent with other fetch calls in the file)
- history.js: use URLS.PAGES.LIBRARY + encodeURIComponent instead of
hardcoded string for library navigation
- SEMANTIC_SEARCH.md: fix diagram path to include /library/api/ prefix
|
||
|
|
54b9dc2579 |
ci: remove OSSAR scan from release gate (#2911)
OSSAR's summary step hardcodes "192 ESLint Warnings" and specific file names regardless of actual scan results, providing zero dynamic signal. It also uses the deprecated `set-output` command. CodeQL + Semgrep + Bearer already provide comprehensive SAST coverage. ESLint checks are handled by pre-commit hooks. |
||
|
|
05b96fbe3f |
refactor: move engine module paths from settings DB to hardcoded registry (#2843)
* refactor: move engine module paths from settings DB to hardcoded registry
Engine implementation details (module_path, class_name, full_search_module, full_search_class) are internal wiring, not user configuration. Storing them in the settings DB created a security attack surface requiring blocklist validation and route blocking.
Changes:
- New engine_registry.py with frozen dataclass entries for all 24 engines
- search_engines_config.py injects registry data after loading DB settings
- search_engine_factory.py passes engine_config to full search wrapper
- Remove ~52 module/class entries from 9 JSON defaults files
- Remove BLOCKED_SETTING_PATTERNS, is_blocked_setting(), and 4 call sites
- Remove absolute→relative normalization from module_whitelist.py
- Update docs, tests, and golden master
* fix: remove TestGetBlockedSettingsError that references removed function
The get_blocked_settings_error() function was removed as part of the engine registry refactor. This test class was added on main after the PR was created and wasn't caught by conflict resolution.
* fix: remove TestSaveSettingsPostBlockedSetting that tests removed blocking logic
BLOCKED_SETTING_PATTERNS and is_blocked_setting() were removed as part of the engine registry refactor. This test was added on main and references the now-removed blocking behavior.
* fix: inject ENGINE_REGISTRY into parallel/meta engine _get_search_config()
Both ParallelSearchEngine and MetaSearchEngine manually extract config from settings_snapshot without going through search_config(). Since module_path/class_name are no longer in the settings DB (they live in the hardcoded registry), these engines would silently fail to discover sub-engines on fresh installations.
Fix: inject ENGINE_REGISTRY values after extraction, matching the pattern used in search_config().
Also fixes MetaSearchEngine's stale check for "search.engine.auto.class_name" in settings_snapshot: this key no longer exists in settings DB, so auto engine config would be skipped.
* fix: update tests for engine registry refactor
- test_whitelist_config_consistency: check ENGINE_REGISTRY instead of JSON defaults (module_path/class_name no longer in defaults)
- test_meta_search_engine_high_value: expect registry-injected module_path/class_name in _get_search_config() output
- test_meta_search_engine_extended: registry overwrites snapshot values
- test_settings_routes_coverage: remove blocked setting tests (blocking logic removed; registry is now the security mechanism)
- test_settings_routes_deep_coverage2: same as above
* fix: add 5 missing engines to registry, strip module_path from their settings
Add gutenberg, openlibrary, pubchem, stackexchange, and zenodo to ENGINE_REGISTRY (were added to main in #1540 after this branch diverged). Remove module_path/class_name from their settings JSON files and golden master, matching the pattern established for all other engines.
Expand test_engine_registry.py to scan per-engine settings_*.json files and verify no settings files still contain module_path/class_name.
* fix: inject full_search_module/class in meta/parallel engine _get_search_config()
The registry injection in MetaSearchEngine and ParallelSearchEngine was missing full_search_module and full_search_class fields, making it inconsistent with the main search_config() injection. This would cause full-search wrappers to fail when created through meta/parallel engines.
* fix: resolve pre-commit formatting issues and sync pdm.lock after merge with main |
||
|
|
d89c96353d |
remove: dedicated vLLM provider (use openai_endpoint instead)
The in-process vLLM provider (requiring torch+transformers+vllm ~10GB) is obsolete — vLLM is universally run as a server and accessed via its OpenAI-compatible API, which the openai_endpoint provider already handles. Removes vllm from: config, pricing, rate limiting, hardware warnings, frontend dropdowns, pyproject.toml optional deps, docs, default_settings.json, golden master, benchmark template, and all related tests (37 files, -436 lines). Keeps vLLM mentions in openai_endpoint context (labels, docs) since that's the correct usage path. |
||
|
|
9988f70318 |
refactor: remove fallback LLM (FakeListChatModel) from all providers (#2717)
* cleanup: remove @pytest.mark.requires_llm decorators and fallback LLM doc references
Remove the `@pytest.mark.requires_llm` decorator from all test files since the fallback LLM infrastructure is being removed. Update docs to remove references to `LDR_TESTING_USE_FALLBACK_LLM` and `LDR_USE_FALLBACK_LLM` environment variables from troubleshooting and CI configuration tables.
* test: remove fallback LLM references from test files
Remove all fallback-related test code: TestGetFallbackModel classes, FakeListChatModel assertions, check_fallback_llm parameters, and LDR_USE_FALLBACK_LLM skipif markers. Replace fallback-returning tests with ValueError-expecting tests for missing API keys and unavailable providers.
* cleanup: remove remaining use_fallback_llm references from source and tests
Remove use_fallback_llm() imports and calls from db_utils.py and rate_limiting/tracker.py. Clean up test files that referenced check_fallback_llm, get_llm_setting_from_snapshot, and LDR_USE_FALLBACK_LLM env var.
* cleanup: remove remaining fallback LLM references from test files
Remove all use_fallback_llm mocks, LDR_USE_FALLBACK_LLM env var checks, and related skip logic from test files since the fallback LLM feature has been removed from source code.
- test_db_utils.py: Remove use_fallback_llm mock patches from 4 tests
- test_rate_limiter.py: Replace use_fallback_llm mock with is_ci_environment
- test_tracker.py: Replace fallback mode test with CI mode test
- test_tracker_quality_stats.py: Remove 8 use_fallback_llm decorators
- test_openai_api_key_usage.py: Remove LDR_USE_FALLBACK_LLM skipif
- test_llm_provider_integration.py: Remove LDR_USE_FALLBACK_LLM skipif
- test_ci_config.py: Remove LDR_USE_FALLBACK_LLM env var setting
- test_search_system.py: Remove LDR_USE_FALLBACK_LLM skipif
- run_all_tests.py: Remove LDR_USE_FALLBACK_LLM log line
- test_env_auto_generation.py: Remove testing.use_fallback_llm mapping
- test_lmstudio_provider.py: Fix docstring referencing removed function
* refactor: remove fallback LLM from providers, settings, CI, and tests
- Remove FakeListChatModel import and get_llm_setting_from_snapshot wrapper
- Update all provider imports to use get_setting_from_snapshot directly
- Remove LDR_USE_FALLBACK_LLM env var from CI workflows
- Remove use_fallback_llm setting and registry function
- Remove skip_if_using_fallback_llm fixture from conftest.py
- Update tests to expect ValueError instead of fallback model
* refactor: remove fallback model from llm_config and thread_settings
- Remove get_fallback_model() and all call sites in get_llm()
- Replace fallback returns with descriptive ValueError raises
- Remove LDR_USE_FALLBACK_LLM env check block from get_llm()
- Remove check_fallback_llm parameter from get_setting_from_snapshot
- Remove get_llm_setting_from_snapshot convenience wrapper
- Add ValueError re-raise in Ollama model-not-found path
- Regenerate golden master with ensure_ascii=False for proper Unicode
* fix: restore requires_llm skip mechanism and fix CI test failures
Three fixes for CI regressions from fallback LLM removal:
1. Restore @pytest.mark.requires_llm decorator and skip fixture (skip_if_no_real_llm) that checks LDR_TESTING_WITH_MOCKS env var. Re-add decorators to 17+ tests across 9 files that need real LLMs.
2. Fix type coercion in test_openai_api_key_usage.py by converting fixture from dict format to simplified raw-value format, bypassing get_typed_setting_value string coercion.
3. Fix golden master format mismatch by adding ensure_ascii=False to test serialization to match regeneration script. Narrow pre-commit hook trigger to only defaults/*.json files.
* fix: remove remaining fallback LLM references from coverage tests
- Delete TestGetFallbackModel class from test_llm_config_coverage.py (5 tests that imported removed get_fallback_model)
- Update test_llm_config_missing_coverage.py: 6 tests that expected FakeListChatModel fallback now expect ValueError/exception raises
- Remove use_fallback_llm mocks from test_rate_limiting_tracker_coverage.py (delete 4 fallback-specific tests, fix 9 tests)
- Remove use_fallback_llm mocks from rate_limiting/test_tracker_coverage.py (fix _make_tracker helper and 25 tests)
- Add @pytest.mark.requires_llm to test_analyze_documents_minimal
- Merge upstream main to pick up new coverage test files
* fix: remove dead LDR_USE_FALLBACK_LLM env var from accessibility tests CI
This env var was added to the accessibility test server but has no effect since the fallback LLM code was removed.
* fix: align pre-commit hook description and error listing with defaults-only trigger
The hook file pattern was narrowed to defaults/ only, but the description and error-listing code still referenced config/. Remove dead config/ path from the file listing and update messaging to match.
* fix: update test_llm_config_deep_coverage.py for fallback LLM removal
File was added on main after branch diverged. Remove TestGetLlmFallbackEnvVar class (tests removed functionality) and update test_provider_lowercased to expect ValueError instead of fallback model.
* fix: improve "none" provider error message and fix stale CI-mode test
- Add explicit handler for provider="none" with user-friendly message instead of misleading "this is a bug" error
- Fix test_load_estimates_skipped_in_ci_mode: _load_estimates no longer checks is_ci_environment, test now correctly verifies deferred loading behavior in non-programmatic mode
- Update 4 test assertions to match new "none" provider error message |
||
|
|
add97b1793 |
docs: polish installation docs after migration (#2889)
* docs: move detailed installation instructions from README to dedicated pages
README Installation Options section (~200 lines) replaced with a compact table linking to docs/installation.md (hub page), docs/install-pip.md (dedicated pip guide), and existing docker-compose and Unraid guides. No content lost: everything is now in focused doc files.
* docs: trim redundant pip section in installation hub page
The pip section in docs/installation.md duplicated nearly all of the Quick Install content from docs/install-pip.md. Replace with a brief summary + single install command + link to the dedicated guide, consistent with the hub-and-spoke pattern used by the Unraid section. Addresses review feedback from djpetti on PR #2819.
* docs: restore missing installation info from README migration
- Add NVIDIA Container Toolkit full install commands (Ubuntu/Debian) with distro note for RHEL/Fedora/Arch to docs/installation.md
- Add GPU docker-compose alias convenience tip
- Add DIY docker-compose configuration guidance (GPU driver, context length, keep alive, model selection)
- Add Windows PDF export warning (Pango/WeasyPrint) to docs/install-pip.md
- Fix SQLCipher wording: pre-built wheels available, not "requires system-level libraries"
- Restore ldr-web command instead of python -m invocation
* docs: follow-up polish for installation docs migration
- Restructure README Quick Start with clear Option 1/2/3 labels
- Update deprecated LDR_ALLOW_UNENCRYPTED to LDR_BOOTSTRAP_ALLOW_UNENCRYPTED
- Add "Open http://localhost:5000" to install-pip.md after ldr-web step
- Add back-link from install-pip.md to installation overview
- Add Docker/Docker Compose install prerequisite links to installation.md
- Cross-link NVIDIA toolkit commands from docker-compose-guide to installation.md
- Use double quotes for volume spec in Docker Run for cross-platform compat
* docs: restore original Quick Start ordering (Docker Run first) |
||
|
|
abbd19584a |
docs: move detailed install instructions from README to dedicated pages (#2819)
* docs: move detailed installation instructions from README to dedicated pages
README Installation Options section (~200 lines) replaced with a compact table linking to docs/installation.md (hub page), docs/install-pip.md (dedicated pip guide), and existing docker-compose and Unraid guides. No content lost: everything is now in focused doc files.
* docs: trim redundant pip section in installation hub page
The pip section in docs/installation.md duplicated nearly all of the Quick Install content from docs/install-pip.md. Replace with a brief summary + single install command + link to the dedicated guide, consistent with the hub-and-spoke pattern used by the Unraid section. Addresses review feedback from djpetti on PR #2819.
* docs: restore missing installation info from README migration
- Add NVIDIA Container Toolkit full install commands (Ubuntu/Debian) with distro note for RHEL/Fedora/Arch to docs/installation.md
- Add GPU docker-compose alias convenience tip
- Add DIY docker-compose configuration guidance (GPU driver, context length, keep alive, model selection)
- Add Windows PDF export warning (Pango/WeasyPrint) to docs/install-pip.md
- Fix SQLCipher wording: pre-built wheels available, not "requires system-level libraries"
- Restore ldr-web command instead of python -m invocation |
||
|
|
76d8518a1b |
docs: pip install now works natively on Windows (#2766)
* docs: update Windows install docs (pip install now works natively)
sqlcipher3 0.6.2+ ships self-contained Windows wheels (5.9MB .pyd with SQLCipher + OpenSSL statically linked). No compilation, Visual Studio, or system libraries needed. Update README and SQLCipher guide to reflect this, removing the "for developers" framing and outdated warnings.
Refs #494
* docs: fix README consistency issues from review
- Align SQLCipher wording across quick start and Option 3 sections
- Replace "skip it" with "use standard SQLite instead"
- Replace duplicate pip snippet in Docker section with cross-reference link
- Use `ldr-web` consistently instead of `python -m local_deep_research.web.app`
- Add Windows PDF export note (WeasyPrint/Pango) to Option 3
- Replace SQLCipher guide link in quick start with WeasyPrint setup link |
||
|
|
7d37d35b2f |
fix: normalize full_search_module paths and remove dead serpapi references (#2826)
* fix: remove dead serpapi full_search_module/class references
The serpapi engine pointed to `.engines.full_serp_search_results_old` with class `FullSerpAPISearchResults`, but neither the module file nor the class exist. All engines (including serpapi) use `.engines.full_search` / `FullSearchResults`. Update defaults, golden master, docs, and remove the stale whitelist entry.
* fix: normalize full_search_module in search_config()
search_config() normalized legacy absolute module_path values but skipped full_search_module. Extend the normalization loop to cover both keys for consistency with the defense-in-depth normalization in get_safe_module_class().
* fix: check full_search_module key in pre-commit hook
The pre-commit hook only validated module_path keys in JSON files. Extend it to also check full_search_module, and add regression tests for both cases.
* fix: add debug logging for absolute module path normalization
When get_safe_module_class() normalizes an absolute path to relative form, log the conversion at debug level for easier debugging of Docker user issues. |
||
|
|
8ea4787626 |
fix: rename "Custom OpenAI Endpoint" to "OpenAI-Compatible Endpoint" (#2745) (#2818)
Users selecting Llama.cpp couldn't find the right provider for custom endpoints because four different names were used across the codebase. Standardize on "OpenAI-Compatible Endpoint", the industry-standard naming used by LM Studio, Ollama, Open WebUI, vLLM, and others.
Changes:
- Provider class: provider_name → "OpenAI-Compatible Endpoint"
- Legacy config, default_settings.json, golden master: consistent name
- JS fallbacks (settings.js, benchmark.html): updated dropdown labels
- Llama.cpp label clarified to "(Local GGUF files only)"
- Docs (faq.md, env_configuration.md): updated references
- Tests: updated assertions and docstrings
No breaking changes: internal keys (openai_endpoint, OPENAI_ENDPOINT), setting paths, class/function names, and file names are unchanged. |
||
|
|
e6d45ab5bb |
chore: auto-bump version to 1.3.60 (#2709)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
d67438b239 |
chore: auto-bump version to 1.3.59 (#2527)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |
||
|
|
19c1777e97 |
docs: fix inaccurate credential sweep wording and inconsistent file paths (#2614)
- Change "credential sweep (every request)" to "dead-thread credential tracking cleanup (every request)" in teardown_appcontext table row, since cleanup_dead_threads() removes thread credential tracking entries rather than performing active credential security clearing
- Use full src/local_deep_research/ paths in Key Files table to match the convention used in the main Key Source Files table |
||
|
|
0b23d58e85 |
docs: thread lifecycle, FD budget, and resource exhaustion (#2605)
* fix: prevent file descriptor exhaustion from dead thread engine accumulation
Three root causes addressed:
1. Dead thread engine accumulation (primary): _thread_engines grows unboundedly as crashed/terminated threads leave orphaned NullPool engines. Add cleanup_dead_thread_engines() that sweeps entries for threads no longer in threading.enumerate(). Integrate via throttled sweep in teardown_appcontext (every 60s) and periodic sweep in the queue processor loop (every 6 iterations).
2. Generic downloader stream=True leak (secondary): generic.py used stream=True but never read or closed the response body, holding connections open. Removed stream=True since only status_code and headers are inspected.
3. Docker default 1024 FD limit (contributing): Add nofile ulimit (65536) to docker-compose.yml so the container has headroom for WAL mode databases, thread pools, and connection pools.
* fix: address review findings (sweep lock, credential cleanup, flaky test)
- Add _sweep_lock to prevent TOCTOU race on _last_sweep_time in maybe_sweep_dead_engines() (concurrent teardowns could all pass the interval check)
- Move alive_ids computation inside _thread_engine_lock to prevent race between snapshot and engine dict mutation
- Sweep dead _thread_credentials (plaintext passwords) alongside engines in processor_v2.py and app_factory.py teardown
- Fix flaky test_sweeps_after_interval: replace time.sleep(0.15) with _last_sweep_time backdating
- Add tests for credential sweep and module-level cleanup_dead_threads()
* fix: close search engine sessions after research, fix stream=True leak properly
Three improvements to the FD exhaustion fix:
1. generic.py: Restore stream=True (removing it is unsafe: GenericDownloader handles ALL URLs and would download multi-GB files into memory). Use context manager instead to ensure the streamed connection is properly closed on all return paths, preventing socket FD leaks.
2. research_service.py: Add use_search.close() and system.close() in finally block of run_research_process(). Search engine HTTP sessions (e.g. SemanticScholar's SafeSession) were never explicitly closed after research, relying on non-deterministic GC for cleanup.
3. search_system.py + strategies: Add close() method to AdvancedSearchSystem and BaseSearchStrategy, with overrides in ConstraintParallelStrategy and ConcurrentDualConfidenceStrategy to shut down persistent ThreadPoolExecutors.
Also adds detailed design comments throughout the codebase documenting:
- Why NullPool engines don't leak FDs (memory leak only)
- Why stream=True must NOT be removed from the diagnostic block
- The dual sweep trigger architecture (request-driven + queue-driven)
- Thread ID recycling limitations
- Search engine lifecycle and cleanup responsibilities
Fixes flaky test_removes_dead_thread_entries by using threading.Barrier to prevent thread ID recycling during test.
* fix: unregister user from news scheduler on logout
The logout handler never called scheduler.unregister_user(), causing:
- Passwords to persist in scheduler memory for up to 48 hours
- Orphaned APScheduler jobs to keep running after logout
- Orphaned jobs to re-create QueuePool engines (~10 FDs each) after close_user_database() disposed the original, contributing to FD leaks
Add scheduler unregistration before close_user_database() so running jobs can finish gracefully while the DB engine is still available. Add design comment documenting the logout cleanup order.
* test: remove ineffective patch in logout scheduler test
The `routes.get_news_scheduler` patch was ineffective because the logout handler imports `get_news_scheduler` dynamically inside the function body, so the name never enters the routes module namespace. The `create=True` flag masked this by silently creating a new attribute. The real patch on `subscription_manager.scheduler.get_news_scheduler` is sufficient.
* fix: remove nofile ulimit override from docker-compose.yml
Docker containers inherit ulimits from the Docker daemon, which typically runs with LimitNOFILE=infinity (1073741816+). Setting nofile to 65536 could actually *lower* the limit for most users, hurting large installations. The FD leak root causes are already fixed in this PR (dead-thread engine sweep, session close, scheduler unregister), so the safety net is unnecessary. Let users and their Docker daemon config control this.
* fix: add try-except to strategy executor shutdown, elevate scheduler unregister log level
- Wrap executor.shutdown(wait=False) in try-except in strategy close() methods for consistency with parallel_search_engine.py pattern
- Change logger.debug → logger.warning for scheduler unregister failure on logout, since failure means password stays in scheduler memory
* docs: add comments explaining non-obvious design decisions from deep review
- SQLCipher WAL FD cost (1-3 FDs per connection, multiplied by users)
- Logout cleanup ordering: why unregister before close, known race window
- shutdown(wait=False): why non-blocking, safety via double-cleanup pattern
* docs: add thread lifecycle, FD budget, and resource exhaustion documentation
Knowledge captured from PR #2591 deep review (5 rounds of verification):
- architecture.md: Thread & Resource Lifecycle section with cleanup layers, mermaid diagram, FD budget table, and key files reference
- troubleshooting.md: Resource Exhaustion section with diagnosis commands and solutions for FD exhaustion
- docker-compose-guide.md: Resource Limits note explaining nofile/memlock
- web/database/README.md: Thread Safety & Connection Model section
- Cross-references added between all 4 docs
- Updated Areas for Improvement (container optimization → resource observability)
- Added encrypted_db.py and thread_local_session.py to Key Source Files |
||
|
|
8d32f5f9e3 |
refactor: eliminate server_config.json — env-var-only server settings (#2505)
* refactor: eliminate server_config.json, make server settings env-var-only

Remove the JSON file-based server configuration and sync mechanism. All 8 server settings (host, port, debug, HTTPS, allow_registrations, and 3 rate limits) are now read exclusively from environment variables via get_typed_setting_value() with the existing LDR_* naming convention.

- Rewrite server_config.py: remove get_server_config_path(), save_server_config(), sync_from_settings(); simplify load_server_config() to use get_typed_setting_value(key, None, ...) for all settings
- Add rate_limit_settings to the config dict (was only via .get() fallback)
- Remove sync_from_settings calls from 3 sites in settings_routes.py
- Hide server settings from UI (visible: false, editable: false) in default_settings.json and settings_security.json
- Add security.rate_limit_settings entry to settings_security.json
- Fix swapped min/max on web.port (was min:65535, max:0)
- Update descriptions to reference env var names
- Rewrite test_server_config.py: remove 21 JSON-file tests, keep 13 defaults/fail-closed tests, add 8 env var override tests (35 total)
- Regenerate golden master settings
- Remove server_config.py from check-file-writes.sh exemption list
- Update docstrings in rate_limiter.py and app.py

* fix: address review findings for server_config.json elimination

- Fix save_all_settings response: return dict (keyed by setting key) instead of list, matching GET /settings/api shape; include missing visible, min_value, max_value, step fields so visibility filter works
- Fix JS consumer: use dict key access instead of .find() on response
- Fix docs: LDR_WEB_PORT is the correct env var for server bind port, not LDR_APP_PORT; add clarifying note
- Remove stale KNOWN_NUMERIC_ISSUES entry for web.port (now fixed)
- Add tests: empty-string and whitespace env var edge cases for allow_registrations fail-closed, and env-var override coverage

* feat: add deprecation migration path for server_config.json (#2549)

Users who set `allow_registrations: false` via the UI (persisted in server_config.json) would silently lose that setting on upgrade, re-enabling open registration. Docker users are especially at risk since named volumes persist the file across container upgrades.

Add read-only migration: if server_config.json exists, honor its values as fallbacks (env var > legacy file > default) and log deprecation warnings guiding users to migrate to env vars. No write-back logic is re-added: save_server_config() and sync_from_settings() remain removed per the PR's intent.

* feat: show web UI warning when legacy server_config.json is detected

Adds a dismissible warning banner in the web interface when the deprecated server_config.json file exists, using the existing warning_checks system. Addresses reviewer feedback from PR #2505.

* fix: address review findings for server_config.json elimination

- Change web.host, web.port, web.use_https type from SEARCH to APP in both default_settings.json and golden_master_settings.json
- Add 3 tests for check_legacy_server_config() covering dismissed, missing file, and file-exists branches
- Add autouse fixture to clear LDR_APP_ALLOW_REGISTRATIONS env var in test_server_config.py to prevent test pollution from dev shell

* fix: round 2 review findings for server_config.json elimination

- Fix flaky test_all_four_warnings_simultaneously by mocking get_server_config_path to prevent real server_config.json on disk from breaking exact set equality assertion
- Add dismiss_legacy_config to _make_settings_manager defaults and rename test_all_six_settings_read → test_all_seven_settings_read
- Add orchestrator-level tests for legacy_server_config warning (exists/absent/dismissed scenarios)
- Add fail-closed guard for legacy JSON allow_registrations string values (e.g. "disabled" → False) to match env var guard
- Log warning for unrecognized keys in legacy server_config.json to surface typos like "Port" instead of "port"
- Regenerate CONFIGURATION.md to remove stale server_config.json reference in app.debug description

* fix: round 3 review findings — test quality and migration docs

- Replace vacuous `is not None` assertion with meaningful env-var-vs-legacy guard priority test using unrecognized values on both paths
- Add positive test for DEPRECATED banner when recognized keys present
- Rename misleading test name to reflect actual scope (hardware + context)
- Add migration section to env_configuration.md for server_config.json users |
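The precedence rule described above (env var > legacy file > default) combined with the fail-closed guard can be sketched as follows. `resolve_allow_registrations`, the set of accepted truthy strings, and the `default=True` value are all assumptions made for this example; only the env var name `LDR_APP_ALLOW_REGISTRATIONS` and the key `allow_registrations` come from the commit messages.

```python
# Assumed set of strings that count as "enabled"; anything else fails closed.
_TRUTHY = {"true", "1", "yes", "on"}


def _parse_fail_closed(value):
    """Return True only for clearly truthy values; everything else is False."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in _TRUTHY
    return False


def resolve_allow_registrations(env, legacy, default=True):
    """Hypothetical resolver: env var wins, then legacy JSON, then default."""
    raw = env.get("LDR_APP_ALLOW_REGISTRATIONS")
    if raw is not None:
        # Empty or unrecognized strings disable registration (fail closed).
        return _parse_fail_closed(raw)
    if "allow_registrations" in legacy:
        # Same guard on the legacy path, so "disabled" maps to False.
        return _parse_fail_closed(legacy["allow_registrations"])
    return default
```

Passing `env` and `legacy` as plain mappings (for example `os.environ` and the parsed JSON file) keeps the resolver easy to test without touching the real process environment.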
||
|
|
f246fa6044 |
docs: add comprehensive MCP server documentation (#2546)
* docs: add comprehensive MCP server documentation

- Create standalone docs/mcp-server.md with full MCP server docs covering installation, configuration, all 7 tools, research strategies guide, ReAct agentic strategy deep dive, MCP client setup, error handling, security model, Docker deployment, usage examples, and troubleshooting
- Add MCP Server section to docs/features.md under Advanced Features
- Add MCP Server CLI section to docs/cli-tools.md
- Fix search.search_strategy -> search.strategy in server.py and tests to match renamed setting from #2550

* fix(docs): correct 9 issues found in MCP server documentation review

- Revert search.strategy → search.search_strategy in server.py and tests (6 occurrences)
- Fix collection_name description: it's an engine ID, not a display name
- Fix invalid JSON in analyze_documents return example
- Add missing MCP Server CLI entry to cli-tools.md TOC
- Add unknown error type to error handling table
- Fix broken MCP security guide external link
- Clarify Docker section: MCP must run on host (STDIO can't bridge containers)
- Fix "7 research tools" → "7 tools (4 research, 3 discovery)" in features.md
- Add temperature valid range note (0.0-2.0)

* feat(mcp): add `search` tool for raw search results without LLM

Add a new MCP tool that calls a specific search engine and returns raw results (title, link, snippet) without LLM processing. This enables external AI agents to perform fast, cost-free searches and handle result analysis themselves.

- Required `engine` parameter with validation against available engines
- API key presence check before engine creation
- Body-to-snippet normalization for consistent output
- 8 test cases covering success, errors, and edge cases
- Updated docs with tool count (7→8) and parameter reference

* fix(mcp): set thread-local settings context in search tool

Some engine constructors (e.g., arxiv's JournalReputationFilter) call get_llm() internally without passing settings_snapshot, falling through to the thread-local settings context. Set and clean up the context so these engines can resolve settings correctly.

* docs: add OpenClaw MCP client configuration (#2562)

Add OpenClaw configuration subsection alongside Claude Desktop in the MCP server guide, as suggested in PR #2546 review.

* docs: add Claude Code config, individual search engine examples, and openclaw

- Add Claude Code MCP configuration (.mcp.json) to README and mcp-server.md
- Add search tool to README tools table with LLM Cost column
- Add individual search engine examples (arxiv, pubmed, wikipedia, openclaw)
- Highlight search tool usefulness for monitoring and subscriptions
- List common engines in mcp-server.md search tool section |
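The "body-to-snippet normalization" mentioned for the `search` tool could be as small as the sketch below. `normalize_result` is a hypothetical helper, and the fallback order (prefer `snippet`, else `body`, else an empty string) is an assumption; the commit only states that results are returned as title, link, and snippet.

```python
from typing import Any


def normalize_result(raw: dict[str, Any]) -> dict[str, str]:
    """Map a raw engine result onto a fixed title/link/snippet shape."""
    # Assumed fallback: some engines return the text under "body".
    snippet = raw.get("snippet") or raw.get("body") or ""
    return {
        "title": str(raw.get("title", "")),
        "link": str(raw.get("link", "")),
        "snippet": str(snippet),
    }
```

A fixed output shape means an external agent can consume results from any engine without per-engine branching.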
||
|
|
df52e3ec3e |
feat: implement Reddit feedback improvements (#1909)
* feat: implement Reddit feedback improvements

Based on user feedback from r/LocalLLaMA, this commit addresses several documentation and usability issues:

Documentation:
- Add macOS port 5000 conflict documentation (AirPlay Receiver conflict)
- Create comprehensive reverse proxy guide (Caddy, Nginx, Traefik)
- Add debug logging guidance with platform-specific paths

UI/UX:
- Add "Advanced" badge and tooltip to Detailed Report mode to set expectations

Feature:
- Add opt-in LLM prompt/response logging (LDR_LOG_LLM_CALLS=true) for debugging

Closes feedback from: reddit.com/r/LocalLLaMA/comments/1qdj2nn/

* fix: correct mypy type issues in llm_log_utils

* docs: remove reverse proxy guide, keep inline note instead

The reverse proxy configuration is generic infrastructure knowledge not specific to LDR. Replaced the guide link with a one-liner noting that LDR uses HTTP polling and works with any standard reverse proxy.

* perf: pre-compile regex patterns in llm_log_utils

Avoids recompiling 6 regex patterns on every sanitize() call.

* refactor: keep docs and CSS, remove LLM logging feature

- Keep macOS port 5000/AirPlay troubleshooting docs
- Keep debug logging documentation (fixed LDR_LOG_LEVEL → LDR_ENABLE_FILE_LOGGING)
- Keep .ldr-mode-badge CSS and Advanced badge UI change
- Restore correct Nginx WebSocket reverse proxy config
- Remove LLM logging feature (suggest as separate focused PR)

* docs: add security note at top of Debug Logging section

Move the log file security warning to a prominent blockquote at the start of the section so it is not overlooked. |
||
|
|
a466877273 |
security: gate global scheduler control behind setting (#2035)
* security: gate global scheduler control behind setting

The news scheduler is a global singleton: starting, stopping, or triggering it affects all users. Add a setting to control whether these operations are accessible via API.

- Add `news.scheduler.allow_api_control` setting (default: true)
- Env var: LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL
- Also configurable via settings UI
- Add `@scheduler_control_required` decorator that checks the setting
- Apply to destructive endpoints: start, stop, check-now, cleanup-now
- Read-only endpoints (status, users, stats) remain accessible to any authenticated user

Multi-user deployments can set `LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL=false` to prevent any user from starting/stopping the global scheduler.

* test: add tests for scheduler_control_required decorator

Tests cover:
- Decorator allows execution when setting is enabled
- Decorator returns 403 when setting is disabled
- Error response includes informative message
- Correct setting key is checked
- Function name is preserved (wraps)

* fix: make scheduler API control setting non-editable for security

The news.scheduler.allow_api_control setting controls a global security boundary (the scheduler singleton affects all users). Following the precedent set by app.allow_registrations, this setting should not be editable from the UI; it must be configured via environment variable LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL only.

Also adds integration tests verifying that mutating scheduler endpoints (start, stop, check-now, cleanup-now) return 403 when disabled, while read-only endpoints (status, users, stats) remain accessible.

* fix: change allow_api_control default from true to false

Make scheduler API control secure-by-default. Since the setting is editable: false (env-var-only), no existing UI state is affected. Users who want API control must now explicitly opt in via LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL=true.

* fix: regenerate config docs and add audit logging for scheduler gate

Regenerate CONFIGURATION.md to include the new news.scheduler.allow_api_control setting (fixes check-config-docs CI). Add logger.warning when scheduler API control is blocked, for audit trail in multi-user deployments. |
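A framework-free sketch of the gate this commit describes is shown below: a decorator that reads `LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL`, defaults to disabled, logs a warning when it blocks a call, and preserves the wrapped function's name. The decorator reuses the name from the commit message for readability, but the body, the `(payload, status)` return convention, the accepted truthy strings, and `stop_scheduler` are assumptions; the real decorator presumably integrates with the project's web framework and settings layer.

```python
import functools
import logging
import os

logger = logging.getLogger(__name__)

ENV_VAR = "LDR_NEWS_SCHEDULER_ALLOW_API_CONTROL"


def _api_control_enabled():
    # Secure by default: only an explicit truthy value opts in.
    return os.environ.get(ENV_VAR, "").strip().lower() in {"true", "1", "yes"}


def scheduler_control_required(view):
    """Block the wrapped view with a 403 unless API control is enabled."""

    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        if not _api_control_enabled():
            # Audit trail for multi-user deployments.
            logger.warning("scheduler API control blocked: %s", view.__name__)
            return {"error": f"Scheduler control is disabled; set {ENV_VAR}=true"}, 403
        return view(*args, **kwargs)

    return wrapper


@scheduler_control_required
def stop_scheduler():
    # Hypothetical mutating endpoint guarded by the decorator.
    return {"status": "stopped"}, 200
```

Reading the environment at call time rather than at import time keeps the gate testable and means the check reflects the process environment on every request.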
||
|
|
d896970152 |
chore: auto-bump version to 1.3.58 (#2458)
Co-authored-by: LearningCircuit <185559241+LearningCircuit@users.noreply.github.com> |