Mirror of https://github.com/LearningCircuit/local-deep-research.git (synced 2026-06-15 19:46:56 +03:00)

* feat: add needs_reranking flag to auto-enable LLM relevance filtering for keyword-based engines

Engines with poor native relevance ranking (arXiv, PubMed, Wikipedia,
GitHub, Mojeek, etc.) now auto-enable LLM-based result filtering via a
new `needs_reranking` class attribute.

This fixes the priority bug where the global `skip_relevance_filter=True`
incorrectly overrode auto-detection for engines that genuinely need
filtering. Priority is now: per-engine setting > needs_reranking >
global skip. The global skip only affects unclassified engines.

Closes #2297

* fix: address 7 code-review issues on needs_reranking branch

1. Rename needs_reranking → needs_llm_relevance_filter for consistency
   with enable_llm_relevance_filter and skip_relevance_filter naming
2. Fix Paperless dead code: replace non-existent _apply_content_filters
   with proper _filter_for_relevance() call in custom run() override
3. Fix misleading skip_relevance_filter description to accurately
   reflect checkbox behavior and keyword engine exceptions
4. Delete 4 vacuously-true inline tests that duplicated factory logic
   instead of calling the real factory (coverage tests already exist)
5. Add needs_llm_relevance_filter to EXTENDING.md and OVERVIEW.md
6. Clarify is_generic comment: generic does not imply good ranking
7. Upgrade no-LLM log from debug to warning when filtering was
   requested but no LLM is available (with should_filter guard)

* fix: remove Paperless fallback that overrode valid empty LLM filter results

Replace the fallback that restored all previews when the LLM filter
returned empty with an info log. The base class _filter_for_relevance()
already handles errors internally (returns previews[:5] on exception or
JSON parse failure). An empty result means the LLM legitimately found
nothing relevant; trust it, don't override it.

* refactor: rename needs_llm_relevance_filter → is_lexical

The flag describes what the engine IS (lexical/keyword-based search)
rather than what it needs.
This is a general classification that can drive multiple behaviors
beyond just the relevance filter, e.g. query optimization strategies,
result deduplication, or UI hints. Matches the existing is_* naming
pattern (is_scientific, is_generic).

* Revert "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit c322d478a1.

* Reapply "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit 853dfe90bd.

* feat: add is_lexical classification flag alongside needs_llm_relevance_filter

Separates classification from behavior:
- is_lexical: informational flag indicating the engine uses
  keyword/lexical search. Reusable for query optimization, UI hints,
  deduplication, etc.
- needs_llm_relevance_filter: behavioral flag that the factory reads to
  auto-enable LLM relevance filtering on the engine instance.

Both flags are set on all 15 keyword-based engines. The factory only
checks needs_llm_relevance_filter for filtering decisions.

* fix: improve relevance filter error handling and logging

- Return [] on all error paths instead of hiding failures behind
  previews[:5] fallback; failures should be visible, not masked
- Log errors at error level (not warning) for LLM parse failures
- Add engine name prefix to all log messages for traceability
- Add token estimate debug log to help diagnose context overflow
- Reduce log noise: routine operations are debug, only summary is info
- Consolidate validation into single check

* fix: address PR review findings for relevance filter

- Fix literal \n in EXTENDING.md code block
- Remove 'Maximum results to return' from LLM prompt (LLM decides)
- Add INPUT/KEPT/REMOVED debug logging for filter quality analysis
- Add is_lexical + needs_llm_relevance_filter to ElasticsearchSearchEngine
- Delete vacuously-true test_missing_llm_returns_none test
- Downgrade no-op skip_relevance_filter log from info to debug

* refactor: extract relevance filter into dedicated module

Pull the inline _filter_for_relevance() logic out of BaseSearchEngine
into a new web_search_engines/relevance_filter.py module.

- Use with_structured_output() with Pydantic schema; let LangChain pick
  the per-provider default method (JSON schema on Ollama, tool-calling
  on Anthropic, responseSchema on Gemini).
- Trim prompt: drop URLs, cap snippets at 200 chars.
- Suppress reasoning on Ollama thinking-by-default models via
  reasoning=False (saves 30-60s per call on qwen3 dense variants).
- Treat empty LLM responses as valid judgments; log a warning on
  batches >2 so users notice a misbehaving model.
- On exception or parse failure, return first N previews (cap=5 or
  max_filtered_results) to avoid overwhelming downstream.

* refactor(relevance_filter): cleanup + add direct tests

* feat(relevance_filter): batch previews in parallel for speed and reliability

Adds two tunable parameters to the LLM relevance filter:
- batch_size: split previews into chunks before sending to the LLM.
  Each batch uses local indices [0..batch_size-1] mapped back to
  global. Default 10. Smaller batches are faster per call AND more
  reliable on weaker models that struggle with many indices in one
  context.
- max_parallel_batches: dispatch batches concurrently via a
  ThreadPoolExecutor. Default 4. Result order is preserved across
  parallel batches.

Both exposed as BaseSearchEngine class attributes
(relevance_filter_batch_size, relevance_filter_max_parallel_batches) so
individual engines can override.

Failure semantics:
- Hard exception on any batch -> capped slice fallback (unchanged).
- Parse failure on a single batch -> skip that batch only, keep results
  from successful batches.

Adds 4 direct unit tests covering chunk/index mapping, batch_size=None
single-call mode, failed-batch-skip-keeps-others, and parallel dispatch
order preservation. All 120 tests pass.

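The batching behavior described in the entry above (local indices mapped back to global positions, order preserved across parallel batches, a failed batch skipped on its own) could look roughly like the sketch below. It is an illustration only, not the code in relevance_filter.py: `filter_in_batches`, `judge_batch`, and `fallback_cap` are hypothetical names, `judge_batch` stands in for the real LLM call, and deduplicating and sorting the returned indices is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def filter_in_batches(
    previews: Sequence[dict],
    judge_batch: Callable[[Sequence[dict]], Optional[List[int]]],
    batch_size: Optional[int] = 10,
    max_parallel_batches: int = 4,
    fallback_cap: int = 5,
) -> List[dict]:
    """Keep the previews whose indices the judge returns, batch by batch.

    judge_batch (hypothetical) receives one batch and returns local
    indices in 0..len(batch)-1, or None if its reply could not be parsed.
    """
    if not previews:
        return []

    size = batch_size or len(previews)  # batch_size=None -> one single call
    starts = range(0, len(previews), size)
    batches = [previews[s:s + size] for s in starts]

    try:
        # Executor.map yields results in submission order, so the output
        # order follows the input order even if batches finish out of order.
        with ThreadPoolExecutor(max_workers=max(1, max_parallel_batches)) as pool:
            verdicts = list(pool.map(judge_batch, batches))
    except Exception:
        # Hard exception on any batch: fall back to a capped slice.
        return list(previews[:fallback_cap])

    kept: List[dict] = []
    for start, batch, local in zip(starts, batches, verdicts):
        if local is None:
            continue  # parse failure: skip this batch only
        for i in sorted(set(local)):  # assumption: dedupe and sort
            if 0 <= i < len(batch):  # ignore out-of-range indices
                kept.append(previews[start + i])  # local -> global index
    return kept
```

With seven previews, batch_size=3, and a toy judge that always returns [0], the kept items are the first of each batch, i.e. global positions 0, 3, and 6.
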
* refactor(relevance_filter): drop structured output, parse plain text

The Pydantic with_structured_output() path had several issues:
- qwen3 dense models returned prose instead of JSON, raising
  OutputParserException and disabling the filter for that call
- grammar-constrained output on Ollama was 6-10x slower than plain
  text generation (~24s vs ~4s for 50 previews)
- per-provider quirks (function_calling latency, schema bikeshedding)

Switch to plain llm.invoke() and parse integers from the response with
a tightened regex (word-boundary, no decimal fractions). The prompt now
instructs the model to output ONLY the indices, which combined with the
regex is robust against prose-injection of small numbers.

Removes RelevanceResult Pydantic class, _invoke_structured, the
_BATCH_FAILED_PARSE sentinel, and the "all batches failed" branch (all
dead under the new contract).

Updates tests to mock llm.invoke directly. Tightens default batch_size
to 5 and parallel batches to 10 based on benchmark runs against Ollama.

* docs: fix stale _filter_for_relevance docstring after text-parsing rewrite
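
The plain-text parsing from the "drop structured output" entry above can be sketched as follows. This is a guess at the shape of the approach, not the project's actual pattern: `_INDEX_RE` and `parse_indices` are hypothetical names, and the regex is one possible reading of "word-boundary, no decimal fractions". An empty list is returned as-is, in line with treating an empty reply as a valid judgment.

```python
import re
from typing import List

# Standalone integers only (illustrative pattern, an assumption):
#   (?<![\w.])  not preceded by a word character or a dot ("qwen3", ".5")
#   (?!\.\d)    not the integer part of a decimal fraction ("0.5")
#   (?!\w)      not followed by a word character ("3rd")
_INDEX_RE = re.compile(r"(?<![\w.])\d+(?!\.\d)(?!\w)")


def parse_indices(text: str, batch_len: int) -> List[int]:
    """Extract in-range, de-duplicated indices from a plain-text LLM reply."""
    seen = set()
    indices: List[int] = []
    for match in _INDEX_RE.finditer(text):
        i = int(match.group())
        # Drop indices outside the batch and repeats of an earlier index.
        if 0 <= i < batch_len and i not in seen:
            seen.add(i)
            indices.append(i)
    return indices
```

For a five-item batch, the reply "Relevant: 1 and 3. Score 0.5, model qwen3" yields [1, 3]: the fraction and the digit inside the model name are ignored, while the sentence-ending period after 3 does not block the match.
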