Extension Guide
This guide explains how to extend Local Deep Research with custom components.
Table of Contents
- Adding Custom Search Engines
- Adding Custom Search Strategies
- Using LangChain Retrievers
- Adding Custom LLM Providers
- Registering Custom LLMs
Adding Custom Search Engines
Search engines are responsible for fetching results from external sources. All engines extend BaseSearchEngine.
Basic Search Engine
Create a new file in src/local_deep_research/web_search_engines/engines/:
# search_engine_custom.py
from typing import Any, Dict, List, Optional
from langchain_core.language_models import BaseLLM
from loguru import logger
from ..search_engine_base import BaseSearchEngine
class CustomSearchEngine(BaseSearchEngine):
    """Custom search engine implementation."""

    # Classification flags - set appropriately for your engine
    is_public = True  # Searches public internet
    is_generic = False  # Specialized (vs general web search)
    is_scientific = False  # Academic/scientific content
    is_local = False  # Local document search
    is_news = False  # News content
    is_code = False  # Code repositories
    is_lexical = False  # Uses keyword/lexical search (informational)
    needs_llm_relevance_filter = False  # Set True to auto-enable LLM relevance filtering

    def __init__(
        self,
        max_results: int = 10,
        credential: Optional[str] = None,
        llm: Optional[BaseLLM] = None,
        max_filtered_results: Optional[int] = None,
        **kwargs,
    ):
        """
        Initialize the search engine.

        Args:
            max_results: Maximum number of results to return
            credential: API credential for the service (if required)
            llm: Language model for relevance filtering
            max_filtered_results: Max results after filtering
            **kwargs: Additional parameters
        """
        super().__init__(
            llm=llm,
            max_filtered_results=max_filtered_results,
            max_results=max_results,
        )
        self.credential = credential

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        """
        Get preview results (first phase of two-phase retrieval).

        Args:
            query: Search query

        Returns:
            List of preview dictionaries with keys:
            - id: Unique identifier
            - title: Result title
            - snippet: Brief description/summary
            - link: URL to the content
            - source: Source name (e.g., "CustomEngine")
        """
        logger.info(f"Searching custom engine for: {query}")

        # Apply rate limiting before request
        self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

        # Your search implementation here
        results = self._call_api(query)

        previews = []
        for item in results:
            previews.append({
                "id": item["id"],
                "title": item["title"],
                "snippet": item["description"],
                "link": item["url"],
                "source": "CustomEngine",
            })
        return previews

    def _get_full_content(
        self, relevant_items: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Get full content for relevant items (second phase).

        Args:
            relevant_items: Items that passed relevance filtering

        Returns:
            Items enriched with full content
        """
        results = []
        for item in relevant_items:
            # Apply rate limiting
            self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

            # Fetch full content
            full_content = self._fetch_content(item["link"])

            result = item.copy()
            result["content"] = full_content
            result["full_content"] = full_content
            results.append(result)
        return results

    def _call_api(self, query: str) -> List[Dict]:
        """Your API implementation."""
        # Implement your search logic here; it must return a list of dicts
        # with the keys that _get_previews reads (id, title, description, url).
        raise NotImplementedError("Implement the search API call")

    def _fetch_content(self, url: str) -> str:
        """Fetch full content from URL."""
        # Implement content fetching; it must return the page text as a string.
        raise NotImplementedError("Implement content fetching")
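The two-phase flow used above (cheap previews first, full content only for items that survive filtering) can be illustrated without the framework. The following is a minimal, self-contained sketch: `TwoPhaseEngine`, `keyword_filter`, and the in-memory `DOCS` data are hypothetical stand-ins and not part of Local Deep Research, and the keyword check only plays the role that the LLM relevance filter plays in the real pipeline.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory "API": id -> (title, snippet, full text).
DOCS = {
    "a": ("Rate limiting basics", "How token buckets work", "Full text about token buckets."),
    "b": ("Cooking pasta", "Boil water first", "Full text about pasta."),
}


class TwoPhaseEngine:
    """Toy engine mirroring the preview -> filter -> full-content flow."""

    def __init__(self, keep_relevant: Callable[[str, Dict[str, Any]], bool]):
        # keep_relevant stands in for the relevance filter.
        self.keep_relevant = keep_relevant
        self.full_fetches = 0  # counts the expensive phase-two calls

    def get_previews(self, query: str) -> List[Dict[str, Any]]:
        # Phase one: cheap metadata only, no full text.
        return [
            {"id": doc_id, "title": title, "snippet": snippet}
            for doc_id, (title, snippet, _full) in DOCS.items()
        ]

    def get_full_content(self, items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Phase two: fetch full text only for the items that were kept.
        results = []
        for item in items:
            self.full_fetches += 1
            enriched = dict(item)
            enriched["full_content"] = DOCS[item["id"]][2]
            results.append(enriched)
        return results

    def run(self, query: str) -> List[Dict[str, Any]]:
        previews = self.get_previews(query)
        relevant = [p for p in previews if self.keep_relevant(query, p)]
        return self.get_full_content(relevant)


def keyword_filter(query: str, preview: Dict[str, Any]) -> bool:
    # Crude stand-in: keep previews that mention any query word.
    text = f"{preview['title']} {preview['snippet']}".lower()
    return any(word in text for word in query.lower().split())


engine = TwoPhaseEngine(keyword_filter)
results = engine.run("token buckets")
```

Because filtering happens between the two phases, only one of the two documents triggers a full-content fetch here, which is the cost saving the two-phase design is meant to provide.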
Registering the Engine
Option 1: Register in engine_registry.py (Required)
Add the engine to src/local_deep_research/web_search_engines/engine_registry.py so the system knows how to load it. The registry maps engine names to their Python module and class:
# In engine_registry.py: ENGINE_REGISTRY dict
"custom_engine": EngineEntry(
    module_path=".engines.search_engine_custom",
    class_name="CustomSearchEngine",
),
Module paths must be relative (starting with .) and listed in the security whitelist (ALLOWED_MODULE_PATHS in module_whitelist.py).
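The two constraints on module paths can be pictured as a small check. This is only a sketch under stated assumptions: `EXAMPLE_ALLOWED_MODULE_PATHS` and `is_loadable_module_path` are hypothetical names, and the real `ALLOWED_MODULE_PATHS` in module_whitelist.py may be structured and enforced differently.

```python
from typing import FrozenSet

# Hypothetical whitelist contents, used only for this illustration.
EXAMPLE_ALLOWED_MODULE_PATHS: FrozenSet[str] = frozenset(
    {".engines.search_engine_custom"}
)


def is_loadable_module_path(
    module_path: str,
    allowed: FrozenSet[str] = EXAMPLE_ALLOWED_MODULE_PATHS,
) -> bool:
    """Return True only for relative, whitelisted module paths."""
    # Absolute paths such as "os" or "subprocess" are rejected outright.
    if not module_path.startswith("."):
        return False
    # Relative paths must additionally appear in the whitelist.
    return module_path in allowed
```

Under this sketch, a new engine module needs both a registry entry and a whitelist entry before it can be loaded.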
Option 1b: Configure user-facing settings (Optional)
After registering in the engine registry, you can expose user-configurable settings via the settings database:
# Key: search.engine.web.custom_engine
config = {
    "requires_api_key": True,
    "requires_llm": False,
    "description": "Custom search engine for specific use case",
    "strengths": ["Feature 1", "Feature 2"],
    "weaknesses": ["Limitation 1"],
    "reliability": 0.8,
    "default_params": {
        "max_results": 10
    }
}
Option 2: Modify Factory (For Core Engines)
Add to search_engine_factory.py:
def create_search_engine(engine_name: str, ...) -> BaseSearchEngine:
    # ... existing code ...

    if engine_name.lower() == "custom_engine":
        from .engines.search_engine_custom import CustomSearchEngine

        return CustomSearchEngine(
            max_results=max_results,
            credential=api_key,  # CustomSearchEngine names this parameter "credential"
            llm=llm,
            **kwargs,
        )
Search Engine Best Practices
- Always apply rate limiting before API calls:
      self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
- Set classification flags accurately, since they affect engine selection. For keyword-based engines without ML ranking, set is_lexical = True and needs_llm_relevance_filter = True; the factory will then auto-enable LLM relevance filtering.
- Handle errors gracefully: return an empty list on failure instead of crashing.
- Use logging for debugging:
      from loguru import logger
      logger.info(f"Searching for: {query}")
      logger.error(f"API error: {e}")
- Support snippet-only mode by checking the config:
      from ...config import search_config
      if search_config.SEARCH_SNIPPETS_ONLY:
          return relevant_items  # Skip full content
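To make the role of the needs_llm_relevance_filter flag concrete, here is a minimal sketch of how a factory could read it from the engine class. The helper `wants_relevance_filter` and the three example classes are hypothetical, and the requirement that an LLM be available is an assumption of this sketch; the real factory may also consult per-engine and global settings.

```python
from typing import Optional


class KeywordEngine:
    """Hypothetical keyword-based engine that opts in to filtering."""

    is_lexical = True
    needs_llm_relevance_filter = True


class RankedEngine:
    """Hypothetical engine with good native ranking; flags stay off."""

    is_lexical = False
    needs_llm_relevance_filter = False


class LegacyEngine:
    """Hypothetical older engine that never declared the flag."""


def wants_relevance_filter(engine_class: type, llm: Optional[object]) -> bool:
    # Read the class attribute, treating a missing flag as False.
    flagged = bool(getattr(engine_class, "needs_llm_relevance_filter", False))
    # Assumption of this sketch: filtering is only possible with an LLM.
    return flagged and llm is not None


fake_llm = object()  # placeholder for a real language model
decisions = {
    cls.__name__: wants_relevance_filter(cls, fake_llm)
    for cls in (KeywordEngine, RankedEngine, LegacyEngine)
}
```

Reading the flag with `getattr` and a default keeps engines that predate the flag working unchanged.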
Adding Custom Search Strategies
Strategies define how research is conducted - question generation, iteration, and synthesis.
Basic Strategy
Create a new file in src/local_deep_research/advanced_search_system/strategies/:
# my_custom_strategy.py
from typing import Dict, List, Optional
from loguru import logger
from .base_strategy import BaseSearchStrategy
class MyCustomStrategy(BaseSearchStrategy):
    """Custom search strategy implementation."""

    def __init__(
        self,
        search=None,
        model=None,
        all_links_of_system=None,
        settings_snapshot=None,
        max_iterations: int = 3,
        **kwargs,
    ):
        """
        Initialize the strategy.

        Args:
            search: Search engine instance
            model: LLM for question generation and synthesis
            all_links_of_system: Shared list for discovered links
            settings_snapshot: Configuration snapshot
            max_iterations: Maximum research iterations
            **kwargs: Additional parameters
        """
        super().__init__(
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
        )
        self.search = search
        self.model = model
        self.max_iterations = max_iterations

    def analyze_topic(self, query: str) -> Dict:
        """
        Execute the research strategy.

        Args:
            query: Research query

        Returns:
            Dict with:
            - findings: List of research findings
            - iterations: Number of iterations completed
            - questions: Dict of questions by iteration
            - formatted_findings: Formatted output string
            - current_knowledge: Accumulated knowledge dict
            - error: Optional error message
        """
        logger.info(f"Starting custom strategy for: {query}")
        findings = []
        current_knowledge = {}

        try:
            for iteration in range(1, self.max_iterations + 1):
                # Update progress
                self._update_progress(
                    f"Iteration {iteration}/{self.max_iterations}",
                    progress_percent=int(iteration / self.max_iterations * 100),
                    metadata={"iteration": iteration},
                )

                # Generate questions for this iteration
                questions = self._generate_questions(query, current_knowledge)
                self.questions_by_iteration[iteration] = questions

                # Search for each question
                for question in questions:
                    results = self._search(question)
                    findings.extend(results)

                    # Track links
                    for result in results:
                        if result.get("link"):
                            self.all_links_of_system.append(result["link"])

                # Synthesize findings
                current_knowledge = self._synthesize(findings)

                # Check if we should stop early
                if self._should_stop(current_knowledge):
                    logger.info(f"Early stopping at iteration {iteration}")
                    break

            # Format final output
            formatted = self._format_findings(findings, current_knowledge)
            return {
                "findings": findings,
                "iterations": iteration,
                "questions": self.questions_by_iteration,
                "formatted_findings": formatted,
                "current_knowledge": current_knowledge,
            }
        except Exception as e:
            logger.error(f"Strategy error: {e}")
            return {
                "findings": findings,
                "iterations": 0,
                "questions": self.questions_by_iteration,
                "formatted_findings": "",
                "current_knowledge": current_knowledge,
                "error": str(e),
            }

    def _generate_questions(self, query: str, knowledge: Dict) -> List[str]:
        """Generate research questions using the LLM."""
        prompt = f"""Given the query: {query}
        And current knowledge: {knowledge}
        Generate 3 specific research questions."""
        response = self.model.invoke(prompt)
        # Parse response into questions
        return self._parse_questions(response.content)

    def _search(self, question: str) -> List[Dict]:
        """Execute search for a question."""
        return self.search.run(question)

    def _synthesize(self, findings: List[Dict]) -> Dict:
        """Synthesize findings into knowledge."""
        # Implement synthesis logic
        return {"summary": "...", "key_points": [...]}

    def _should_stop(self, knowledge: Dict) -> bool:
        """Check if research should stop early."""
        # Implement stopping criteria
        return False

    def _format_findings(self, findings: List[Dict], knowledge: Dict) -> str:
        """Format findings as output string."""
        # Implement formatting
        return "Formatted research results..."

    def _parse_questions(self, content: str) -> List[str]:
        """Parse LLM response into question list."""
        # Implement parsing
        return content.strip().split("\n")
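The `_parse_questions` placeholder above splits on newlines only, so numbered or bulleted LLM output would keep its markers and blank lines would become empty questions. One possible refinement is sketched below; the function name `parse_questions`, the `limit` parameter, and the set of list markers handled are choices made for this example, not part of the framework.

```python
import re
from typing import List

# Matches a leading bullet ("-", "*", "•") or a number followed by "." or ")".
_LEADING_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def parse_questions(content: str, limit: int = 3) -> List[str]:
    """Turn free-form LLM output into at most `limit` clean questions."""
    questions = []
    for line in content.splitlines():
        # Strip list markers and surrounding whitespace.
        cleaned = _LEADING_MARKER.sub("", line).strip()
        # Skip lines that are empty after cleaning.
        if cleaned:
            questions.append(cleaned)
    return questions[:limit]


sample = "1. What is X?\n\n- How does Y work?\n2) Why Z?\nExtra question?"
parsed = parse_questions(sample)
```

Capping the list at `limit` keeps the number of searches per iteration predictable even when the model returns more questions than requested.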
Registering the Strategy
Add to search_system_factory.py:
def create_strategy(strategy_name: str, ...) -> BaseSearchStrategy:
    strategy_name_lower = strategy_name.lower()

    # ... existing strategies ...

    elif strategy_name_lower in ["my-custom", "mycustom", "custom"]:
        from .advanced_search_system.strategies.my_custom_strategy import (
            MyCustomStrategy,
        )

        return MyCustomStrategy(
            search=search,
            model=model,
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
            **kwargs
        )
Strategy Best Practices
- Use progress callbacks to update the UI:
      self._update_progress("Searching...", progress_percent=50)
- Track all discovered links in self.all_links_of_system.
- Store questions by iteration in self.questions_by_iteration.
- Access settings via the snapshot:
      max_results = self.get_setting("search.max_results", default=10)
- Handle errors gracefully: return partial results with an error message.
Using LangChain Retrievers
The easiest way to add custom search is through LangChain retrievers.
Registering a Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from local_deep_research.web_search_engines.retriever_registry import retriever_registry
# Create your retriever
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Register globally
retriever_registry.register("my_documents", retriever)
# Now use in research
from local_deep_research.api import quick_summary
result = quick_summary(
    query="What does the documentation say about X?",
    search_tool="my_documents",  # Use registered retriever
    programmatic_mode=True
)
Passing Retrievers Directly
from local_deep_research.api import quick_summary
# Create retriever
retriever = my_vectorstore.as_retriever()
# Pass directly to API
result = quick_summary(
    query="Search my documents",
    retrievers={"private_docs": retriever},
    search_tool="private_docs",
    programmatic_mode=True
)
Registry Methods
from local_deep_research.web_search_engines.retriever_registry import retriever_registry
# Register
retriever_registry.register("name", retriever)
retriever_registry.register_multiple({"a": ret1, "b": ret2})
# Query
retriever_registry.get("name")
retriever_registry.is_registered("name")
retriever_registry.list_registered()
# Remove
retriever_registry.unregister("name")
retriever_registry.clear()
Adding Custom LLM Providers
LLM providers wrap language model APIs for use in LDR.
Basic Provider
Create in src/local_deep_research/llm/providers/implementations/:
# my_provider.py
from typing import Any, Dict, Optional
from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI
from ..openai_compatible import OpenAICompatibleProvider
class MyProvider(OpenAICompatibleProvider):
    """Custom LLM provider."""

    provider_name = "My Provider"
    api_key_setting = "llm.my_provider.api_key"
    url_setting = "llm.my_provider.url"
    default_base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-v1"

    @classmethod
    def create_llm(
        cls,
        model_name: Optional[str] = None,
        temperature: float = 0.7,
        settings_snapshot: Optional[Dict] = None,
        **kwargs
    ) -> BaseChatModel:
        """
        Create LLM instance.

        Args:
            model_name: Model to use
            temperature: Sampling temperature
            settings_snapshot: Configuration
            **kwargs: Additional parameters

        Returns:
            LangChain chat model instance
        """
        settings_snapshot = settings_snapshot or {}

        # Get API key from settings
        api_key = cls._get_setting(settings_snapshot, cls.api_key_setting)
        if not api_key:
            raise ValueError(f"API key not found in {cls.api_key_setting}")

        # Get base URL
        base_url = cls._get_setting(
            settings_snapshot, cls.url_setting, cls.default_base_url
        )

        return ChatOpenAI(
            model=model_name or cls.default_model,
            temperature=temperature,
            api_key=api_key,
            base_url=base_url,
            **kwargs
        )

    @classmethod
    def list_models(cls, settings_snapshot: Optional[Dict] = None) -> list[str]:
        """List available models."""
        return ["my-model-v1", "my-model-v2", "my-model-large"]
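The provider above relies on a settings lookup with a fallback default for the base URL. The sketch below shows one way such a lookup could behave. It is an illustration only: `get_setting` is a hypothetical standalone function, and the snapshot layout (a flat dict keyed by dotted names, with values stored either raw or wrapped as `{"value": ...}`) is an assumption rather than the documented format.

```python
from typing import Any, Dict, Optional


def get_setting(
    snapshot: Optional[Dict[str, Any]], key: str, default: Any = None
) -> Any:
    """Look up `key` in a settings snapshot, falling back to `default`."""
    if not snapshot or key not in snapshot:
        return default
    entry = snapshot[key]
    # Assumed layout: values may be wrapped as {"value": ...}.
    if isinstance(entry, dict) and "value" in entry:
        entry = entry["value"]
    # Treat missing or blank values as "not configured".
    if entry is None or entry == "":
        return default
    return entry


snapshot = {
    "llm.my_provider.api_key": {"value": "secret-key"},
    "llm.my_provider.url": "",
}
api_key = get_setting(snapshot, "llm.my_provider.api_key")
base_url = get_setting(
    snapshot, "llm.my_provider.url", "https://api.myprovider.com/v1"
)
```

Treating a blank string like a missing value means a user who clears the URL field falls back to the provider default instead of sending requests to an empty URL.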
Register in Auto-Discovery
Add to auto_discovery.py:
PROVIDER_METADATA = {
    # ... existing providers ...
    "my_provider": ProviderMetadata(
        provider_id="my_provider",
        provider_name="My Provider",
        company_name="My Company",
        region="US",
        country="United States",
        data_location="US",
        gdpr_compliant=False,
        is_cloud=True,
    ),
}
Registering Custom LLMs
For programmatic use, register LLMs directly:
from langchain_openai import ChatOpenAI
from local_deep_research.llm.llm_registry import register_llm, get_llm_from_registry
# Create custom LLM
custom_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.5,
    api_key="...",
)
# Register it
register_llm("my_gpt4", custom_llm)
# Use in research
from local_deep_research.api import quick_summary
result = quick_summary(
    query="Research topic",
    llms={"my_gpt4": custom_llm},  # Or use registered name
    provider_name="my_gpt4",
    programmatic_mode=True
)
Factory Functions
You can also register factory functions:
def create_my_llm(temperature=0.7):
    return ChatOpenAI(model="gpt-4", temperature=temperature)
register_llm("my_factory", create_my_llm)
# Will be called when needed
llm = get_llm_from_registry("my_factory")
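The difference between registering an instance and registering a factory is when the model object gets built. The sketch below illustrates deferred construction with a toy registry. `FakeLLM`, `register`, and `resolve` are hypothetical names for this example, and the real `register_llm` / `get_llm_from_registry` may behave differently (for instance, by returning the factory and leaving the call to the consumer).

```python
from typing import Any, Callable, Dict, List, Union


class FakeLLM:
    """Stand-in for a LangChain chat model (hypothetical)."""

    def __init__(self, model: str, temperature: float = 0.7):
        self.model = model
        self.temperature = temperature


LLMOrFactory = Union[FakeLLM, Callable[..., FakeLLM]]
_registry: Dict[str, LLMOrFactory] = {}


def register(name: str, llm_or_factory: LLMOrFactory) -> None:
    # Store instances and factories side by side under a name.
    _registry[name] = llm_or_factory


def resolve(name: str, **kwargs: Any) -> FakeLLM:
    entry = _registry[name]
    if isinstance(entry, FakeLLM):
        return entry  # instances are shared as-is
    return entry(**kwargs)  # factories build a fresh model per lookup


created: List[float] = []  # records each factory invocation


def make_llm(temperature: float = 0.7) -> FakeLLM:
    created.append(temperature)
    return FakeLLM("example-model", temperature)


register("eager", FakeLLM("example-model", 0.5))
register("lazy", make_llm)
calls_before_lookup = len(created)  # nothing has been built yet
lazy_llm = resolve("lazy", temperature=0.2)
```

In this sketch a factory costs nothing until first use and can take per-call parameters, while a registered instance is shared by every caller.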
See Also
- Architecture Overview - System architecture
- Database Schema - Data models
- Full Configuration Reference - All settings and environment variables
- Troubleshooting - Common issues
- API Quickstart - Using the API