local-deep-research/docs/developing/EXTENDING.md
LearningCircuit 061cd83dd4 feat: add is_lexical flag to auto-enable LLM relevance filtering for keyword-based engines (#3403)
* feat: add needs_reranking flag to auto-enable LLM relevance filtering for keyword-based engines

Engines with poor native relevance ranking (arXiv, PubMed, Wikipedia,
GitHub, Mojeek, etc.) now auto-enable LLM-based result filtering via
a new `needs_reranking` class attribute. This fixes the priority bug
where the global `skip_relevance_filter=True` incorrectly overrode
auto-detection for engines that genuinely need filtering.

Priority is now: per-engine setting > needs_reranking > global skip.
The global skip only affects unclassified engines.

Closes #2297

* fix: address 7 code-review issues on needs_reranking branch

1. Rename needs_reranking → needs_llm_relevance_filter for consistency
   with enable_llm_relevance_filter and skip_relevance_filter naming
2. Fix Paperless dead code: replace non-existent _apply_content_filters
   with proper _filter_for_relevance() call in custom run() override
3. Fix misleading skip_relevance_filter description to accurately
   reflect checkbox behavior and keyword engine exceptions
4. Delete 4 vacuously-true inline tests that duplicated factory logic
   instead of calling the real factory (coverage tests already exist)
5. Add needs_llm_relevance_filter to EXTENDING.md and OVERVIEW.md
6. Clarify is_generic comment: generic does not imply good ranking
7. Upgrade no-LLM log from debug to warning when filtering was
   requested but no LLM is available (with should_filter guard)

* fix: remove Paperless fallback that overrode valid empty LLM filter results

Replace the fallback that restored all previews when the LLM filter
returned empty with an info log. The base class _filter_for_relevance()
already handles errors internally (returns previews[:5] on exception
or JSON parse failure). An empty result means the LLM legitimately
found nothing relevant — trust it, don't override it.

* refactor: rename needs_llm_relevance_filter → is_lexical

The flag describes what the engine IS (lexical/keyword-based search)
rather than what it needs. This is a general classification that can
drive multiple behaviors beyond just the relevance filter — e.g.
query optimization strategies, result deduplication, or UI hints.
Matches the existing is_* naming pattern (is_scientific, is_generic).

* Revert "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit c322d478a1.

* Reapply "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit 853dfe90bd.

* feat: add is_lexical classification flag alongside needs_llm_relevance_filter

Separates classification from behavior:
- is_lexical: informational flag indicating the engine uses keyword/lexical
  search. Reusable for query optimization, UI hints, deduplication, etc.
- needs_llm_relevance_filter: behavioral flag that the factory reads to
  auto-enable LLM relevance filtering on the engine instance.

Both flags are set on all 15 keyword-based engines. The factory only
checks needs_llm_relevance_filter for filtering decisions.

* fix: improve relevance filter error handling and logging

- Return [] on all error paths instead of hiding failures behind
  previews[:5] fallback — failures should be visible, not masked
- Log errors at error level (not warning) for LLM parse failures
- Add engine name prefix to all log messages for traceability
- Add token estimate debug log to help diagnose context overflow
- Reduce log noise: routine operations are debug, only summary is info
- Consolidate validation into single check

* fix: address PR review findings for relevance filter

- Fix literal \n in EXTENDING.md code block
- Remove 'Maximum results to return' from LLM prompt (LLM decides)
- Add INPUT/KEPT/REMOVED debug logging for filter quality analysis
- Add is_lexical + needs_llm_relevance_filter to ElasticsearchSearchEngine
- Delete vacuously-true test_missing_llm_returns_none test
- Downgrade no-op skip_relevance_filter log from info to debug

* refactor: extract relevance filter into dedicated module

Pull the inline _filter_for_relevance() logic out of BaseSearchEngine
into a new web_search_engines/relevance_filter.py module.

- Use with_structured_output() with Pydantic schema; let LangChain
  pick the per-provider default method (JSON schema on Ollama,
  tool-calling on Anthropic, responseSchema on Gemini).
- Trim prompt: drop URLs, cap snippets at 200 chars.
- Suppress reasoning on Ollama thinking-by-default models via
  reasoning=False — saves 30-60s per call on qwen3 dense variants.
- Treat empty LLM responses as valid judgments; log a warning on
  batches >2 so users notice a misbehaving model.
- On exception or parse failure, return first N previews (cap=5 or
  max_filtered_results) to avoid overwhelming downstream.

* refactor(relevance_filter): cleanup + add direct tests

* feat(relevance_filter): batch previews in parallel for speed and reliability

Adds two tunable parameters to the LLM relevance filter:

- batch_size: split previews into chunks before sending to the LLM.
  Each batch uses local indices [0..batch_size-1] mapped back to
  global. Default 10. Smaller batches are faster per call AND more
  reliable on weaker models that struggle with many indices in one
  context.

- max_parallel_batches: dispatch batches concurrently via a
  ThreadPoolExecutor. Default 4. Result order is preserved across
  parallel batches.

Both exposed as BaseSearchEngine class attributes
(relevance_filter_batch_size, relevance_filter_max_parallel_batches)
so individual engines can override.

Failure semantics:
- Hard exception on any batch -> capped slice fallback (unchanged).
- Parse failure on a single batch -> skip that batch only, keep
  results from successful batches.

Adds 4 direct unit tests covering chunk/index mapping, batch_size=None
single-call mode, failed-batch-skip-keeps-others, and parallel dispatch
order preservation. All 120 tests pass.

* refactor(relevance_filter): drop structured output, parse plain text

The Pydantic with_structured_output() path had several issues:
- qwen3 dense models returned prose instead of JSON, raising
  OutputParserException and disabling the filter for that call
- grammar-constrained output on Ollama was 6-10x slower than plain
  text generation (~24s vs ~4s for 50 previews)
- per-provider quirks (function_calling latency, schema bikeshedding)

Switch to plain llm.invoke() and parse integers from the response with
a tightened regex (word-boundary, no decimal fractions). The prompt
now instructs the model to output ONLY the indices, which combined
with the regex is robust against prose-injection of small numbers.

Removes RelevanceResult Pydantic class, _invoke_structured, the
_BATCH_FAILED_PARSE sentinel, and the "all batches failed" branch
(all dead under the new contract). Updates tests to mock llm.invoke
directly. Tightens default batch_size to 5 and parallel batches to 10
based on benchmark runs against Ollama.

* docs: fix stale _filter_for_relevance docstring after text-parsing rewrite
2026-04-06 23:04:47 +02:00


# Extension Guide
This guide explains how to extend Local Deep Research with custom components.
## Table of Contents
- [Adding Custom Search Engines](#adding-custom-search-engines)
- [Adding Custom Search Strategies](#adding-custom-search-strategies)
- [Using LangChain Retrievers](#using-langchain-retrievers)
- [Adding Custom LLM Providers](#adding-custom-llm-providers)
- [Registering Custom LLMs](#registering-custom-llms)
---
## Adding Custom Search Engines
Search engines are responsible for fetching results from external sources. All engines extend `BaseSearchEngine`.
### Basic Search Engine
Create a new file in `src/local_deep_research/web_search_engines/engines/`:
```python
# search_engine_custom.py
from typing import Any, Dict, List, Optional

from langchain_core.language_models import BaseLLM
from loguru import logger

from ..search_engine_base import BaseSearchEngine


class CustomSearchEngine(BaseSearchEngine):
    """Custom search engine implementation."""

    # Classification flags - set appropriately for your engine
    is_public = True  # Searches public internet
    is_generic = False  # Specialized (vs general web search)
    is_scientific = False  # Academic/scientific content
    is_local = False  # Local document search
    is_news = False  # News content
    is_code = False  # Code repositories
    is_lexical = False  # Uses keyword/lexical search (informational)
    needs_llm_relevance_filter = False  # Set True to auto-enable LLM relevance filtering

    def __init__(
        self,
        max_results: int = 10,
        credential: Optional[str] = None,
        llm: Optional[BaseLLM] = None,
        max_filtered_results: Optional[int] = None,
        **kwargs,
    ):
        """
        Initialize the search engine.

        Args:
            max_results: Maximum number of results to return
            credential: API credential for the service (if required)
            llm: Language model for relevance filtering
            max_filtered_results: Max results after filtering
            **kwargs: Additional parameters
        """
        super().__init__(
            llm=llm,
            max_filtered_results=max_filtered_results,
            max_results=max_results,
        )
        self.credential = credential

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        """
        Get preview results (first phase of two-phase retrieval).

        Args:
            query: Search query

        Returns:
            List of preview dictionaries with keys:
            - id: Unique identifier
            - title: Result title
            - snippet: Brief description/summary
            - link: URL to the content
            - source: Source name (e.g., "CustomEngine")
        """
        logger.info(f"Searching custom engine for: {query}")
        # Apply rate limiting before request
        self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
        # Your search implementation here
        results = self._call_api(query)
        previews = []
        for item in results:
            previews.append({
                "id": item["id"],
                "title": item["title"],
                "snippet": item["description"],
                "link": item["url"],
                "source": "CustomEngine",
            })
        return previews

    def _get_full_content(
        self, relevant_items: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Get full content for relevant items (second phase).

        Args:
            relevant_items: Items that passed relevance filtering

        Returns:
            Items enriched with full content
        """
        results = []
        for item in relevant_items:
            # Apply rate limiting
            self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
            # Fetch full content
            full_content = self._fetch_content(item["link"])
            result = item.copy()
            result["content"] = full_content
            result["full_content"] = full_content
            results.append(result)
        return results

    def _call_api(self, query: str) -> List[Dict]:
        """Your API implementation."""
        # Implement your search logic here; must return a list of dicts
        raise NotImplementedError("Implement the API call for your service")

    def _fetch_content(self, url: str) -> str:
        """Fetch full content from URL."""
        # Implement content fetching; must return the page text
        raise NotImplementedError("Implement content fetching for your service")
```
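The two methods above map onto the two phases of retrieval: previews first, then full content only for the items that survive relevance filtering. The sketch below illustrates that flow with standard-library code only. `TwoPhaseSketch`, its `run` method, and the keyword-matching `_filter` are hypothetical stand-ins. They are not the actual `BaseSearchEngine` implementation, which may order or guard these steps differently.

```python
from typing import Any, Dict, List


class TwoPhaseSketch:
    """Toy stand-in for the preview -> filter -> full-content flow (hypothetical)."""

    def __init__(self, corpus: Dict[str, Dict[str, str]]) -> None:
        self.corpus = corpus  # id -> {"title", "snippet", "body"}
        self.fetch_count = 0  # how many full documents were fetched

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        # Phase 1: cheap metadata only, no document bodies.
        return [
            {"id": doc_id, "title": doc["title"], "snippet": doc["snippet"]}
            for doc_id, doc in self.corpus.items()
        ]

    def _filter(self, previews: List[Dict[str, Any]], query: str) -> List[Dict[str, Any]]:
        # Stand-in for the LLM relevance filter: keep previews whose snippet mentions the query.
        needle = query.lower()
        return [p for p in previews if needle in p["snippet"].lower()]

    def _get_full_content(self, relevant_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Phase 2: fetch bodies only for items that passed the filter.
        results = []
        for item in relevant_items:
            self.fetch_count += 1
            enriched = dict(item)
            enriched["full_content"] = self.corpus[item["id"]]["body"]
            results.append(enriched)
        return results

    def run(self, query: str) -> List[Dict[str, Any]]:
        previews = self._get_previews(query)
        relevant = self._filter(previews, query)
        return self._get_full_content(relevant)


engine = TwoPhaseSketch({
    "a": {"title": "Cats", "snippet": "All about cats", "body": "Long cat text"},
    "b": {"title": "Dogs", "snippet": "All about dogs", "body": "Long dog text"},
})
results = engine.run("cats")
```

Only one of the two documents is fetched in full here, which is the point of splitting retrieval into phases: the expensive step runs on the filtered subset.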
### Registering the Engine
**Option 1: Register in engine_registry.py (Required)**
Add the engine to `src/local_deep_research/web_search_engines/engine_registry.py` so the system knows how to load it. The registry maps engine names to their Python module and class:
```python
# In engine_registry.py — ENGINE_REGISTRY dict
"custom_engine": EngineEntry(
    module_path=".engines.search_engine_custom",
    class_name="CustomSearchEngine",
),
```
Module paths must be relative (starting with `.`) and listed in the security whitelist (`ALLOWED_MODULE_PATHS` in `module_whitelist.py`).
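
To illustrate the two constraints (relative path, whitelisted), a check along the following lines could run before a module is imported. This is a sketch only: `validate_module_path` and the contents of the `ALLOWED_MODULE_PATHS` set shown here are assumptions for illustration, not the project's actual `module_whitelist.py`.

```python
from typing import FrozenSet

# Hypothetical whitelist contents, for illustration only.
ALLOWED_MODULE_PATHS: FrozenSet[str] = frozenset({
    ".engines.search_engine_custom",
})


def validate_module_path(
    module_path: str, allowed: FrozenSet[str] = ALLOWED_MODULE_PATHS
) -> str:
    """Return module_path if it is relative and whitelisted, else raise ValueError."""
    if not module_path.startswith("."):
        raise ValueError(f"Module path must be relative: {module_path!r}")
    if module_path not in allowed:
        raise ValueError(f"Module path is not whitelisted: {module_path!r}")
    return module_path
```

A validated path could then be handed to `importlib.import_module(module_path, package=...)`, where the package argument depends on where the registry module lives.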
**Option 1b: Configure user-facing settings (Optional)**
After registering in the engine registry, you can expose user-configurable settings via the settings database:
```python
# Key: search.engine.web.custom_engine
config = {
    "requires_api_key": True,
    "requires_llm": False,
    "description": "Custom search engine for specific use case",
    "strengths": ["Feature 1", "Feature 2"],
    "weaknesses": ["Limitation 1"],
    "reliability": 0.8,
    "default_params": {
        "max_results": 10
    }
}
```
**Option 2: Modify Factory (For Core Engines)**
Add to `search_engine_factory.py`:
```python
def create_search_engine(engine_name: str, ...) -> BaseSearchEngine:
    # ... existing code ...
    if engine_name.lower() == "custom_engine":
        from .engines.search_engine_custom import CustomSearchEngine

        return CustomSearchEngine(
            max_results=max_results,
            credential=api_key,  # CustomSearchEngine's constructor names this parameter `credential`
            llm=llm,
            **kwargs,
        )
```
### Search Engine Best Practices
1. **Always apply rate limiting** before API calls:
```python
self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
```
2. **Set classification flags** accurately - they affect engine selection. For keyword-based engines without ML ranking, set `is_lexical = True` and `needs_llm_relevance_filter = True` — the factory will auto-enable LLM relevance filtering
3. **Handle errors gracefully** - return empty list on failure, don't crash
4. **Use logging** for debugging:
```python
from loguru import logger
logger.info(f"Searching for: {query}")
logger.error(f"API error: {e}")
```
5. **Support snippet-only mode** by checking the config:
```python
from ...config import search_config
if search_config.SEARCH_SNIPPETS_ONLY:
    return relevant_items  # Skip full content
```
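Practice 3 (return an empty list on failure) can be sketched as a thin wrapper around the API call. `safe_previews` and `flaky_api` are hypothetical names for this example. The sketch uses the standard `logging` module so that it runs without extra dependencies, whereas an engine would log through `loguru` as shown above.

```python
import logging
from typing import Any, Callable, Dict, List

log = logging.getLogger("custom_engine_sketch")


def safe_previews(
    call_api: Callable[[str], Any], query: str
) -> List[Dict[str, Any]]:
    """Call the API and return [] instead of raising if anything goes wrong."""
    try:
        results = call_api(query)
    except Exception as exc:  # broad on purpose: a failed search should not crash research
        log.error("API error for query %r: %s", query, exc)
        return []
    # Treat a None response the same as "no results".
    return list(results or [])


def flaky_api(query: str) -> List[Dict[str, Any]]:
    """Simulated backend that always times out."""
    raise TimeoutError("upstream timed out")


previews = safe_previews(flaky_api, "example query")  # logs the error, yields []
```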
---
## Adding Custom Search Strategies
Strategies define how research is conducted - question generation, iteration, and synthesis.
### Basic Strategy
Create a new file in `src/local_deep_research/advanced_search_system/strategies/`:
```python
# my_custom_strategy.py
from typing import Dict, List, Optional

from loguru import logger

from .base_strategy import BaseSearchStrategy


class MyCustomStrategy(BaseSearchStrategy):
    """Custom search strategy implementation."""

    def __init__(
        self,
        search=None,
        model=None,
        all_links_of_system=None,
        settings_snapshot=None,
        max_iterations: int = 3,
        **kwargs,
    ):
        """
        Initialize the strategy.

        Args:
            search: Search engine instance
            model: LLM for question generation and synthesis
            all_links_of_system: Shared list for discovered links
            settings_snapshot: Configuration snapshot
            max_iterations: Maximum research iterations
            **kwargs: Additional parameters
        """
        super().__init__(
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
        )
        self.search = search
        self.model = model
        self.max_iterations = max_iterations

    def analyze_topic(self, query: str) -> Dict:
        """
        Execute the research strategy.

        Args:
            query: Research query

        Returns:
            Dict with:
            - findings: List of research findings
            - iterations: Number of iterations completed
            - questions: Dict of questions by iteration
            - formatted_findings: Formatted output string
            - current_knowledge: Accumulated knowledge dict
            - error: Optional error message
        """
        logger.info(f"Starting custom strategy for: {query}")
        findings = []
        current_knowledge = {}
        try:
            for iteration in range(1, self.max_iterations + 1):
                # Update progress
                self._update_progress(
                    f"Iteration {iteration}/{self.max_iterations}",
                    progress_percent=int(iteration / self.max_iterations * 100),
                    metadata={"iteration": iteration},
                )
                # Generate questions for this iteration
                questions = self._generate_questions(query, current_knowledge)
                self.questions_by_iteration[iteration] = questions
                # Search for each question
                for question in questions:
                    results = self._search(question)
                    findings.extend(results)
                    # Track links
                    for result in results:
                        if result.get("link"):
                            self.all_links_of_system.append(result["link"])
                # Synthesize findings
                current_knowledge = self._synthesize(findings)
                # Check if we should stop early
                if self._should_stop(current_knowledge):
                    logger.info(f"Early stopping at iteration {iteration}")
                    break
            # Format final output
            formatted = self._format_findings(findings, current_knowledge)
            return {
                "findings": findings,
                "iterations": iteration,
                "questions": self.questions_by_iteration,
                "formatted_findings": formatted,
                "current_knowledge": current_knowledge,
            }
        except Exception as e:
            logger.error(f"Strategy error: {e}")
            return {
                "findings": findings,
                "iterations": 0,
                "questions": self.questions_by_iteration,
                "formatted_findings": "",
                "current_knowledge": current_knowledge,
                "error": str(e),
            }

    def _generate_questions(self, query: str, knowledge: Dict) -> List[str]:
        """Generate research questions using the LLM."""
        prompt = f"""Given the query: {query}
And current knowledge: {knowledge}
Generate 3 specific research questions."""
        response = self.model.invoke(prompt)
        # Parse response into questions
        return self._parse_questions(response.content)

    def _search(self, question: str) -> List[Dict]:
        """Execute search for a question."""
        return self.search.run(question)

    def _synthesize(self, findings: List[Dict]) -> Dict:
        """Synthesize findings into knowledge."""
        # Implement synthesis logic
        return {"summary": "...", "key_points": [...]}

    def _should_stop(self, knowledge: Dict) -> bool:
        """Check if research should stop early."""
        # Implement stopping criteria
        return False

    def _format_findings(self, findings: List[Dict], knowledge: Dict) -> str:
        """Format findings as output string."""
        # Implement formatting
        return "Formatted research results..."

    def _parse_questions(self, content: str) -> List[str]:
        """Parse LLM response into question list."""
        # Implement parsing
        return content.strip().split("\n")
```
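The `_parse_questions` placeholder above splits on newlines only, so it keeps blank lines and any list numbering the model emits. A slightly more defensive parser might look like the sketch below. `parse_questions` is a hypothetical standalone helper, and the list-marker formats it strips are assumptions about typical LLM output rather than anything the project specifies.

```python
import re
from typing import List

# Leading list markers such as "1.", "2)", "-", "*" or "•", followed by whitespace.
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_questions(content: str, limit: int = 3) -> List[str]:
    """Split an LLM response into at most `limit` non-empty question strings."""
    questions: List[str] = []
    for line in content.splitlines():
        cleaned = _LIST_MARKER.sub("", line).strip()
        if cleaned:  # drop blank lines
            questions.append(cleaned)
    return questions[:limit]


parsed = parse_questions("1. What is X?\n\n2) How does Y work?\n- Why Z?")
```

The `limit` default of 3 mirrors the prompt above, which asks for three questions. Capping the list guards against a model that returns more than requested.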
### Registering the Strategy
Add to `search_system_factory.py`:
```python
def create_strategy(strategy_name: str, ...) -> BaseSearchStrategy:
    strategy_name_lower = strategy_name.lower()
    # ... existing strategies ...
    elif strategy_name_lower in ["my-custom", "mycustom", "custom"]:
        from .advanced_search_system.strategies.my_custom_strategy import (
            MyCustomStrategy,
        )

        return MyCustomStrategy(
            search=search,
            model=model,
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
            **kwargs,
        )
```
### Strategy Best Practices
1. **Use progress callbacks** to update the UI:
```python
self._update_progress("Searching...", progress_percent=50)
```
2. **Track all discovered links** in `self.all_links_of_system`
3. **Store questions by iteration** in `self.questions_by_iteration`
4. **Access settings** via the snapshot:
```python
max_results = self.get_setting("search.max_results", default=10)
```
5. **Handle errors gracefully** - return partial results with error message
---
## Using LangChain Retrievers
The easiest way to add custom search is through LangChain retrievers.
### Registering a Retriever
```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from local_deep_research.web_search_engines.retriever_registry import retriever_registry
# Create your retriever (`documents` is your own list of LangChain Document objects)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Register globally
retriever_registry.register("my_documents", retriever)

# Now use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="What does the documentation say about X?",
    search_tool="my_documents",  # Use registered retriever
    programmatic_mode=True,
)
```
### Passing Retrievers Directly
```python
from local_deep_research.api import quick_summary
# Create retriever
retriever = my_vectorstore.as_retriever()
# Pass directly to API
result = quick_summary(
    query="Search my documents",
    retrievers={"private_docs": retriever},
    search_tool="private_docs",
    programmatic_mode=True,
)
```
### Registry Methods
```python
from local_deep_research.web_search_engines.retriever_registry import retriever_registry
# Register
retriever_registry.register("name", retriever)
retriever_registry.register_multiple({"a": ret1, "b": ret2})
# Query
retriever_registry.get("name")
retriever_registry.is_registered("name")
retriever_registry.list_registered()
# Remove
retriever_registry.unregister("name")
retriever_registry.clear()
```
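To make the semantics of these calls concrete, here is a minimal in-memory registry that exposes the same method names. It is a sketch, not the project's implementation: `SketchRegistry` is hypothetical, and details such as guarding the dict with a lock and returning `None` for unknown names are assumptions.

```python
import threading
from typing import Any, Dict, List, Optional


class SketchRegistry:
    """Hypothetical name -> retriever map mirroring the method names listed above."""

    def __init__(self) -> None:
        self._items: Dict[str, Any] = {}
        self._lock = threading.Lock()  # assumption: registrations may come from several threads

    def register(self, name: str, retriever: Any) -> None:
        with self._lock:
            self._items[name] = retriever

    def register_multiple(self, retrievers: Dict[str, Any]) -> None:
        with self._lock:
            self._items.update(retrievers)

    def get(self, name: str) -> Optional[Any]:
        with self._lock:
            return self._items.get(name)  # None when the name is unknown

    def is_registered(self, name: str) -> bool:
        with self._lock:
            return name in self._items

    def list_registered(self) -> List[str]:
        with self._lock:
            return list(self._items)

    def unregister(self, name: str) -> None:
        with self._lock:
            self._items.pop(name, None)  # removing a missing name is a no-op

    def clear(self) -> None:
        with self._lock:
            self._items.clear()


registry = SketchRegistry()
registry.register("name", object())
registry.register_multiple({"a": "ret1", "b": "ret2"})
names = registry.list_registered()
```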
---
## Adding Custom LLM Providers
LLM providers wrap language model APIs for use in LDR.
### Basic Provider
Create in `src/local_deep_research/llm/providers/implementations/`:
```python
# my_provider.py
from typing import Any, Dict, Optional

from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI

from ..openai_compatible import OpenAICompatibleProvider


class MyProvider(OpenAICompatibleProvider):
    """Custom LLM provider."""

    provider_name = "My Provider"
    api_key_setting = "llm.my_provider.api_key"
    url_setting = "llm.my_provider.url"
    default_base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-v1"

    @classmethod
    def create_llm(
        cls,
        model_name: Optional[str] = None,
        temperature: float = 0.7,
        settings_snapshot: Optional[Dict] = None,
        **kwargs,
    ) -> BaseChatModel:
        """
        Create LLM instance.

        Args:
            model_name: Model to use
            temperature: Sampling temperature
            settings_snapshot: Configuration
            **kwargs: Additional parameters

        Returns:
            LangChain chat model instance
        """
        settings_snapshot = settings_snapshot or {}
        # Get API key from settings
        api_key = cls._get_setting(settings_snapshot, cls.api_key_setting)
        if not api_key:
            raise ValueError(f"API key not found in {cls.api_key_setting}")
        # Get base URL
        base_url = cls._get_setting(
            settings_snapshot, cls.url_setting, cls.default_base_url
        )
        return ChatOpenAI(
            model=model_name or cls.default_model,
            temperature=temperature,
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )

    @classmethod
    def list_models(cls, settings_snapshot: Optional[Dict] = None) -> list[str]:
        """List available models."""
        return ["my-model-v1", "my-model-v2", "my-model-large"]
```
### Register in Auto-Discovery
Add to `auto_discovery.py`:
```python
PROVIDER_METADATA = {
    # ... existing providers ...
    "my_provider": ProviderMetadata(
        provider_id="my_provider",
        provider_name="My Provider",
        company_name="My Company",
        region="US",
        country="United States",
        data_location="US",
        gdpr_compliant=False,
        is_cloud=True,
    ),
}
```
---
## Registering Custom LLMs
For programmatic use, register LLMs directly:
```python
from langchain_openai import ChatOpenAI
from local_deep_research.llm.llm_registry import register_llm, get_llm_from_registry
# Create custom LLM
custom_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.5,
    api_key="...",
)

# Register it
register_llm("my_gpt4", custom_llm)

# Use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="Research topic",
    llms={"my_gpt4": custom_llm},  # Or use registered name
    provider_name="my_gpt4",
    programmatic_mode=True,
)
```
### Factory Functions
You can also register factory functions:
```python
def create_my_llm(temperature=0.7):
    return ChatOpenAI(model="gpt-4", temperature=temperature)

register_llm("my_factory", create_my_llm)
# Will be called when needed
llm = get_llm_from_registry("my_factory")
```
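One reason to register a factory rather than an instance is to defer construction until the LLM is first requested. The sketch below illustrates that idea with stand-ins so it runs offline. `LazyRegistry`, `register_factory`, and `FakeLLM` are hypothetical and do not reflect how `register_llm` or `get_llm_from_registry` are implemented.

```python
from functools import partial
from typing import Any, Callable, Dict


class FakeLLM:
    """Stand-in for a chat model so the example runs without network access."""

    instances = 0  # counts constructions, to show when the factory actually runs

    def __init__(self, model: str, temperature: float = 0.7) -> None:
        FakeLLM.instances += 1
        self.model = model
        self.temperature = temperature


class LazyRegistry:
    """Hypothetical registry that stores zero-argument factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}

    def register_factory(self, name: str, factory: Callable[[], Any]) -> None:
        self._factories[name] = factory

    def get(self, name: str) -> Any:
        # The factory runs here, at lookup time, not at registration time.
        return self._factories[name]()


registry = LazyRegistry()
# functools.partial pre-binds arguments, giving a zero-argument factory.
registry.register_factory("my_factory", partial(FakeLLM, model="gpt-4", temperature=0.2))
created_before_get = FakeLLM.instances
llm = registry.get("my_factory")
```

Binding arguments with `functools.partial` keeps the factory itself free of configuration lookups, which can be convenient when the same model is registered under several temperature presets.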
---
## See Also
- [Architecture Overview](../architecture/OVERVIEW.md) - System architecture
- [Database Schema](../architecture/DATABASE_SCHEMA.md) - Data models
- [Full Configuration Reference](../CONFIGURATION.md) - All settings and environment variables
- [Troubleshooting](../troubleshooting.md) - Common issues
- [API Quickstart](../api-quickstart.md) - Using the API