mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-16 03:51:07 +03:00

Files

LearningCircuit 061cd83dd4 feat: add is_lexical flag to auto-enable LLM relevance filtering for keyword-based engines (#3403 )

* feat: add needs_reranking flag to auto-enable LLM relevance filtering for keyword-based engines

Engines with poor native relevance ranking (arXiv, PubMed, Wikipedia,
GitHub, Mojeek, etc.) now auto-enable LLM-based result filtering via
a new `needs_reranking` class attribute. This fixes the priority bug
where the global `skip_relevance_filter=True` incorrectly overrode
auto-detection for engines that genuinely need filtering.

Priority is now: per-engine setting > needs_reranking > global skip.
The global skip only affects unclassified engines.

Closes #2297

* fix: address 7 code-review issues on needs_reranking branch

1. Rename needs_reranking → needs_llm_relevance_filter for consistency
   with enable_llm_relevance_filter and skip_relevance_filter naming
2. Fix Paperless dead code: replace non-existent _apply_content_filters
   with proper _filter_for_relevance() call in custom run() override
3. Fix misleading skip_relevance_filter description to accurately
   reflect checkbox behavior and keyword engine exceptions
4. Delete 4 vacuously-true inline tests that duplicated factory logic
   instead of calling the real factory (coverage tests already exist)
5. Add needs_llm_relevance_filter to EXTENDING.md and OVERVIEW.md
6. Clarify is_generic comment: generic does not imply good ranking
7. Upgrade no-LLM log from debug to warning when filtering was
   requested but no LLM is available (with should_filter guard)

* fix: remove Paperless fallback that overrode valid empty LLM filter results

Replace the fallback that restored all previews when the LLM filter
returned empty with an info log. The base class _filter_for_relevance()
already handles errors internally (returns previews[:5] on exception
or JSON parse failure). An empty result means the LLM legitimately
found nothing relevant — trust it, don't override it.

* refactor: rename needs_llm_relevance_filter → is_lexical

The flag describes what the engine IS (lexical/keyword-based search)
rather than what it needs. This is a general classification that can
drive multiple behaviors beyond just the relevance filter — e.g.
query optimization strategies, result deduplication, or UI hints.
Matches the existing is_* naming pattern (is_scientific, is_generic).

* Revert "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit c322d478a1.

* Reapply "refactor: rename needs_llm_relevance_filter → is_lexical"

This reverts commit 853dfe90bd.

* feat: add is_lexical classification flag alongside needs_llm_relevance_filter

Separates classification from behavior:
- is_lexical: informational flag indicating the engine uses keyword/lexical
  search. Reusable for query optimization, UI hints, deduplication, etc.
- needs_llm_relevance_filter: behavioral flag that the factory reads to
  auto-enable LLM relevance filtering on the engine instance.

Both flags are set on all 15 keyword-based engines. The factory only
checks needs_llm_relevance_filter for filtering decisions.

* fix: improve relevance filter error handling and logging

- Return [] on all error paths instead of hiding failures behind
  previews[:5] fallback — failures should be visible, not masked
- Log errors at error level (not warning) for LLM parse failures
- Add engine name prefix to all log messages for traceability
- Add token estimate debug log to help diagnose context overflow
- Reduce log noise: routine operations are debug, only summary is info
- Consolidate validation into single check

* fix: address PR review findings for relevance filter

- Fix literal \n in EXTENDING.md code block
- Remove 'Maximum results to return' from LLM prompt (LLM decides)
- Add INPUT/KEPT/REMOVED debug logging for filter quality analysis
- Add is_lexical + needs_llm_relevance_filter to ElasticsearchSearchEngine
- Delete vacuously-true test_missing_llm_returns_none test
- Downgrade no-op skip_relevance_filter log from info to debug

* refactor: extract relevance filter into dedicated module

Pull the inline _filter_for_relevance() logic out of BaseSearchEngine
into a new web_search_engines/relevance_filter.py module.

- Use with_structured_output() with Pydantic schema; let LangChain
  pick the per-provider default method (JSON schema on Ollama,
  tool-calling on Anthropic, responseSchema on Gemini).
- Trim prompt: drop URLs, cap snippets at 200 chars.
- Suppress reasoning on Ollama thinking-by-default models via
  reasoning=False — saves 30-60s per call on qwen3 dense variants.
- Treat empty LLM responses as valid judgments; log a warning on
  batches >2 so users notice a misbehaving model.
- On exception or parse failure, return first N previews (cap=5 or
  max_filtered_results) to avoid overwhelming downstream.

* refactor(relevance_filter): cleanup + add direct tests

* feat(relevance_filter): batch previews in parallel for speed and reliability

Adds two tunable parameters to the LLM relevance filter:

- batch_size: split previews into chunks before sending to the LLM.
  Each batch uses local indices [0..batch_size-1] mapped back to
  global. Default 10. Smaller batches are faster per call AND more
  reliable on weaker models that struggle with many indices in one
  context.

- max_parallel_batches: dispatch batches concurrently via a
  ThreadPoolExecutor. Default 4. Result order is preserved across
  parallel batches.

Both exposed as BaseSearchEngine class attributes
(relevance_filter_batch_size, relevance_filter_max_parallel_batches)
so individual engines can override.

Failure semantics:
- Hard exception on any batch -> capped slice fallback (unchanged).
- Parse failure on a single batch -> skip that batch only, keep
  results from successful batches.

Adds 4 direct unit tests covering chunk/index mapping, batch_size=None
single-call mode, failed-batch-skip-keeps-others, and parallel dispatch
order preservation. All 120 tests pass.

* refactor(relevance_filter): drop structured output, parse plain text

The Pydantic with_structured_output() path had several issues:
- qwen3 dense models returned prose instead of JSON, raising
  OutputParserException and disabling the filter for that call
- grammar-constrained output on Ollama was 6-10x slower than plain
  text generation (~24s vs ~4s for 50 previews)
- per-provider quirks (function_calling latency, schema bikeshedding)

Switch to plain llm.invoke() and parse integers from the response with
a tightened regex (word-boundary, no decimal fractions). The prompt
now instructs the model to output ONLY the indices, which combined
with the regex is robust against prose-injection of small numbers.

Removes RelevanceResult Pydantic class, _invoke_structured, the
_BATCH_FAILED_PARSE sentinel, and the "all batches failed" branch
(all dead under the new contract). Updates tests to mock llm.invoke
directly. Tightens default batch_size to 5 and parallel batches to 10
based on benchmark runs against Ollama.

* docs: fix stale _filter_for_relevance docstring after text-parsing rewrite

2026-04-06 23:04:47 +02:00

18 KiB

Raw Blame History

Extension Guide

This guide explains how to extend Local Deep Research with custom components.

Adding Custom Search Engines
Adding Custom Search Strategies
Using LangChain Retrievers
Adding Custom LLM Providers
Registering Custom LLMs

Adding Custom Search Engines

Search engines are responsible for fetching results from external sources. All engines extend BaseSearchEngine.

Basic Search Engine

Create a new file in src/local_deep_research/web_search_engines/engines/:

# search_engine_custom.py
from typing import Any, Dict, List, Optional

from langchain_core.language_models import BaseLLM
from loguru import logger

from ..search_engine_base import BaseSearchEngine


class CustomSearchEngine(BaseSearchEngine):
    """Custom search engine implementation."""

    # Classification flags - set appropriately for your engine
    is_public = True       # Searches public internet
    is_generic = False     # Specialized (vs general web search)
    is_scientific = False  # Academic/scientific content
    is_local = False       # Local document search
    is_news = False        # News content
    is_code = False        # Code repositories
    is_lexical = False                  # Uses keyword/lexical search (informational)
    needs_llm_relevance_filter = False  # Set True to auto-enable LLM relevance filtering

    def __init__(
        self,
        max_results: int = 10,
        credential: Optional[str] = None,
        llm: Optional[BaseLLM] = None,
        max_filtered_results: Optional[int] = None,
        **kwargs,
    ):
        """
        Initialize the search engine.

        Args:
            max_results: Maximum number of results to return
            credential: API credential for the service (if required)
            llm: Language model for relevance filtering
            max_filtered_results: Max results after filtering
            **kwargs: Additional parameters
        """
        super().__init__(
            llm=llm,
            max_filtered_results=max_filtered_results,
            max_results=max_results,
        )
        self.credential = credential

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        """
        Get preview results (first phase of two-phase retrieval).

        Args:
            query: Search query

        Returns:
            List of preview dictionaries with keys:
            - id: Unique identifier
            - title: Result title
            - snippet: Brief description/summary
            - link: URL to the content
            - source: Source name (e.g., "CustomEngine")
        """
        logger.info(f"Searching custom engine for: {query}")

        # Apply rate limiting before request
        self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

        # Your search implementation here
        results = self._call_api(query)

        previews = []
        for item in results:
            previews.append({
                "id": item["id"],
                "title": item["title"],
                "snippet": item["description"],
                "link": item["url"],
                "source": "CustomEngine",
            })

        return previews

    def _get_full_content(
        self, relevant_items: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Get full content for relevant items (second phase).

        Args:
            relevant_items: Items that passed relevance filtering

        Returns:
            Items enriched with full content
        """
        results = []
        for item in relevant_items:
            # Apply rate limiting
            self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

            # Fetch full content
            full_content = self._fetch_content(item["link"])

            result = item.copy()
            result["content"] = full_content
            result["full_content"] = full_content
            results.append(result)

        return results

    def _call_api(self, query: str) -> List[Dict]:
        """Your API implementation."""
        # Implement your search logic here
        pass

    def _fetch_content(self, url: str) -> str:
        """Fetch full content from URL."""
        # Implement content fetching
        pass

Registering the Engine

Option 1: Register in engine_registry.py (Required)

Add the engine to src/local_deep_research/web_search_engines/engine_registry.py so the system knows how to load it. The registry maps engine names to their Python module and class:

# In engine_registry.py — ENGINE_REGISTRY dict
"custom_engine": EngineEntry(
    module_path=".engines.search_engine_custom",
    class_name="CustomSearchEngine",
),

Module paths must be relative (starting with .) and listed in the security whitelist (ALLOWED_MODULE_PATHS in module_whitelist.py).

Option 1b: Configure user-facing settings (Optional)

After registering in the engine registry, you can expose user-configurable settings via the settings database:

# Key: search.engine.web.custom_engine
config = {
    "requires_api_key": True,
    "requires_llm": False,
    "description": "Custom search engine for specific use case",
    "strengths": ["Feature 1", "Feature 2"],
    "weaknesses": ["Limitation 1"],
    "reliability": 0.8,
    "default_params": {
        "max_results": 10
    }
}

Option 2: Modify Factory (For Core Engines)

Add to search_engine_factory.py:

def create_search_engine(engine_name: str, ...) -> BaseSearchEngine:
    # ... existing code ...

    if engine_name.lower() == "custom_engine":
        from .engines.search_engine_custom import CustomSearchEngine
        return CustomSearchEngine(
            max_results=max_results,
            api_key=api_key,
            llm=llm,
            **kwargs
        )

Search Engine Best Practices

Always apply rate limiting before API calls:

self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

Set classification flags accurately - they affect engine selection. For keyword-based engines without ML ranking, set is_lexical = True and needs_llm_relevance_filter = True — the factory will auto-enable LLM relevance filtering
Handle errors gracefully - return empty list on failure, don't crash

Use logging for debugging:

from loguru import logger
logger.info(f"Searching for: {query}")
logger.error(f"API error: {e}")

Support snippet-only mode by checking the config:

from ...config import search_config
if search_config.SEARCH_SNIPPETS_ONLY:
    return relevant_items  # Skip full content

Adding Custom Search Strategies

Strategies define how research is conducted - question generation, iteration, and synthesis.

Basic Strategy

Create a new file in src/local_deep_research/advanced_search_system/strategies/:

# my_custom_strategy.py
from typing import Dict, List, Optional
from loguru import logger

from .base_strategy import BaseSearchStrategy


class MyCustomStrategy(BaseSearchStrategy):
    """Custom search strategy implementation."""

    def __init__(
        self,
        search=None,
        model=None,
        all_links_of_system=None,
        settings_snapshot=None,
        max_iterations: int = 3,
        **kwargs,
    ):
        """
        Initialize the strategy.

        Args:
            search: Search engine instance
            model: LLM for question generation and synthesis
            all_links_of_system: Shared list for discovered links
            settings_snapshot: Configuration snapshot
            max_iterations: Maximum research iterations
            **kwargs: Additional parameters
        """
        super().__init__(
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
        )
        self.search = search
        self.model = model
        self.max_iterations = max_iterations

    def analyze_topic(self, query: str) -> Dict:
        """
        Execute the research strategy.

        Args:
            query: Research query

        Returns:
            Dict with:
            - findings: List of research findings
            - iterations: Number of iterations completed
            - questions: Dict of questions by iteration
            - formatted_findings: Formatted output string
            - current_knowledge: Accumulated knowledge dict
            - error: Optional error message
        """
        logger.info(f"Starting custom strategy for: {query}")

        findings = []
        current_knowledge = {}

        try:
            for iteration in range(1, self.max_iterations + 1):
                # Update progress
                self._update_progress(
                    f"Iteration {iteration}/{self.max_iterations}",
                    progress_percent=int(iteration / self.max_iterations * 100),
                    metadata={"iteration": iteration}
                )

                # Generate questions for this iteration
                questions = self._generate_questions(query, current_knowledge)
                self.questions_by_iteration[iteration] = questions

                # Search for each question
                for question in questions:
                    results = self._search(question)
                    findings.extend(results)

                    # Track links
                    for result in results:
                        if result.get("link"):
                            self.all_links_of_system.append(result["link"])

                # Synthesize findings
                current_knowledge = self._synthesize(findings)

                # Check if we should stop early
                if self._should_stop(current_knowledge):
                    logger.info(f"Early stopping at iteration {iteration}")
                    break

            # Format final output
            formatted = self._format_findings(findings, current_knowledge)

            return {
                "findings": findings,
                "iterations": iteration,
                "questions": self.questions_by_iteration,
                "formatted_findings": formatted,
                "current_knowledge": current_knowledge,
            }

        except Exception as e:
            logger.error(f"Strategy error: {e}")
            return {
                "findings": findings,
                "iterations": 0,
                "questions": self.questions_by_iteration,
                "formatted_findings": "",
                "current_knowledge": current_knowledge,
                "error": str(e),
            }

    def _generate_questions(self, query: str, knowledge: Dict) -> List[str]:
        """Generate research questions using the LLM."""
        prompt = f"""Given the query: {query}
        And current knowledge: {knowledge}
        Generate 3 specific research questions."""

        response = self.model.invoke(prompt)
        # Parse response into questions
        return self._parse_questions(response.content)

    def _search(self, question: str) -> List[Dict]:
        """Execute search for a question."""
        return self.search.run(question)

    def _synthesize(self, findings: List[Dict]) -> Dict:
        """Synthesize findings into knowledge."""
        # Implement synthesis logic
        return {"summary": "...", "key_points": [...]}

    def _should_stop(self, knowledge: Dict) -> bool:
        """Check if research should stop early."""
        # Implement stopping criteria
        return False

    def _format_findings(self, findings: List[Dict], knowledge: Dict) -> str:
        """Format findings as output string."""
        # Implement formatting
        return "Formatted research results..."

    def _parse_questions(self, content: str) -> List[str]:
        """Parse LLM response into question list."""
        # Implement parsing
        return content.strip().split("\n")

Registering the Strategy

Add to search_system_factory.py:

def create_strategy(strategy_name: str, ...) -> BaseSearchStrategy:
    strategy_name_lower = strategy_name.lower()

    # ... existing strategies ...

    elif strategy_name_lower in ["my-custom", "mycustom", "custom"]:
        from .advanced_search_system.strategies.my_custom_strategy import (
            MyCustomStrategy,
        )
        return MyCustomStrategy(
            search=search,
            model=model,
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
            **kwargs
        )

Strategy Best Practices

Use progress callbacks to update the UI:

self._update_progress("Searching...", progress_percent=50)

Track all discovered links in self.all_links_of_system
Store questions by iteration in self.questions_by_iteration

Access settings via the snapshot:

max_results = self.get_setting("search.max_results", default=10)

Handle errors gracefully - return partial results with error message

Using LangChain Retrievers

The easiest way to add custom search is through LangChain retrievers.

Registering a Retriever

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Create your retriever
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Register globally
retriever_registry.register("my_documents", retriever)

# Now use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="What does the documentation say about X?",
    search_tool="my_documents",  # Use registered retriever
    programmatic_mode=True
)

Passing Retrievers Directly

from local_deep_research.api import quick_summary

# Create retriever
retriever = my_vectorstore.as_retriever()

# Pass directly to API
result = quick_summary(
    query="Search my documents",
    retrievers={"private_docs": retriever},
    search_tool="private_docs",
    programmatic_mode=True
)

Registry Methods

from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Register
retriever_registry.register("name", retriever)
retriever_registry.register_multiple({"a": ret1, "b": ret2})

# Query
retriever_registry.get("name")
retriever_registry.is_registered("name")
retriever_registry.list_registered()

# Remove
retriever_registry.unregister("name")
retriever_registry.clear()

Adding Custom LLM Providers

LLM providers wrap language model APIs for use in LDR.

Basic Provider

Create in src/local_deep_research/llm/providers/implementations/:

# my_provider.py
from typing import Any, Dict, Optional

from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI

from ..openai_compatible import OpenAICompatibleProvider


class MyProvider(OpenAICompatibleProvider):
    """Custom LLM provider."""

    provider_name = "My Provider"
    api_key_setting = "llm.my_provider.api_key"
    url_setting = "llm.my_provider.url"
    default_base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-v1"

    @classmethod
    def create_llm(
        cls,
        model_name: Optional[str] = None,
        temperature: float = 0.7,
        settings_snapshot: Optional[Dict] = None,
        **kwargs
    ) -> BaseChatModel:
        """
        Create LLM instance.

        Args:
            model_name: Model to use
            temperature: Sampling temperature
            settings_snapshot: Configuration
            **kwargs: Additional parameters

        Returns:
            LangChain chat model instance
        """
        settings_snapshot = settings_snapshot or {}

        # Get API key from settings
        api_key = cls._get_setting(settings_snapshot, cls.api_key_setting)
        if not api_key:
            raise ValueError(f"API key not found in {cls.api_key_setting}")

        # Get base URL
        base_url = cls._get_setting(
            settings_snapshot, cls.url_setting, cls.default_base_url
        )

        return ChatOpenAI(
            model=model_name or cls.default_model,
            temperature=temperature,
            api_key=api_key,
            base_url=base_url,
            **kwargs
        )

    @classmethod
    def list_models(cls, settings_snapshot: Optional[Dict] = None) -> list[str]:
        """List available models."""
        return ["my-model-v1", "my-model-v2", "my-model-large"]

Register in Auto-Discovery

Add to auto_discovery.py:

PROVIDER_METADATA = {
    # ... existing providers ...
    "my_provider": ProviderMetadata(
        provider_id="my_provider",
        provider_name="My Provider",
        company_name="My Company",
        region="US",
        country="United States",
        data_location="US",
        gdpr_compliant=False,
        is_cloud=True,
    ),
}

Registering Custom LLMs

For programmatic use, register LLMs directly:

from langchain_openai import ChatOpenAI
from local_deep_research.llm.llm_registry import register_llm, get_llm_from_registry

# Create custom LLM
custom_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.5,
    api_key="...",
)

# Register it
register_llm("my_gpt4", custom_llm)

# Use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="Research topic",
    llms={"my_gpt4": custom_llm},  # Or use registered name
    provider_name="my_gpt4",
    programmatic_mode=True
)

Factory Functions

You can also register factory functions:

def create_my_llm(temperature=0.7):
    return ChatOpenAI(model="gpt-4", temperature=temperature)

register_llm("my_factory", create_my_llm)

# Will be called when needed
llm = get_llm_from_registry("my_factory")

18 KiB

Raw Blame History

Extension Guide

Table of Contents

Adding Custom Search Engines

Basic Search Engine

Registering the Engine

Search Engine Best Practices

Adding Custom Search Strategies

Basic Strategy

Registering the Strategy

Strategy Best Practices

Using LangChain Retrievers

Registering a Retriever

Passing Retrievers Directly

Registry Methods

Adding Custom LLM Providers

Basic Provider

Register in Auto-Discovery

Registering Custom LLMs

Factory Functions

See Also

18 KiB Raw Blame History

Extension Guide

Table of Contents

Adding Custom Search Engines

Basic Search Engine

Registering the Engine

Search Engine Best Practices

Adding Custom Search Strategies

Basic Strategy

Registering the Strategy

Strategy Best Practices

Using LangChain Retrievers

Registering a Retriever

Passing Retrievers Directly

Registry Methods

Adding Custom LLM Providers

Basic Provider

Register in Auto-Discovery

Registering Custom LLMs

Factory Functions

See Also

18 KiB

Raw Blame History