local-deep-research/docs/developing/EXTENDING.md

# Extension Guide

This guide explains how to extend Local Deep Research with custom components.

## Table of Contents

- [Adding Custom Search Engines](#adding-custom-search-engines)
- [Adding Custom Search Strategies](#adding-custom-search-strategies)
- [Using LangChain Retrievers](#using-langchain-retrievers)
- [Adding Custom LLM Providers](#adding-custom-llm-providers)
- [Registering Custom LLMs](#registering-custom-llms)

---

## Adding Custom Search Engines

Search engines are responsible for fetching results from external sources. All engines extend `BaseSearchEngine`.

### Basic Search Engine

Create a new file in `src/local_deep_research/web_search_engines/engines/`:

```python
# search_engine_custom.py
from typing import Any, Dict, List, Optional

from langchain_core.language_models import BaseLLM
from loguru import logger

from ..search_engine_base import BaseSearchEngine


class CustomSearchEngine(BaseSearchEngine):
    """Custom search engine implementation."""

    # Classification flags - set appropriately for your engine
    is_public = True       # Searches public internet
    is_generic = False     # Specialized (vs general web search)
    is_scientific = False  # Academic/scientific content
    is_local = False       # Local document search
    is_news = False        # News content
    is_code = False        # Code repositories
    is_lexical = False                  # Uses keyword/lexical search (informational)
    needs_llm_relevance_filter = False  # Set True to auto-enable LLM relevance filtering

    def __init__(
        self,
        max_results: int = 10,
        credential: Optional[str] = None,
        llm: Optional[BaseLLM] = None,
        max_filtered_results: Optional[int] = None,
        **kwargs,
    ):
        """
        Initialize the search engine.

        Args:
            max_results: Maximum number of results to return
            credential: API credential for the service (if required)
            llm: Language model for relevance filtering
            max_filtered_results: Max results after filtering
            **kwargs: Additional parameters
        """
        super().__init__(
            llm=llm,
            max_filtered_results=max_filtered_results,
            max_results=max_results,
        )
        self.credential = credential

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        """
        Get preview results (first phase of two-phase retrieval).

        Args:
            query: Search query

        Returns:
            List of preview dictionaries with keys:
            - id: Unique identifier
            - title: Result title
            - snippet: Brief description/summary
            - link: URL to the content
            - source: Source name (e.g., "CustomEngine")
        """
        logger.info(f"Searching custom engine for: {query}")

        # Apply rate limiting before request
        self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

        # Your search implementation here
        results = self._call_api(query)

        previews = []
        for item in results:
            previews.append({
                "id": item["id"],
                "title": item["title"],
                "snippet": item["description"],
                "link": item["url"],
                "source": "CustomEngine",
            })

        return previews

    def _get_full_content(
        self, relevant_items: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Get full content for relevant items (second phase).

        Args:
            relevant_items: Items that passed relevance filtering

        Returns:
            Items enriched with full content
        """
        results = []
        for item in relevant_items:
            # Apply rate limiting
            self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

            # Fetch full content
            full_content = self._fetch_content(item["link"])

            result = item.copy()
            result["content"] = full_content
            result["full_content"] = full_content
            results.append(result)

        return results

    def _call_api(self, query: str) -> List[Dict]:
        """Your API implementation."""
        # Implement your search logic here
        pass

    def _fetch_content(self, url: str) -> str:
        """Fetch full content from URL."""
        # Implement content fetching
        pass
```

### Registering the Engine

**Option 1: Register in engine_registry.py (Required)**

Add the engine to `src/local_deep_research/web_search_engines/engine_registry.py` so the system knows how to load it. The registry maps engine names to their Python module and class:

```python
# In engine_registry.py — ENGINE_REGISTRY dict
"custom_engine": EngineEntry(
    module_path=".engines.search_engine_custom",
    class_name="CustomSearchEngine",
),
```

Module paths must be relative (starting with `.`) and listed in the security whitelist (`ALLOWED_MODULE_PATHS` in `module_whitelist.py`).

**Option 1b: Configure user-facing settings (Optional)**

After registering in the engine registry, you can expose user-configurable settings via the settings database:

```python
# Key: search.engine.web.custom_engine
config = {
    "requires_api_key": True,
    "requires_llm": False,
    "description": "Custom search engine for specific use case",
    "strengths": ["Feature 1", "Feature 2"],
    "weaknesses": ["Limitation 1"],
    "reliability": 0.8,
    "default_params": {
        "max_results": 10
    }
}
```

**Option 2: Modify Factory (For Core Engines)**

Add to `search_engine_factory.py`:

```python
def create_search_engine(engine_name: str, ...) -> BaseSearchEngine:
    # ... existing code ...

    if engine_name.lower() == "custom_engine":
        from .engines.search_engine_custom import CustomSearchEngine
        return CustomSearchEngine(
            max_results=max_results,
            api_key=api_key,
            llm=llm,
            **kwargs
        )
```

### Search Engine Best Practices

1. **Always apply rate limiting** before API calls:
   ```python
   self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
   ```

2. **Set classification flags** accurately - they affect engine selection. For keyword-based engines without ML ranking, set `is_lexical = True` and `needs_llm_relevance_filter = True` — the factory will auto-enable LLM relevance filtering

3. **Handle errors gracefully** - return empty list on failure, don't crash

4. **Use logging** for debugging:
   ```python
   from loguru import logger
   logger.info(f"Searching for: {query}")
   logger.error(f"API error: {e}")
   ```

5. **Support snippet-only mode** by checking the config:
   ```python
   from ...config import search_config
   if search_config.SEARCH_SNIPPETS_ONLY:
       return relevant_items  # Skip full content
   ```

---

## Adding Custom Search Strategies

Strategies define how research is conducted - question generation, iteration, and synthesis.

### Basic Strategy

Create a new file in `src/local_deep_research/advanced_search_system/strategies/`:

```python
# my_custom_strategy.py
from typing import Dict, List, Optional
from loguru import logger

from .base_strategy import BaseSearchStrategy


class MyCustomStrategy(BaseSearchStrategy):
    """Custom search strategy implementation."""

    def __init__(
        self,
        search=None,
        model=None,
        all_links_of_system=None,
        settings_snapshot=None,
        max_iterations: int = 3,
        **kwargs,
    ):
        """
        Initialize the strategy.

        Args:
            search: Search engine instance
            model: LLM for question generation and synthesis
            all_links_of_system: Shared list for discovered links
            settings_snapshot: Configuration snapshot
            max_iterations: Maximum research iterations
            **kwargs: Additional parameters
        """
        super().__init__(
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
        )
        self.search = search
        self.model = model
        self.max_iterations = max_iterations

    def analyze_topic(self, query: str) -> Dict:
        """
        Execute the research strategy.

        Args:
            query: Research query

        Returns:
            Dict with:
            - findings: List of research findings
            - iterations: Number of iterations completed
            - questions: Dict of questions by iteration
            - formatted_findings: Formatted output string
            - current_knowledge: Accumulated knowledge dict
            - error: Optional error message
        """
        logger.info(f"Starting custom strategy for: {query}")

        findings = []
        current_knowledge = {}

        try:
            for iteration in range(1, self.max_iterations + 1):
                # Update progress
                self._update_progress(
                    f"Iteration {iteration}/{self.max_iterations}",
                    progress_percent=int(iteration / self.max_iterations * 100),
                    metadata={"iteration": iteration}
                )

                # Generate questions for this iteration
                questions = self._generate_questions(query, current_knowledge)
                self.questions_by_iteration[iteration] = questions

                # Search for each question
                for question in questions:
                    results = self._search(question)
                    findings.extend(results)

                    # Track links
                    for result in results:
                        if result.get("link"):
                            self.all_links_of_system.append(result["link"])

                # Synthesize findings
                current_knowledge = self._synthesize(findings)

                # Check if we should stop early
                if self._should_stop(current_knowledge):
                    logger.info(f"Early stopping at iteration {iteration}")
                    break

            # Format final output
            formatted = self._format_findings(findings, current_knowledge)

            return {
                "findings": findings,
                "iterations": iteration,
                "questions": self.questions_by_iteration,
                "formatted_findings": formatted,
                "current_knowledge": current_knowledge,
            }

        except Exception as e:
            logger.error(f"Strategy error: {e}")
            return {
                "findings": findings,
                "iterations": 0,
                "questions": self.questions_by_iteration,
                "formatted_findings": "",
                "current_knowledge": current_knowledge,
                "error": str(e),
            }

    def _generate_questions(self, query: str, knowledge: Dict) -> List[str]:
        """Generate research questions using the LLM."""
        prompt = f"""Given the query: {query}
        And current knowledge: {knowledge}
        Generate 3 specific research questions."""

        response = self.model.invoke(prompt)
        # Parse response into questions
        return self._parse_questions(response.content)

    def _search(self, question: str) -> List[Dict]:
        """Execute search for a question."""
        return self.search.run(question)

    def _synthesize(self, findings: List[Dict]) -> Dict:
        """Synthesize findings into knowledge."""
        # Implement synthesis logic
        return {"summary": "...", "key_points": [...]}

    def _should_stop(self, knowledge: Dict) -> bool:
        """Check if research should stop early."""
        # Implement stopping criteria
        return False

    def _format_findings(self, findings: List[Dict], knowledge: Dict) -> str:
        """Format findings as output string."""
        # Implement formatting
        return "Formatted research results..."

    def _parse_questions(self, content: str) -> List[str]:
        """Parse LLM response into question list."""
        # Implement parsing
        return content.strip().split("\n")
```

### Registering the Strategy

Add to `search_system_factory.py`:

```python
def create_strategy(strategy_name: str, ...) -> BaseSearchStrategy:
    strategy_name_lower = strategy_name.lower()

    # ... existing strategies ...

    elif strategy_name_lower in ["my-custom", "mycustom", "custom"]:
        from .advanced_search_system.strategies.my_custom_strategy import (
            MyCustomStrategy,
        )
        return MyCustomStrategy(
            search=search,
            model=model,
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
            **kwargs
        )
```

### Strategy Best Practices

1. **Use progress callbacks** to update the UI:
   ```python
   self._update_progress("Searching...", progress_percent=50)
   ```

2. **Track all discovered links** in `self.all_links_of_system`

3. **Store questions by iteration** in `self.questions_by_iteration`

4. **Access settings** via the snapshot:
   ```python
   max_results = self.get_setting("search.max_results", default=10)
   ```

5. **Handle errors gracefully** - return partial results with error message

---

## Using LangChain Retrievers

The easiest way to add custom search is through LangChain retrievers.

### Registering a Retriever

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Create your retriever
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Register globally
retriever_registry.register("my_documents", retriever)

# Now use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="What does the documentation say about X?",
    search_tool="my_documents",  # Use registered retriever
    programmatic_mode=True
)
```

### Passing Retrievers Directly

```python
from local_deep_research.api import quick_summary

# Create retriever
retriever = my_vectorstore.as_retriever()

# Pass directly to API
result = quick_summary(
    query="Search my documents",
    retrievers={"private_docs": retriever},
    search_tool="private_docs",
    programmatic_mode=True
)
```

### Registry Methods

```python
from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Register
retriever_registry.register("name", retriever)
retriever_registry.register_multiple({"a": ret1, "b": ret2})

# Query
retriever_registry.get("name")
retriever_registry.is_registered("name")
retriever_registry.list_registered()

# Remove
retriever_registry.unregister("name")
retriever_registry.clear()
```

---

## Adding Custom LLM Providers

LLM providers wrap language model APIs for use in LDR.

### Basic Provider

Create in `src/local_deep_research/llm/providers/implementations/`:

```python
# my_provider.py
from typing import Any, Dict, Optional

from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI

from ..openai_compatible import OpenAICompatibleProvider


class MyProvider(OpenAICompatibleProvider):
    """Custom LLM provider."""

    provider_name = "My Provider"
    api_key_setting = "llm.my_provider.api_key"
    url_setting = "llm.my_provider.url"
    default_base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-v1"

    @classmethod
    def create_llm(
        cls,
        model_name: Optional[str] = None,
        temperature: float = 0.7,
        settings_snapshot: Optional[Dict] = None,
        **kwargs
    ) -> BaseChatModel:
        """
        Create LLM instance.

        Args:
            model_name: Model to use
            temperature: Sampling temperature
            settings_snapshot: Configuration
            **kwargs: Additional parameters

        Returns:
            LangChain chat model instance
        """
        settings_snapshot = settings_snapshot or {}

        # Get API key from settings
        api_key = cls._get_setting(settings_snapshot, cls.api_key_setting)
        if not api_key:
            raise ValueError(f"API key not found in {cls.api_key_setting}")

        # Get base URL
        base_url = cls._get_setting(
            settings_snapshot, cls.url_setting, cls.default_base_url
        )

        return ChatOpenAI(
            model=model_name or cls.default_model,
            temperature=temperature,
            api_key=api_key,
            base_url=base_url,
            **kwargs
        )

    @classmethod
    def list_models(cls, settings_snapshot: Optional[Dict] = None) -> list[str]:
        """List available models."""
        return ["my-model-v1", "my-model-v2", "my-model-large"]
```

### Register in Auto-Discovery

Add to `auto_discovery.py`:

```python
PROVIDER_METADATA = {
    # ... existing providers ...
    "my_provider": ProviderMetadata(
        provider_id="my_provider",
        provider_name="My Provider",
        company_name="My Company",
        region="US",
        country="United States",
        data_location="US",
        gdpr_compliant=False,
        is_cloud=True,
    ),
}
```

---

## Registering Custom LLMs

For programmatic use, register LLMs directly:

```python
from langchain_openai import ChatOpenAI
from local_deep_research.llm.llm_registry import register_llm, get_llm_from_registry

# Create custom LLM
custom_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.5,
    api_key="...",
)

# Register it
register_llm("my_gpt4", custom_llm)

# Use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="Research topic",
    llms={"my_gpt4": custom_llm},  # Or use registered name
    provider_name="my_gpt4",
    programmatic_mode=True
)
```

### Factory Functions

You can also register factory functions:

```python
def create_my_llm(temperature=0.7):
    return ChatOpenAI(model="gpt-4", temperature=temperature)

register_llm("my_factory", create_my_llm)

# Will be called when needed
llm = get_llm_from_registry("my_factory")
```

---

## See Also

- [Architecture Overview](../architecture/OVERVIEW.md) - System architecture
- [Database Schema](../architecture/DATABASE_SCHEMA.md) - Data models
- [Full Configuration Reference](../CONFIGURATION.md) - All settings and environment variables
- [Troubleshooting](../troubleshooting.md) - Common issues
- [API Quickstart](../api-quickstart.md) - Using the API