# Extension Guide

This guide explains how to extend Local Deep Research with custom components.

## Table of Contents

- [Adding Custom Search Engines](#adding-custom-search-engines)
- [Adding Custom Search Strategies](#adding-custom-search-strategies)
- [Using LangChain Retrievers](#using-langchain-retrievers)
- [Adding Custom LLM Providers](#adding-custom-llm-providers)
- [Registering Custom LLMs](#registering-custom-llms)

---

## Adding Custom Search Engines

Search engines are responsible for fetching results from external sources. All engines extend `BaseSearchEngine`.

### Basic Search Engine

Create a new file in `src/local_deep_research/web_search_engines/engines/`:

```python
# search_engine_custom.py
from typing import Any, Dict, List, Optional

from langchain_core.language_models import BaseLLM
from loguru import logger

from ..search_engine_base import BaseSearchEngine


class CustomSearchEngine(BaseSearchEngine):
    """Custom search engine implementation."""

    # Classification flags - set appropriately for your engine
    is_public = True  # Searches public internet
    is_generic = False  # Specialized (vs general web search)
    is_scientific = False  # Academic/scientific content
    is_local = False  # Local document search
    is_news = False  # News content
    is_code = False  # Code repositories
    is_lexical = False  # Uses keyword/lexical search (informational)
    needs_llm_relevance_filter = False  # Set True to auto-enable LLM relevance filtering

    def __init__(
        self,
        max_results: int = 10,
        credential: Optional[str] = None,
        llm: Optional[BaseLLM] = None,
        max_filtered_results: Optional[int] = None,
        **kwargs,
    ):
        """
        Initialize the search engine.

        Args:
            max_results: Maximum number of results to return
            credential: API credential for the service (if required)
            llm: Language model for relevance filtering
            max_filtered_results: Max results after filtering
            **kwargs: Additional parameters
        """
        super().__init__(
            llm=llm,
            max_filtered_results=max_filtered_results,
            max_results=max_results,
        )
        self.credential = credential

    def _get_previews(self, query: str) -> List[Dict[str, Any]]:
        """
        Get preview results (first phase of two-phase retrieval).

        Args:
            query: Search query

        Returns:
            List of preview dictionaries with keys:
            - id: Unique identifier
            - title: Result title
            - snippet: Brief description/summary
            - link: URL to the content
            - source: Source name (e.g., "CustomEngine")
        """
        logger.info(f"Searching custom engine for: {query}")

        # Apply rate limiting before request
        self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

        # Your search implementation here
        results = self._call_api(query)

        previews = []
        for item in results:
            previews.append({
                "id": item["id"],
                "title": item["title"],
                "snippet": item["description"],
                "link": item["url"],
                "source": "CustomEngine",
            })

        return previews

    def _get_full_content(
        self, relevant_items: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Get full content for relevant items (second phase).

        Args:
            relevant_items: Items that passed relevance filtering

        Returns:
            Items enriched with full content
        """
        results = []
        for item in relevant_items:
            # Apply rate limiting
            self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)

            # Fetch full content
            full_content = self._fetch_content(item["link"])

            result = item.copy()
            result["content"] = full_content
            result["full_content"] = full_content
            results.append(result)

        return results

    def _call_api(self, query: str) -> List[Dict]:
        """Your API implementation."""
        # Implement your search logic here; fail loudly until you do,
        # since returning None would break the loop in _get_previews
        raise NotImplementedError("Implement the API call for your service")

    def _fetch_content(self, url: str) -> str:
        """Fetch full content from URL."""
        # Implement content fetching
        raise NotImplementedError("Implement content fetching for your service")
```

### Registering the Engine

**Option 1: Register in engine_registry.py (Required)**

Add the engine to `src/local_deep_research/web_search_engines/engine_registry.py` so the system knows how to load it. The registry maps engine names to their Python module and class:

```python
# In engine_registry.py, inside the ENGINE_REGISTRY dict
"custom_engine": EngineEntry(
    module_path=".engines.search_engine_custom",
    class_name="CustomSearchEngine",
),
```

Module paths must be relative (starting with `.`) and listed in the security whitelist (`ALLOWED_MODULE_PATHS` in `module_whitelist.py`).

**Option 1b: Configure user-facing settings (Optional)**

After registering in the engine registry, you can expose user-configurable settings via the settings database:

```python
# Key: search.engine.web.custom_engine
config = {
    "requires_api_key": True,
    "requires_llm": False,
    "description": "Custom search engine for specific use case",
    "strengths": ["Feature 1", "Feature 2"],
    "weaknesses": ["Limitation 1"],
    "reliability": 0.8,
    "default_params": {
        "max_results": 10
    }
}
```

**Option 2: Modify Factory (For Core Engines)**

Add to `search_engine_factory.py`:

```python
def create_search_engine(engine_name: str, ...) -> BaseSearchEngine:
    # ... existing code ...

    if engine_name.lower() == "custom_engine":
        from .engines.search_engine_custom import CustomSearchEngine

        return CustomSearchEngine(
            max_results=max_results,
            credential=api_key,  # matches the `credential` constructor parameter above
            llm=llm,
            **kwargs
        )
```

### Search Engine Best Practices

1. **Always apply rate limiting** before API calls:
   ```python
   self._last_wait_time = self.rate_tracker.apply_rate_limit(self.engine_type)
   ```

2. **Set classification flags** accurately - they affect engine selection. For keyword-based engines without ML ranking, set `is_lexical = True` and `needs_llm_relevance_filter = True`; the factory will then auto-enable LLM relevance filtering.

3. **Handle errors gracefully** - return an empty list on failure, don't crash

4. **Use logging** for debugging:
   ```python
   from loguru import logger

   logger.info(f"Searching for: {query}")
   logger.error(f"API error: {e}")
   ```

5. **Support snippet-only mode** by checking the config:
   ```python
   from ...config import search_config

   if search_config.SEARCH_SNIPPETS_ONLY:
       return relevant_items  # Skip full content
   ```

---

## Adding Custom Search Strategies

Strategies define how research is conducted - question generation, iteration, and synthesis.

### Basic Strategy

Create a new file in `src/local_deep_research/advanced_search_system/strategies/`:

```python
# my_custom_strategy.py
from typing import Dict, List, Optional

from loguru import logger

from .base_strategy import BaseSearchStrategy


class MyCustomStrategy(BaseSearchStrategy):
    """Custom search strategy implementation."""

    def __init__(
        self,
        search=None,
        model=None,
        all_links_of_system=None,
        settings_snapshot=None,
        max_iterations: int = 3,
        **kwargs,
    ):
        """
        Initialize the strategy.

        Args:
            search: Search engine instance
            model: LLM for question generation and synthesis
            all_links_of_system: Shared list for discovered links
            settings_snapshot: Configuration snapshot
            max_iterations: Maximum research iterations
            **kwargs: Additional parameters
        """
        super().__init__(
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
        )
        self.search = search
        self.model = model
        self.max_iterations = max_iterations

    def analyze_topic(self, query: str) -> Dict:
        """
        Execute the research strategy.

        Args:
            query: Research query

        Returns:
            Dict with:
            - findings: List of research findings
            - iterations: Number of iterations completed
            - questions: Dict of questions by iteration
            - formatted_findings: Formatted output string
            - current_knowledge: Accumulated knowledge dict
            - error: Optional error message
        """
        logger.info(f"Starting custom strategy for: {query}")

        findings = []
        current_knowledge = {}
        iteration = 0  # Stays 0 if the loop below never runs

        try:
            for iteration in range(1, self.max_iterations + 1):
                # Update progress
                self._update_progress(
                    f"Iteration {iteration}/{self.max_iterations}",
                    progress_percent=int(iteration / self.max_iterations * 100),
                    metadata={"iteration": iteration}
                )

                # Generate questions for this iteration
                questions = self._generate_questions(query, current_knowledge)
                self.questions_by_iteration[iteration] = questions

                # Search for each question
                for question in questions:
                    results = self._search(question)
                    findings.extend(results)

                    # Track links
                    for result in results:
                        if result.get("link"):
                            self.all_links_of_system.append(result["link"])

                # Synthesize findings
                current_knowledge = self._synthesize(findings)

                # Check if we should stop early
                if self._should_stop(current_knowledge):
                    logger.info(f"Early stopping at iteration {iteration}")
                    break

            # Format final output
            formatted = self._format_findings(findings, current_knowledge)

            return {
                "findings": findings,
                "iterations": iteration,
                "questions": self.questions_by_iteration,
                "formatted_findings": formatted,
                "current_knowledge": current_knowledge,
            }

        except Exception as e:
            logger.error(f"Strategy error: {e}")
            return {
                "findings": findings,
                "iterations": 0,
                "questions": self.questions_by_iteration,
                "formatted_findings": "",
                "current_knowledge": current_knowledge,
                "error": str(e),
            }

    def _generate_questions(self, query: str, knowledge: Dict) -> List[str]:
        """Generate research questions using the LLM."""
        prompt = f"""Given the query: {query}
        And current knowledge: {knowledge}
        Generate 3 specific research questions."""

        response = self.model.invoke(prompt)
        # Parse response into questions
        return self._parse_questions(response.content)

    def _search(self, question: str) -> List[Dict]:
        """Execute search for a question."""
        return self.search.run(question)

    def _synthesize(self, findings: List[Dict]) -> Dict:
        """Synthesize findings into knowledge."""
        # Implement synthesis logic
        return {"summary": "...", "key_points": [...]}

    def _should_stop(self, knowledge: Dict) -> bool:
        """Check if research should stop early."""
        # Implement stopping criteria
        return False

    def _format_findings(self, findings: List[Dict], knowledge: Dict) -> str:
        """Format findings as output string."""
        # Implement formatting
        return "Formatted research results..."

    def _parse_questions(self, content: str) -> List[str]:
        """Parse LLM response into question list."""
        # Implement parsing; one question per non-empty line
        return [line.strip() for line in content.splitlines() if line.strip()]
```

### Registering the Strategy

Add to `search_system_factory.py`:

```python
def create_strategy(strategy_name: str, ...) -> BaseSearchStrategy:
    strategy_name_lower = strategy_name.lower()

    # ... existing strategies ...

    elif strategy_name_lower in ["my-custom", "mycustom", "custom"]:
        from .advanced_search_system.strategies.my_custom_strategy import (
            MyCustomStrategy,
        )

        return MyCustomStrategy(
            search=search,
            model=model,
            all_links_of_system=all_links_of_system,
            settings_snapshot=settings_snapshot,
            **kwargs
        )
```

### Strategy Best Practices

1. **Use progress callbacks** to update the UI:
   ```python
   self._update_progress("Searching...", progress_percent=50)
   ```

2. **Track all discovered links** in `self.all_links_of_system`

3. **Store questions by iteration** in `self.questions_by_iteration`

4. **Access settings** via the snapshot:
   ```python
   max_results = self.get_setting("search.max_results", default=10)
   ```

5. **Handle errors gracefully** - return partial results with an error message

---

## Using LangChain Retrievers

The easiest way to add custom search is through LangChain retrievers.

### Registering a Retriever

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Create your retriever
# `documents` is a list of LangChain Document objects you have already loaded
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Register globally
retriever_registry.register("my_documents", retriever)

# Now use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="What does the documentation say about X?",
    search_tool="my_documents",  # Use registered retriever
    programmatic_mode=True
)
```

### Passing Retrievers Directly

```python
from local_deep_research.api import quick_summary

# Create retriever (`my_vectorstore` is a vector store you have already built)
retriever = my_vectorstore.as_retriever()

# Pass directly to API
result = quick_summary(
    query="Search my documents",
    retrievers={"private_docs": retriever},
    search_tool="private_docs",
    programmatic_mode=True
)
```

### Registry Methods

```python
from local_deep_research.web_search_engines.retriever_registry import retriever_registry

# Register
retriever_registry.register("name", retriever)
retriever_registry.register_multiple({"a": ret1, "b": ret2})

# Query
retriever_registry.get("name")
retriever_registry.is_registered("name")
retriever_registry.list_registered()

# Remove
retriever_registry.unregister("name")
retriever_registry.clear()
```

---

## Adding Custom LLM Providers

LLM providers wrap language model APIs for use in LDR.

### Basic Provider

Create in `src/local_deep_research/llm/providers/implementations/`:

```python
# my_provider.py
from typing import Dict, Optional

from langchain_core.language_models import BaseChatModel
from langchain_openai import ChatOpenAI

from ..openai_base import OpenAICompatibleProvider


class MyProvider(OpenAICompatibleProvider):
    """Custom LLM provider."""

    provider_name = "My Provider"
    api_key_setting = "llm.my_provider.api_key"
    url_setting = "llm.my_provider.url"
    default_base_url = "https://api.myprovider.com/v1"
    default_model = "my-model-v1"
    # Optional: set to True if missing key should fall back to a placeholder
    # rather than raising ValueError.
    api_key_optional = False

    @classmethod
    def create_llm(
        cls,
        model_name: Optional[str] = None,
        temperature: float = 0.7,
        settings_snapshot: Optional[Dict] = None,
        **kwargs
    ) -> BaseChatModel:
        """
        Create LLM instance.

        Args:
            model_name: Model to use
            temperature: Sampling temperature
            settings_snapshot: Configuration
            **kwargs: Additional parameters

        Returns:
            LangChain chat model instance
        """
        from ....config.thread_settings import get_setting_from_snapshot

        # Resolve API key via the base helper. Raises ValueError when
        # required and missing, returns the unified placeholder when
        # api_key_optional=True and the key is unset.
        api_key = cls.resolve_api_key_or_placeholder(settings_snapshot)

        # Get base URL
        base_url = get_setting_from_snapshot(
            cls.url_setting,
            cls.default_base_url,
            settings_snapshot=settings_snapshot,
        )

        return ChatOpenAI(
            model=model_name or cls.default_model,
            temperature=temperature,
            api_key=api_key,
            base_url=base_url,
            **kwargs
        )

    @classmethod
    def list_models(cls, settings_snapshot: Optional[Dict] = None) -> list[str]:
        """List available models."""
        return ["my-model-v1", "my-model-v2", "my-model-large"]
```

### Register in Auto-Discovery

Drop the provider class file into `src/local_deep_research/llm/providers/implementations/`. Auto-discovery scans that directory at import time and registers every class that meets all three conditions:

- its name ends with `Provider`
- it subclasses `BaseLLMProvider`
- its `provider_name` is set to a real value (i.e., overridden away from the `"unknown"` default)

Setting `provider_name = "unknown"`, or leaving it unset on the class, causes the class to be **silently filtered out** of auto-discovery. This is a common gotcha when copying an existing provider as a template.

Optional cloud-metadata registration in `auto_discovery.py`:

```python
PROVIDER_METADATA = {
    # ... existing providers ...
    "my_provider": ProviderMetadata(
        provider_id="my_provider",
        provider_name="My Provider",
        company_name="My Company",
        region="US",
        country="United States",
        data_location="US",
        gdpr_compliant=False,
        is_cloud=True,
    ),
}
```

---

## Registering Custom LLMs

For programmatic use, register LLMs directly:

```python
from langchain_openai import ChatOpenAI

from local_deep_research.llm.llm_registry import register_llm, get_llm_from_registry

# Create custom LLM
custom_llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.5,
    api_key="...",
)

# Register it
register_llm("my_gpt4", custom_llm)

# Use in research
from local_deep_research.api import quick_summary

result = quick_summary(
    query="Research topic",
    llms={"my_gpt4": custom_llm},  # Or use registered name
    provider_name="my_gpt4",
    programmatic_mode=True
)
```

### Factory Functions

You can also register factory functions:

```python
def create_my_llm(temperature=0.7):
    return ChatOpenAI(model="gpt-4", temperature=temperature)

register_llm("my_factory", create_my_llm)

# Will be called when needed
llm = get_llm_from_registry("my_factory")
```

### Registry caveat

The built-in providers (ollama, openai, anthropic, ...) live in the same registry and are auto-registered at import time. `clear_llm_registry()` removes them too, and `get_llm()` has no other construction path, so every provider will raise "was not registered by auto-discovery" until you restore them:

```python
from local_deep_research.llm.providers import discover_providers

discover_providers(force_refresh=True)
```

To remove only your own registrations, prefer `unregister_llm` (called with the name you registered, such as `"my_gpt4"` from the example above) over `clear_llm_registry()`.

---

## See Also

- [Architecture Overview](../architecture/OVERVIEW.md) - System architecture
- [Database Schema](../architecture/DATABASE_SCHEMA.md) - Data models
- [Full Configuration Reference](../CONFIGURATION.md) - All settings and environment variables
- [Troubleshooting](../troubleshooting.md) - Common issues
- [API Quickstart](../api-quickstart.md) - Using the API
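

### Appendix: a toy model of the registry caveat

The registry caveat described under Registering Custom LLMs comes down to built-in and user entries sharing one namespace. The sketch below is a toy model of that behavior only. `ToyRegistry` and its methods are hypothetical stand-ins written for illustration; they are not LDR's registry implementation, and the real functions may differ in detail.

```python
from typing import Any, Dict, List


class ToyRegistry:
    """Hypothetical name -> LLM mapping; NOT the real LDR registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}

    def register(self, name: str, entry: Any) -> None:
        self._entries[name] = entry

    def unregister(self, name: str) -> None:
        # Removes a single entry and leaves everything else untouched
        self._entries.pop(name, None)

    def clear(self) -> None:
        # Removes every entry, built-in or user-registered
        self._entries.clear()

    def names(self) -> List[str]:
        return sorted(self._entries)


registry = ToyRegistry()

# Built-in providers and user registrations live side by side
for builtin in ("ollama", "openai", "anthropic"):
    registry.register(builtin, object())
registry.register("my_gpt4", object())

# Targeted removal keeps the built-ins available
registry.unregister("my_gpt4")
after_unregister = registry.names()

# A full clear drops the built-ins as well
registry.clear()
after_clear = registry.names()

print(after_unregister)
print(after_clear)
```

In this toy model the targeted removal leaves the three built-in names in place, while the full clear leaves nothing behind, which is why a rediscovery step is needed after clearing.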