local-deep-research/examples/benchmarks/README.md
LearningCircuit c842f99f7b fix: Resolve CI test failures in search engines
2025-06-03 02:57:35 +02:00


Benchmarks for Local Deep Research

This directory contains scripts for running benchmarks to evaluate Local Deep Research's performance.

Available Benchmarks

SimpleQA

The SimpleQA benchmark evaluates factual question answering capabilities.

python run_simpleqa.py --examples 10 --iterations 3 --questions 3

Options:

  • --examples: Number of examples to run (default: 10)
  • --iterations: Number of search iterations (default: 3)
  • --questions: Questions per iteration (default: 3)
  • --search-tool: Search tool to use (default: "searxng")
  • --output-dir: Directory to save results (default: "benchmark_results")
  • --no-eval: Skip evaluation
  • --human-eval: Use human evaluation
  • --eval-model: Model to use for evaluation
  • --eval-provider: Provider to use for evaluation
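When sweeping over several configurations, it can help to build the command line programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper that is not part of this repository, and it assumes the flags behave as listed above (value options take one argument, `--no-eval` is a bare switch).

```python
import sys


def build_command(script, options):
    """Turn an options dict into an argument list for a benchmark script.

    Hypothetical helper, not part of the repository.
    - True values become bare flags (e.g. --no-eval)
    - False / None values are skipped
    - anything else becomes "--name value"
    """
    cmd = [sys.executable, script]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)
        elif value is False or value is None:
            continue
        else:
            cmd.extend([flag, str(value)])
    return cmd


# Example: 20 SimpleQA examples, results saved without running the evaluation.
simpleqa_cmd = build_command(
    "run_simpleqa.py",
    {
        "examples": 20,
        "search_tool": "searxng",
        "output_dir": "benchmark_results",
        "no_eval": True,
    },
)
```

The resulting list can be passed to `subprocess.run(simpleqa_cmd)` from the directory that contains the benchmark scripts.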

BrowseComp

The BrowseComp benchmark evaluates web browsing comprehension and complex question answering.

python run_browsecomp.py --examples 5 --iterations 3 --questions 3

Options:

  • --examples: Number of examples to run (default: 2)
  • --iterations: Number of search iterations (default: 1)
  • --questions: Questions per iteration (default: 1)
  • --search-tool: Search tool to use (default: "searxng")
  • --output-dir: Directory to save results (default: "browsecomp_results")

See browsecomp_benchmark_readme.md for more information on how BrowseComp works.

Running All Benchmarks

To run both benchmarks so that their results can be compared:

# Run SimpleQA with default settings
python run_simpleqa.py

# Run BrowseComp with increased iterations and questions
python run_browsecomp.py --iterations 3 --questions 3
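The same two invocations can be driven from a small Python wrapper, for example when scheduling benchmark runs. This is a minimal sketch, assuming it is run from this directory; `run_all` and its `runner` parameter are hypothetical names introduced here, not something the repository provides.

```python
import subprocess
import sys

# The two invocations shown above, expressed as argument lists.
COMMANDS = [
    [sys.executable, "run_simpleqa.py"],
    [sys.executable, "run_browsecomp.py", "--iterations", "3", "--questions", "3"],
]


def run_all(commands=COMMANDS, runner=subprocess.run):
    """Run each command in order and return the exit codes collected so far.

    Stops after the first non-zero exit code so that a failing benchmark
    does not go unnoticed. `runner` is injectable to allow dry runs.
    """
    codes = []
    for cmd in commands:
        result = runner(cmd)
        codes.append(result.returncode)
        if result.returncode != 0:
            break
    return codes
```

Calling `run_all()` launches SimpleQA first and BrowseComp second; a returned list such as `[0, 0]` means both scripts exited successfully.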

Evaluating Results

Results are saved in the specified output directories and include:

  • Raw results (JSONL format)
  • Evaluation results (JSONL format)
  • Summary reports (Markdown format)

The scripts also print a summary of the results to the console, including accuracy metrics.
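The JSONL files can also be post-processed by hand. The sketch below recomputes an accuracy figure from an evaluation file. Note the assumption: the record schema is not documented here, so the `is_correct` field name is hypothetical and should be replaced with whatever key the evaluation records actually use.

```python
import json
from pathlib import Path


def accuracy_from_jsonl(path, field="is_correct"):
    """Return the fraction of records whose `field` is truthy.

    Assumes one JSON object per line. The default field name is a
    placeholder (hypothetical); adjust it to the real schema.
    Returns 0.0 for an empty file.
    """
    total = 0
    correct = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get(field):
                correct += 1
    return correct / total if total else 0.0
```

Running it over the evaluation files from both output directories gives two numbers that can be compared side by side.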