mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

Claude API Grading Benchmark

This benchmark uses Claude 3 Sonnet to grade benchmark results, reading the API key it needs from the local database.

Features

Uses Claude 3 Sonnet for grading benchmark results
Accesses API keys from the local database
Supports SimpleQA and BrowseComp benchmarks
Provides composite scoring with customizable weights
Produces comprehensive metrics and accuracy reports
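
As a rough illustration of composite scoring with customizable weights, the sketch below computes a weighted average of per-benchmark accuracies. The function name, the metric keys, and the numbers are assumptions made for this example; the weighting scheme actually used by benchmark.py may differ.

```python
from typing import Mapping


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-benchmark scores, normalised by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical accuracies and weights, not real benchmark results.
example_scores = {"simpleqa": 0.8, "browsecomp": 0.4}
example_weights = {"simpleqa": 3.0, "browsecomp": 1.0}

result = composite_score(example_scores, example_weights)
print(f"composite score: {result:.2f}")
```

Normalising by the total weight keeps the composite on the same 0 to 1 scale as the individual accuracies, whatever weights are chosen.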

Usage

From the project root directory:

# Run with default settings (source_based strategy, 1 iteration, 5 examples)
./examples/benchmarks/claude_grading/run_benchmark.sh

# Run with custom parameters
./examples/benchmarks/claude_grading/run_benchmark.sh --strategy source_based --iterations 2 --examples 200
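
The flags above could be parsed on the Python side roughly as follows. This is a minimal sketch: the flag names and defaults come from the commands above, while the parser itself, its help text, and the way run_benchmark.sh forwards arguments to benchmark.py are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Claude API grading benchmark (sketch)")
    parser.add_argument("--strategy", default="source_based", help="search strategy to evaluate")
    parser.add_argument("--iterations", type=int, default=1, help="number of iterations")
    parser.add_argument("--examples", type=int, default=5, help="number of benchmark examples")
    return parser


# No arguments: falls back to the documented defaults.
defaults = build_parser().parse_args([])

# Same values as the custom invocation shown above.
custom = build_parser().parse_args(
    ["--strategy", "source_based", "--iterations", "2", "--examples", "200"]
)
print(defaults, custom)
```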

How It Works

The benchmark hooks into the evaluation system by patching the grading module so that it calls the local get_llm function. That function retrieves the API key from the database and configures the Claude model used for grading.

Grading every run with the same model keeps results comparable across strategies and configurations.

Requirements

Valid Claude API key stored in the local database
SearXNG search engine running locally
Python dependencies installed

Output

Results are saved in the benchmark_results directory and include the following metrics:

Accuracy scores
Processing times
Grading confidence
Detailed evaluation reports
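
To show how the metrics listed above relate to individual graded examples, here is a small sketch that aggregates them. The record fields (`is_correct`, `processing_time`, `confidence`) and the sample values are invented for illustration; the actual file format written to benchmark_results is not described here.

```python
from statistics import mean

# Hypothetical per-example grading records, not real benchmark output.
records = [
    {"is_correct": True, "processing_time": 12.0, "confidence": 0.9},
    {"is_correct": False, "processing_time": 20.0, "confidence": 0.6},
    {"is_correct": True, "processing_time": 16.0, "confidence": 0.75},
    {"is_correct": True, "processing_time": 8.0, "confidence": 1.0},
]


def summarize(rows):
    """Aggregate accuracy, mean processing time, and mean grading confidence."""
    return {
        "accuracy": sum(r["is_correct"] for r in rows) / len(rows),
        "mean_processing_time": mean(r["processing_time"] for r in rows),
        "mean_confidence": mean(r["confidence"] for r in rows),
    }


summary = summarize(records)
print(summary)
```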