mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
- Quote variables to prevent word splitting (SC2086) - Use 'read -r' to prevent backslash mangling (SC2162) - Use 'cd ... || exit' for safe directory changes (SC2164) - Use '-n' instead of '\! -z' for string checks (SC2236) - Use pgrep instead of ps | grep (SC2009) - Check exit codes directly instead of using $? (SC2181) - Declare and assign separately for exports (SC2155) - Fix unused loop variables with underscore prefix (SC2034) - Remove stray markdown backticks from ollama_entrypoint.sh
Claude API Grading Benchmark
This benchmark integrates Claude 3 Sonnet for grading benchmark results with proper API access through the local database.
Features
- Uses Claude 3 Sonnet for grading benchmark results
- Accesses API keys from the local database
- Supports SimpleQA and BrowseComp benchmarks
- Provides composite scoring with customizable weights
- Comprehensive metrics and accuracy reports
Usage
From the project root directory:
# Run with default settings (source_based strategy, 1 iteration, 5 examples)
./examples/benchmarks/claude_grading/run_benchmark.sh
# Run with custom parameters
./examples/benchmarks/claude_grading/run_benchmark.sh --strategy source_based --iterations 2 --examples 200
How It Works
The benchmark integrates with the evaluation system by patching the grading module to use the local get_llm function, which properly retrieves API keys from the database and configures the Claude model for grading.
This approach ensures accurate grading of benchmark results and enables comparison between different strategies and configurations.
Requirements
- Valid Claude API key stored in the local database
- SearXNG search engine running locally
- Python dependencies installed
Output
Results are saved in the benchmark_results directory with comprehensive metrics:
- Accuracy scores
- Processing times
- Grading confidence
- Detailed evaluation reports