mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-16 03:51:07 +03:00
* chore(lint): add ruff rules for logging, performance, exceptions, and print detection

  Add wave 2 lint rules: G, PERF, RET, TRY, T20, C4, ERA. All existing violations are suppressed via ignore/per-file-ignores so this config change is merge-safe. Follow-up PRs will fix violations and remove the ignore entries incrementally.

* fix(lint): exempt pre-commit hooks from T201 print rule (#3270)

  Pre-commit hooks are CLI scripts where print is the intended output interface, same as scripts/ and cli/ directories already exempted.

* fix(lint): fix all low-count ruff violations instead of suppressing them (#3275)

* fix(lint): replace manual dict-building loops with dict comprehensions (PERF403)

* fix(lint): replace bare Exception raises with specific built-in types (TRY002)

  Replace all `raise Exception(...)` in production code with appropriate built-in exception types: RuntimeError for operational/state failures, ValueError for invalid data, and ConnectionError for HTTP errors.

* fix(lint): resolve TRY004 and PERF402 ruff violations

  Use TypeError instead of ValueError for isinstance/issubclass type checks (TRY004), and replace manual for-loop list copies with list.extend() (PERF402).
* fix(lint): fix all low-count ruff violations instead of suppressing them

  Fix all violations for 15 ruff rules that had ≤10 occurrences each, rather than suppressing them with ignore directives:

  - TRY002: raise-vanilla-class → use specific built-in exceptions
  - TRY004: type-check-without-type-error → use TypeError
  - C408: unnecessary-collection-call → use dict/list literals
  - C401: unnecessary-generator-set → use set comprehensions
  - C416: unnecessary-comprehension → use list()/set()
  - C414: unnecessary-double-cast-or-process → simplify
  - PERF403: manual-dict-comprehension → use dict comprehensions
  - PERF102: incorrect-dict-iterator → use .values()/.keys()
  - PERF402: manual-list-copy → use list.extend()
  - RET503/RET506/RET507/RET508: superfluous else after return/raise/continue/break
  - RET501/RET502: unnecessary/implicit return None

  Adds per-file-ignores for tests/ and examples/ where these patterns are acceptable (e.g. bare Exception in tests, dict() calls in fixtures).

* fix(lint): enforce E722, ERA001, RET505 and fix pre-commit RET503 gap (#3276)

  Remove three rules from the global ignore list by fixing all violations:

  E722 (bare except), 6 violations in tests: Replace `except:` with `except Exception:` to avoid swallowing KeyboardInterrupt and SystemExit.

  ERA001 (commented-out code), 25 violations: Delete 18 true positives (dead variables, disabled debug logs, commented-out imports). Add `# noqa: ERA001` to 7 false positives (template instructions, type annotations, documentation comments).

  RET505 (superfluous else after return), 413 violations: Auto-fix all occurrences. Also fixes 5 cascading RET506/RET507 violations exposed by the RET505 removals.

  Pre-commit hooks gap: Add RET503 to `.pre-commit-hooks/**` per-file-ignores alongside T201.
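
The E722 item above rests on the fact that `KeyboardInterrupt` and `SystemExit` derive from `BaseException`, not `Exception`. A minimal sketch of the difference follows; the helper names are hypothetical and do not come from the repository.

```python
def run_with_bare_except(func):
    """Run func; a bare except catches everything, including SystemExit."""
    try:
        func()
    except:  # noqa: E722 - deliberately bare to show the problem
        return "swallowed"
    return "ok"


def run_with_except_exception(func):
    """Run func; only Exception subclasses are caught here."""
    try:
        func()
    except Exception:
        return "swallowed"
    return "ok"


def request_exit():
    # Hypothetical stand-in for code that asks the interpreter to stop.
    raise SystemExit(1)


def fail_normally():
    raise ValueError("bad input")


# The bare except hides the exit request from the caller.
bare_result = run_with_bare_except(request_exit)

# `except Exception` still handles ordinary errors...
normal_result = run_with_except_exception(fail_normally)

# ...but lets SystemExit propagate so the process can actually stop.
try:
    run_with_except_exception(request_exit)
    exit_propagated = False
except SystemExit:
    exit_propagated = True
```

With `except Exception:`, ordinary failures are still handled while exit and interrupt requests reach the caller.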
* fix(lint): enforce RET504 and TRY301: fix all violations (#3279)

* fix(lint): enforce RET504: collapse unnecessary assign-before-return

  Auto-fix all 46 RET504 violations via ruff unsafe-fixes: collapse `result = expr; return result` into `return expr`. Remove RET504 from global ignore list. Add to tests/examples per-file-ignores where intermediate variables aid test clarity. Also removes TRY301 from global ignore (violations fixed in next commit).

* fix(lint): enforce TRY301: fix raises inside broad try/except blocks

  Structural fixes for 65 TRY301 violations:

  Security-critical fixes:
  - url_validator.py: move 6 validation raises before try block, replace isinstance-based re-raise with specific except clause
  - path_validator.py: move validation outside try block
  - env_settings.py: separate parsing (try) from validation (outside)

  Route/service fixes:
  - research_routes.py: replace raise-then-catch with direct error return
  - mcp/server.py: move all 7 tool validations before try blocks
  - news/api.py: move validation before try, noqa for db-session raises
  - notifications: move rate limit and URL validation before try blocks
  - iterative_refinement_strategy.py: move JSON validation after try

  Added noqa for intentional patterns: re-raise in except handlers, nested function definitions, db-session-dependent checks, rate limit re-raises for base class retry logic.
* merge: resolve conflicts between wave2 lint branch and main

  Resolve 14 merge conflicts by always starting from main's version and re-applying lint fixes on top:

  - mcp_strategy.py, ollama.py, security_settings.py, delete_routes.py: Take main's code, re-apply RET505 (remove else: after return)
  - mcp/server.py (3 conflicts): Take main's ValidationError handlers and set_settings_context, re-apply TRY301 fixes, fix sensitive data logging
  - research_routes.py: Take main, fix duplicate block (merge artifact)
  - settings_routes.py: Take main's default-settings fallback feature
  - meta_search_engine.py, parallel_search_engine.py: Take main's get_available_engines delegation, delete unreachable code
  - search_engine_ddg.py, search_engine_google_pse.py: Take main's sanitization, re-apply RET506 (if not elif after raise)
  - rag_routes.py: Accept main's deletion (route moved to delete_routes)
  - encryption_check.py: Accept main's deletion (dead code)
  - test_storage_coverage.py: Remove broken test classes referencing undefined stubs
  - pre-commit hooks: extend per-file-ignores for ERA001, RET504

* fix: revert ValueError→TypeError changes that break tests and API contracts

  Revert TRY004 fixes in 3 files where changing ValueError to TypeError would break existing tests and HTTP status code contracts:

  - card_factory.py: 5 tests assert pytest.raises(ValueError)
  - base_rater.py: flask_api.py catches ValueError for HTTP 400 responses; TypeError would fall through to HTTP 500
  - full_search.py: test asserts pytest.raises(ValueError)

  Add # noqa: TRY004 to suppress the lint rule on these lines.

* fix: move benchmark_data check back inside try block

  The ValueError for missing benchmark_data must be inside the try/except so the except handler can mark the run as FAILED in the database. Without this, the exception propagates unhandled in a daemon thread, leaving the benchmark run stuck in RUNNING state permanently.
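
The benchmark_data fix above can be illustrated with a small sketch. It assumes a dict standing in for the database table and a hypothetical `run_benchmark` worker; neither is the repository's actual code.

```python
import threading

# Hypothetical stand-in for the benchmark runs table: run_id -> status.
RUNS = {}


def run_benchmark(run_id, benchmark_data):
    """Worker body: any failure must end with the run marked FAILED."""
    RUNS[run_id] = "RUNNING"
    try:
        # The check stays inside the try block on purpose, so the handler
        # below can record the failure instead of the thread dying silently.
        if benchmark_data is None:
            raise ValueError("missing benchmark_data")  # noqa: TRY301
        RUNS[run_id] = "COMPLETED"
    except Exception:
        RUNS[run_id] = "FAILED"


# Run once without data and once with data, each in a daemon thread.
for run_id, data in (("run-1", None), ("run-2", {"questions": 3})):
    worker = threading.Thread(
        target=run_benchmark, args=(run_id, data), daemon=True
    )
    worker.start()
    worker.join()
```

If the check were moved above the `try`, the `ValueError` would end the thread before the handler ran and `run-1` would stay in `RUNNING`.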
* chore(lint): remove ERA rule and suppress TRY004 globally

  Remove ERA (eradicate: commented-out code detection) from ruff select:

  - 28% false positive rate in our codebase (7 of 25 violations)
  - No major Python project enables it (Django, FastAPI, Pydantic, Airflow)
  - Ruff itself doesn't use it; autofix was demoted to manual-only
  - 172 noqa suppressions provided zero enforcement value

  Suppress TRY004 (type-check-without-type-error) globally:

  - Ruff maintainer agreed the autofix "can change functionality"
  - We already had to revert 3 TypeError changes that broke tests and HTTP 400→500 API contracts
  - Django, Flask, pandas all use isinstance + ValueError routinely
  - Pylint has no equivalent rule; near-zero PyPI adoption

  Remove all 173 # noqa: ERA001 and 49 # noqa: TRY004 comments from the codebase; they are no longer needed with rules disabled/suppressed.

* fix: resolve mypy errors, failing MCP test, and TRY301 noqa

  - search_engine_factory.py: restore typed intermediate variable to fix mypy no-any-return (RET504 collapse lost the type annotation)
  - search_engine_pubchem.py: add explicit list[str] type annotation
  - test_edge_cases.py: fix assertion that expected engine name in sanitized error message
  - mcp/server.py: add noqa: TRY301 to validation raises inside try blocks (from main's new merge code)
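
The HTTP 400→500 regression mentioned for TRY004 can be sketched without a web framework. All names below are hypothetical; this is only an illustration of how a caller that maps `ValueError` to a 400 response behaves when the raised type changes, not the code in base_rater.py or flask_api.py.

```python
def rate_with_value_error(value):
    """Type check that raises ValueError, as the existing callers expect."""
    if not isinstance(value, (int, float)):
        raise ValueError("value must be numeric")  # noqa: TRY004
    return value * 2


def rate_with_type_error(value):
    """Same check after a TRY004-style fix: the exception type changes."""
    if not isinstance(value, (int, float)):
        raise TypeError("value must be numeric")
    return value * 2


def handle_request(rater, value):
    """Hypothetical handler: ValueError means client error, anything else 500."""
    try:
        return 200, rater(value)
    except ValueError as exc:
        return 400, str(exc)
    except Exception:
        return 500, "internal error"


ok_response = handle_request(rate_with_value_error, 21)
bad_input_response = handle_request(rate_with_value_error, "abc")
regressed_response = handle_request(rate_with_type_error, "abc")
```

The same bad input yields a 400 with `ValueError` and a 500 with `TypeError`, which is why the change alters the API contract rather than only the style.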
354 lines
11 KiB
Python
"""Tests for cache-related database models."""

import hashlib
import json
import time
from datetime import datetime, timedelta, timezone, UTC

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from local_deep_research.database.models import Base, Cache, SearchCache


class TestCacheModels:
    """Test suite for cache-related models."""

    @pytest.fixture
    def engine(self):
        """Create an in-memory SQLite database for testing."""
        engine = create_engine("sqlite:///:memory:")
        Base.metadata.create_all(engine)
        yield engine
        engine.dispose()

    @pytest.fixture
    def session(self, engine):
        """Create a database session for testing."""
        Session = sessionmaker(bind=engine)
        session = Session()
        yield session
        session.close()

    def test_cache_creation(self, session):
        """Test creating basic cache entries."""
        cache_entry = Cache(
            cache_key="llm_response_12345",
            cache_text="This is the cached LLM response for the query about quantum physics.",
            cache_type="llm_response",
            source="openai",
            ttl_seconds=86400,  # 24 hours
            expires_at=datetime.now(UTC) + timedelta(hours=24),
            cache_value={
                "model": "gpt-4",
                "temperature": 0.7,
                "query": "explain quantum entanglement",
            },
            size_bytes=1024,
        )

        session.add(cache_entry)
        session.commit()

        # Verify cache entry
        saved = session.query(Cache).first()
        assert saved is not None
        assert saved.cache_key == "llm_response_12345"
        assert "quantum physics" in saved.cache_text
        assert saved.cache_type == "llm_response"
        assert saved.cache_value["model"] == "gpt-4"
        assert saved.hit_count == 0
        assert saved.size_bytes == 1024

    def test_cache_expiration(self, session):
        """Test cache expiration functionality."""
        now = datetime.now(UTC)

        # Create expired cache
        expired = Cache(
            cache_key="expired_cache",
            cache_text="Old data",
            cache_type="test",
            expires_at=now - timedelta(hours=1),
        )

        # Create valid cache
        valid = Cache(
            cache_key="valid_cache",
            cache_text="Fresh data",
            cache_type="test",
            expires_at=now + timedelta(hours=1),
        )

        # Create non-expiring cache
        permanent = Cache(
            cache_key="permanent_cache",
            cache_text="Never expires",
            cache_type="test",
            expires_at=None,
        )

        session.add_all([expired, valid, permanent])
        session.commit()

        # Test is_expired method
        assert expired.is_expired() is True
        assert valid.is_expired() is False
        assert permanent.is_expired() is False

        # Query non-expired entries
        non_expired = (
            session.query(Cache)
            .filter((Cache.expires_at.is_(None)) | (Cache.expires_at > now))
            .all()
        )

        assert len(non_expired) == 2
        keys = [c.cache_key for c in non_expired]
        assert "valid_cache" in keys
        assert "permanent_cache" in keys

    def test_search_cache(self, session):
        """Test search-specific cache functionality."""
        query = "quantum physics research"
        query_hash = hashlib.sha256(query.encode()).hexdigest()

        current_time = int(time.time())
        search_cache = SearchCache(
            query_hash=query_hash,
            query_text=query,
            results=json.dumps(
                [
                    {
                        "title": "Quantum Mechanics",
                        "url": "https://example.com/qm",
                    },
                    {"title": "Physics Today", "url": "https://example.com/pt"},
                ]
            ),
            created_at=current_time,
            expires_at=current_time + 21600,  # 6 hours
            last_accessed=current_time,
            access_count=1,
        )

        session.add(search_cache)
        session.commit()

        # Verify search cache
        saved = (
            session.query(SearchCache).filter_by(query_hash=query_hash).first()
        )
        assert saved is not None
        assert saved.query_text == query
        results = json.loads(saved.results)
        assert len(results) == 2
        assert saved.access_count == 1

    def test_cache_categories(self, session):
        """Test different cache categories."""
        categories = [
            ("llm_response", "AI generated content", "openai"),
            ("search_result", "Search engine results", "google"),
            ("api_response", "External API response", "external"),
            ("computation", "Expensive computation result", "local"),
        ]

        for cache_type, value, source in categories:
            cache = Cache(
                cache_key=f"{cache_type}_test",
                cache_text=value,
                cache_type=cache_type,
                source=source,
                expires_at=datetime.now(UTC) + timedelta(hours=1),
            )
            session.add(cache)

        session.commit()

        # Query by category
        llm_caches = (
            session.query(Cache).filter_by(cache_type="llm_response").all()
        )
        assert len(llm_caches) == 1
        assert llm_caches[0].cache_text == "AI generated content"
        assert llm_caches[0].source == "openai"

    def test_cache_hit_tracking(self, session):
        """Test cache hit counting and access time updates."""
        cache = Cache(
            cache_key="hit_test",
            cache_text="Test content",
            cache_type="test",
            hit_count=0,
        )

        session.add(cache)
        session.commit()

        # Record multiple hits
        original_accessed = cache.accessed_at
        for i in range(5):
            cache.record_hit()
            session.commit()

        assert cache.hit_count == 5
        assert cache.accessed_at > original_accessed

    def test_search_cache_deduplication(self, session):
        """Test that identical queries produce the same hash."""
        query1 = "machine learning algorithms"
        query2 = "machine learning algorithms"  # Same query
        query3 = "Machine Learning Algorithms"  # Different case

        hash1 = hashlib.sha256(query1.encode()).hexdigest()
        hash2 = hashlib.sha256(query2.encode()).hexdigest()
        hash3 = hashlib.sha256(query3.encode()).hexdigest()

        assert hash1 == hash2
        assert hash1 != hash3  # Different case produces different hash

    def test_cache_size_management(self, session):
        """Test tracking cache entry sizes."""
        large_text = "x" * 10000
        small_text = "small"

        large_cache = Cache(
            cache_key="large_entry",
            cache_text=large_text,
            cache_type="test",
            size_bytes=len(large_text.encode()),
        )

        small_cache = Cache(
            cache_key="small_entry",
            cache_text=small_text,
            cache_type="test",
            size_bytes=len(small_text.encode()),
        )

        session.add_all([large_cache, small_cache])
        session.commit()

        # Query total cache size - sum all sizes
        from sqlalchemy import func

        total_size = (
            session.query(func.sum(Cache.size_bytes))
            .filter(Cache.size_bytes.isnot(None))
            .scalar()
            or 0
        )

        assert large_cache.size_bytes > small_cache.size_bytes
        assert total_size > 10000

    def test_cache_metadata_usage(self, session):
        """Test storing and retrieving cache metadata."""
        metadata = {
            "model": "gpt-4",
            "temperature": 0.7,
            "max_tokens": 1000,
            "timestamp": "2024-01-01T00:00:00Z",
        }

        cache = Cache(
            cache_key="metadata_test",
            cache_text="Response text",
            cache_type="llm_response",
            cache_value=metadata,
        )

        session.add(cache)
        session.commit()

        saved = session.query(Cache).first()
        assert saved.cache_value == metadata
        assert saved.cache_value["model"] == "gpt-4"

    def test_search_cache_with_filters(self, session):
        """Test search cache with various filter parameters."""
        current_time = int(time.time())

        # Add multiple search caches
        queries = [
            ("python tutorials", current_time - 3600),  # 1 hour ago
            ("javascript frameworks", current_time - 7200),  # 2 hours ago
            ("rust programming", current_time - 86400),  # 1 day ago
        ]

        for query, created in queries:
            query_hash = hashlib.sha256(query.encode()).hexdigest()
            cache = SearchCache(
                query_hash=query_hash,
                query_text=query,
                results=json.dumps([{"title": f"Result for {query}"}]),
                created_at=created,
                expires_at=created + 86400,  # 24 hour TTL
                last_accessed=created,
                access_count=1,
            )
            session.add(cache)

        session.commit()

        # Query recent caches (last 3 hours)
        recent_threshold = current_time - 10800
        recent_caches = (
            session.query(SearchCache)
            .filter(SearchCache.created_at >= recent_threshold)
            .all()
        )

        assert len(recent_caches) == 2

    def test_cache_cleanup_old_entries(self, session):
        """Test cleanup of expired cache entries."""
        now = datetime.now(timezone.utc)

        # Create caches with different expiration times
        for i in range(10):
            cache = Cache(
                cache_key=f"cache_{i}",
                cache_text=f"Content {i}",
                cache_type="test",
                expires_at=now - timedelta(hours=i),  # Some expired, some not
            )
            session.add(cache)

        session.commit()

        # Delete expired entries
        session.query(Cache).filter(Cache.expires_at < now).delete()
        session.commit()

        # Verify cleanup
        remaining = session.query(Cache).count()
        assert remaining == 1  # Only cache_0 should remain (expires_at = now)

    def test_cache_update_operations(self, session):
        """Test updating cache entries."""
        cache = Cache(
            cache_key="update_test",
            cache_text="Original content",
            cache_type="test",
            ttl_seconds=3600,
        )
        cache.set_ttl(3600)  # Set TTL

        session.add(cache)
        session.commit()

        # Update content
        cache.cache_text = "Updated content"
        cache.cache_value = {"version": 2}
        session.commit()

        # Verify updates
        saved = session.query(Cache).filter_by(cache_key="update_test").first()
        assert saved.cache_text == "Updated content"
        assert saved.cache_value["version"] == 2
        assert saved.ttl_seconds == 3600
        assert saved.expires_at is not None