mirror of https://github.com/LearningCircuit/local-deep-research.git synced 2026-06-15 19:46:56 +03:00

LearningCircuit 4ae0041f63 fix(security): harden SSRF metadata blocks and redact log userinfo (#3882)

* fix(security): SSRF parser-differential bypass (GHSA-g23j-2vwm-5c25)

The SSRF validator parsed URLs with `urllib.parse.urlparse` while
`requests` parsed them with `urllib3`. For URLs like
`http://127.0.0.1\@1.1.1.1` the two parsers disagreed: urlparse
extracted `1.1.1.1` (passing the SSRF check) while requests connected
to `127.0.0.1`.

Two-layer fix in `ssrf_validator.validate_url` and
`NotificationURLValidator.validate_service_url`:

- Layer 1: reject URLs containing backslash, ASCII control bytes, or
  whitespace (RFC 3986 forbids these). Catches the advisory PoC.
- Layer 2: extract host with `urllib3.util.parse_url` — the same parser
  `requests` uses internally — so the validator and the HTTP client
  agree on destination by construction. Load-bearing on the
  SafeSession.send path where requests has canonicalised `\` to `%5C`.
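The disagreement and the Layer 1 pre-check can be illustrated with the standard library alone. This is a minimal sketch, not the project's code: `has_forbidden_chars` is a hypothetical helper name, and the exact character set the real validator rejects may differ.

```python
from urllib.parse import urlparse


def has_forbidden_chars(url: str) -> bool:
    """Hypothetical Layer 1 check: True if the URL contains a backslash,
    an ASCII control byte, or whitespace."""
    for ch in url:
        if ch == "\\" or ch.isspace() or ord(ch) < 0x20 or ord(ch) == 0x7F:
            return True
    return False


# The advisory PoC: a backslash placed before "@".
poc = "http://127.0.0.1\\@1.1.1.1"

# urlparse treats everything before the last "@" as userinfo, so it
# reports the attacker-chosen public host.
print(urlparse(poc).hostname)    # 1.1.1.1

# The syntactic pre-check rejects the URL before any host is extracted.
print(has_forbidden_chars(poc))  # True
print(has_forbidden_chars("http://localhost:8080/search?q=test"))  # False
```

Rejecting on syntax first means the validator never has to guess which parser's reading of an ambiguous URL is the right one.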

Credit: @Fushuling, @RacerZ-fighting.

* fix(security): block IPv6 unspecified address (::) in SSRF check

Follow-up to the parser-differential SSRF fix. ``::`` (and equivalent
representations ``0::``, ``0:0:0:0:0:0:0:0``, ``::0``) was not in
``BLOCKED_IP_RANGES`` even though the IPv4 equivalent ``0.0.0.0`` was
(via ``0.0.0.0/8``). On Linux the kernel routes connections to
``[::]:port`` to a service bound on ``[::1]:port`` — same semantics as
``0.0.0.0`` for IPv4 — so an attacker could reach loopback services
through the unspecified-IPv6 form.

Verified end-to-end: a server bound on ``[::1]:<port>`` (loopback only)
received connections from ``http://[::]:<port>/`` before this fix and
none after.

Add ``::/128`` to ``PRIVATE_IP_RANGES`` so all four equivalent
representations (``::``, ``0::``, ``::0``, ``0:0:0:0:0:0:0:0``) are
caught after ``ipaddress.ip_address`` normalisation. Adds regression
tests in both ``test_ssrf_validator.py`` and ``test_notification_validator.py``.
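The normalisation step can be checked directly with the `ipaddress` module. A minimal sketch, assuming a single-network membership test; `is_unspecified_v6` is a hypothetical name, not a function from the project.

```python
import ipaddress

# The single-address network that covers the IPv6 unspecified address.
UNSPECIFIED_V6 = ipaddress.ip_network("::/128")


def is_unspecified_v6(host: str) -> bool:
    """Hypothetical helper: True if host parses to the IPv6 unspecified address."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal at all
    return ip.version == 6 and ip in UNSPECIFIED_V6


# All four spellings normalise to the same address object.
for form in ("::", "0::", "::0", "0:0:0:0:0:0:0:0"):
    print(form, is_unspecified_v6(form))  # True for each

# Loopback is a different address and is not matched by ::/128.
print("::1", is_unspecified_v6("::1"))    # False
```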

* test(security): expand SSRF coverage across DNS, alt IP forms, flags

Adds 62 tests across six new classes in test_ssrf_validator.py:

- TestDnsResolvedBypass — load-bearing path for hostname URLs (not IP
  literals): hostname resolves to loopback / RFC1918 / link-local / AWS
  metadata; multi-A-record DNS with one private IP; gaierror fail-closed;
  IPv6 DNS resolution; IPv4-mapped-IPv6 DNS resolution.
- TestAlternateIpFormsBlocked — octal, decimal-int, short-form (127.1),
  IPv4-mapped IPv6 literals for loopback / RFC1918 / AWS metadata.
- TestAllowFlagMatrix — allow_localhost / allow_private_ips combinations
  against the new ::/128 entry; locks in that :: stays blocked under
  every flag (it is unspecified, not loopback) and that AWS metadata
  stays blocked under every flag.
- TestSchemeRejection — file:, ftp:, gopher:, dict:, schemeless,
  scheme-relative; uppercase HTTPS still accepted (case-insensitive).
- TestNeverRaises — parametrized pathological inputs including empty,
  control bytes, malformed brackets, overflow ports, lone surrogates,
  100k-char URLs. Asserts validate_url returns bool, never raises.
- TestOutOfScopeBehaviorLockedIn — documents current behaviour for 6to4
  (2002:7f00:1::) and NAT64 (64:ff9b::7f00:1) wrapped loopback. These
  pass today (filed as separate hardening); flip the assertions if
  BLOCKED_IP_RANGES is extended.

Full security suite: 3161 passed.

* fix(security): harden SSRF metadata blocks and redact log userinfo

Two defense-in-depth improvements to SSRF protection.

1. Hardcode-block additional cloud-provider metadata IPs.

   Previously only AWS IMDS (169.254.169.254) was always-blocked. The
   same parallel applies to other cloud-credential endpoints that
   become reachable when a caller passes allow_private_ips=True
   (legitimately used for SearXNG / Ollama / etc on private networks):

   - 169.254.170.2  AWS ECS task metadata v3
   - 169.254.170.23 AWS ECS task metadata v4
   - 169.254.0.23   Tencent Cloud
   - 100.100.100.200 AlibabaCloud

   Replace AWS_METADATA_IP with ALWAYS_BLOCKED_METADATA_IPS frozenset
   and update the membership check in is_ip_blocked. Test files that
   imported AWS_METADATA_IP updated to import the new constant.

2. Redact userinfo from URL rejection logs.

   RFC 3986 §3.2.1 allows credentials in URL userinfo. Five log sites
   in ssrf_validator.py and three in notification_validator.py used
   to interpolate {url} or url[:50]; route all of them through a new
   redact_url_for_log() helper that returns only scheme://host:port.

Plus drift cleanup: SECURITY.md / SearXNG-Setup.md / safe_requests.py
docstrings / pdf_service.py comment refreshed for the five-IP set.
Tech-debt: add membership tests for ::/128 and 0.0.0.0/8 that were
missing after PR #3873's IPv6-unspecified bypass fix.
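The redaction described in item 2 might look roughly like the following. This is a sketch under stated assumptions: the commit names `redact_url_for_log()` but does not show its body, so the function below (deliberately named `redact_url_for_log_sketch`) and its fallback string are illustrative only.

```python
from urllib.parse import urlsplit


def redact_url_for_log_sketch(url: str) -> str:
    """Illustrative only: return scheme://host[:port], dropping userinfo,
    path, query, and fragment so credentials never reach the log."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        # Assumed fallback: never echo an unparseable URL verbatim.
        return "<unparseable-url>"
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    netloc = f"{host}:{port}" if port is not None else host
    return f"{parts.scheme}://{netloc}"


print(redact_url_for_log_sketch("https://user:secret@example.com:8443/path?token=x"))
# https://example.com:8443
```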

* fix(security): address review nits on #3882

- Fix docstring indentation in SafeSession.__init__ (Note: continuation
  was 12-space indented in a 16-space context). Sphinx/autodoc would
  have rendered it misaligned.
- Remove unused _all_metadata_ips helper from
  TestAlwaysBlockedMetadataIPs — both test methods inline the same
  logic; the helper was dead.

AI code review feedback on #3882, no behavior change.

2026-05-09 01:50:05 +02:00

3.9 KiB

SearXNG Integration for Local Deep Research

This document explains how to configure and use the SearXNG integration with Local Deep Research.

Configuring SearXNG Access

The SearXNG search engine is disabled by default until you provide an instance URL. This ensures the system doesn't attempt to use public instances without explicit configuration.

Setting Up Access

You have two ways to enable the SearXNG search engine:

Environment Variable (Recommended):

# Add to your .env file or set in your environment
SEARXNG_INSTANCE=http://localhost:8080

# Optional: Set custom delay between requests (in seconds)
SEARXNG_DELAY=2.0

Configuration Parameter: Add to your config.py:

# In config.py
SEARXNG_CONFIG = {
    "instance_url": "http://localhost:8080",
    "delay_between_requests": 2.0
}
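As an illustration of how the two environment variables map onto the config keys above, here is a minimal sketch. The helper `load_searxng_settings` and the 2.0-second fallback are assumptions made for this example; they are not taken from the project's code.

```python
import os
from typing import Mapping, Optional

# Assumption: fall back to the delay used in the examples above.
DEFAULT_DELAY_SECONDS = 2.0


def load_searxng_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Hypothetical helper: build a SEARXNG_CONFIG-style dict from the
    environment, or return None when no instance URL is set (engine disabled)."""
    instance = env.get("SEARXNG_INSTANCE", "").strip()
    if not instance:
        return None
    raw_delay = env.get("SEARXNG_DELAY", "").strip()
    try:
        delay = float(raw_delay) if raw_delay else DEFAULT_DELAY_SECONDS
    except ValueError:
        delay = DEFAULT_DELAY_SECONDS  # ignore a non-numeric delay
    return {
        "instance_url": instance.rstrip("/"),
        "delay_between_requests": delay,
    }


print(load_searxng_settings({}))  # None
print(load_searxng_settings({"SEARXNG_INSTANCE": "http://localhost:8080"}))
```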

Self-Hosting SearXNG (Recommended)

For the most ethical usage, we strongly recommend self-hosting your own SearXNG instance:

Using Docker (easiest method)

# Pull the SearXNG Docker image
docker pull searxng/searxng

# Run SearXNG (will be available at http://localhost:8080)
docker run -d -p 8080:8080 --name searxng searxng/searxng

Using Docker Compose (recommended for production)

Create a file named docker-compose.yml with the following content:

version: '3'
services:
  searxng:
    container_name: searxng
    image: searxng/searxng
    ports:
      - "8080:8080"
    volumes:
      - ./searxng:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=http://localhost:8080/
    restart: unless-stopped

Run with Docker Compose:

docker-compose up -d

Using Public Instances

If you must use a public instance:

Get Permission: Always contact the administrator of a public instance before pointing automated queries at it
Respect Resources: Use a longer delay between requests, at least 4 to 5 seconds
Limited Usage: Keep your research volume reasonable

Example configuration for a public instance:

SEARXNG_INSTANCE=https://instance.example.com
SEARXNG_DELAY=5.0
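To make the effect of the delay setting concrete, the sketch below spaces successive requests by at least a fixed interval. `RequestThrottle` is a hypothetical class written for this example; how Local Deep Research enforces the delay internally is not shown in this document.

```python
import time


class RequestThrottle:
    """Hypothetical helper: ensure at least delay_seconds between calls to wait()."""

    def __init__(self, delay_seconds: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock   # injectable so the class can be tested without sleeping
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self) -> float:
        """Block until the delay has elapsed; return the seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

Calling `throttle.wait()` immediately before each query to the instance keeps the request rate at or below one per `SEARXNG_DELAY` seconds; the first call returns without sleeping.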

Checking Configuration

To verify that SearXNG is properly configured:

from web_search_engines.search_engine_factory import create_search_engine

# Create the engine
engine = create_search_engine("searxng")

# Check if available
if engine and hasattr(engine, 'is_available') and engine.is_available:
    print(f"SearXNG configured with instance: {engine.instance_url}")
    print(f"Delay between requests: {engine.delay_between_requests} seconds")
else:
    print("SearXNG is not properly configured or is disabled")

Network Security

SearXNG is designed for self-hosting, so Local Deep Research allows the SearXNG instance URL to point at private network IPs by default. This means you can run SearXNG on:

Localhost: http://127.0.0.1:8080 or http://localhost:8080
LAN IPs: http://192.168.1.100:8080, http://10.0.0.5:8080, http://172.16.0.2:8080
Docker networks: http://172.17.0.2:8080
Local hostnames: http://searxng.local:8080 (if configured in DNS/hosts)

This is intentional and secure because:

The SearXNG URL is admin-configured, not user input
Private IPs are only accessible from your local network
Cloud metadata endpoints (AWS IMDS / ECS, Azure, OCI, DigitalOcean, AlibabaCloud, Tencent Cloud — see ssrf_validator.ALWAYS_BLOCKED_METADATA_IPS) are always blocked to prevent credential theft in cloud environments
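The "private IPs allowed, metadata always blocked" rule can be sketched as follows. The constant name `ALWAYS_BLOCKED_METADATA_IPS` and the `allow_private_ips` flag appear in the project; the function body, the name `is_ip_blocked_sketch`, and the use of `ipaddress` properties in place of explicit ranges are assumptions for illustration. The five addresses are those named in the commit notes for this file.

```python
import ipaddress

# The five-IP set described in the commit notes (IMDS, ECS v3/v4, Tencent, Alibaba).
ALWAYS_BLOCKED_METADATA_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "169.254.169.254",
        "169.254.170.2",
        "169.254.170.23",
        "169.254.0.23",
        "100.100.100.200",
    )
)


def is_ip_blocked_sketch(host: str, allow_private_ips: bool = False) -> bool:
    """Illustrative only: decide whether an IP literal should be refused."""
    ip = ipaddress.ip_address(host)
    # Metadata endpoints are refused no matter which flags are set.
    if ip in ALWAYS_BLOCKED_METADATA_IPS:
        return True
    # Assumption: unspecified addresses (:: and 0.0.0.0) also stay blocked.
    if ip.is_unspecified:
        return True
    internal = ip.is_private or ip.is_loopback or ip.is_link_local
    return internal and not allow_private_ips


print(is_ip_blocked_sketch("192.168.1.100", allow_private_ips=True))    # False
print(is_ip_blocked_sketch("169.254.169.254", allow_private_ips=True))  # True
```

Checking the metadata set before consulting the flag is what keeps a permissive SearXNG configuration from exposing cloud credentials.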

Troubleshooting

If you encounter errors:

Check that your instance is running
Verify the URL is correct in your environment variables
Ensure you can access the instance in your browser
Check firewall settings and network connectivity
