mirror of
https://github.com/LearningCircuit/local-deep-research.git
synced 2026-06-15 19:46:56 +03:00
* fix(pdf): render CJK characters in exported PDFs (#4055)

  The PDF stylesheet hard-coded a Latin-only font stack, so WeasyPrint silently dropped Chinese/Japanese/Korean glyphs from downloads even when they rendered fine in the HTML view. Add Noto Sans CJK / Microsoft YaHei / SimSun fallbacks for both body and monospace families, and install fonts-noto-cjk in the Docker runtime stage so the slim base image actually has glyph coverage. Non-Docker installs still need a CJK font package on the host.

* fix(pdf): broaden CJK font fallbacks + document host requirement

  Extend the PDF CSS font stack to cover macOS (PingFang, Hiragino, Apple SD Gothic Neo) and additional Windows families (Microsoft JhengHei, Yu Gothic, Malgun Gothic), so pip installs on those platforms render CJK without any user action. Document the per-distro CJK font install command in install-pip.md and add a new FAQ entry. Linux pip/server hosts still need fonts-noto-cjk installed manually; there is no in-code way to fix that without bundling ~20 MB of fonts into the wheel.

* test(pdf): assert CJK glyph embedding end-to-end (#4055)

  Round-trip CJK text through markdown → PDF → pypdf extract_text so CI fails if fonts-noto-cjk is ever removed from the Docker runtime image. The pytest-tests job runs inside that image, so the test sees the installed fonts; bare hosts without CJK fonts skip the assertion via an fc-list gate. Does not catch CSS-fallback-stack regressions on its own: fontconfig auto-substitutes a CJK family on Linux even for a Latin-only stack. The CSS fallbacks still matter on Windows/macOS, which CI does not exercise (documented in the test docstring).
@@ -255,6 +255,10 @@ RUN apt-get update && apt-get upgrade -y \
     shared-mime-info \
     # GLib and GObject dependencies (libgobject is included in libglib2.0-0)
     libglib2.0-0 \
+    # CJK fonts so WeasyPrint can render Chinese/Japanese/Korean glyphs
+    # in exported PDFs — without these the slim base image has no CJK
+    # coverage and CJK text vanishes from the output (issue #4055).
+    fonts-noto-cjk \
     && rm -rf /var/lib/apt/lists/*

 # Create non-root user for running service (security best practice)
1
changelog.d/4055.bugfix.md
Normal file
@@ -0,0 +1 @@
+**Chinese/Japanese/Korean text now renders in exported PDFs.** The default PDF stylesheet hard-coded a Latin-only font stack, so any CJK characters in the research result were dropped silently from the download even though they displayed correctly in the browser. The minimal CSS now includes a broad CJK fallback chain (Noto Sans CJK, PingFang, Hiragino, Apple SD Gothic Neo, Microsoft YaHei/JhengHei, Yu Gothic, Malgun Gothic, SimSun) covering Windows, macOS, and Linux desktops out of the box, and the Docker image now installs `fonts-noto-cjk` so the slim base image has glyph coverage. Linux pip/server installs still need a CJK font package on the host — see [install-pip.md](../install-pip.md) and the [FAQ](../faq.md#chinese--japanese--korean-text-is-missing-from-exported-pdfs) for the per-distro commands.
12
docs/faq.md
@@ -197,6 +197,18 @@ This issue should be fixed in recent versions. If you encounter it, ensure you'r

 Use `LDR_SEARCH_TOOL` instead if needed.

+### Chinese / Japanese / Korean text is missing from exported PDFs
+
+PDF export uses WeasyPrint, which resolves glyphs through the host's installed fonts. If your system has no CJK font installed, those characters disappear silently from the PDF even though they render fine in the browser. Install a CJK font package:
+
+- **Debian/Ubuntu:** `sudo apt install fonts-noto-cjk && fc-cache -fv`
+- **Fedora/RHEL:** `sudo dnf install google-noto-sans-cjk-fonts && fc-cache -fv`
+- **Alpine:** `apk add font-noto-cjk`
+- **macOS / Windows:** CJK fonts ship with the OS — no install needed.
+- **Docker (official image):** `fonts-noto-cjk` is bundled, no action needed.
+
+After installing, restart LDR and re-export the PDF.
+
 ## Search Engines

 ### SearXNG connection errors
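The FAQ entry above says that missing glyphs come down to which fonts the host exposes. A minimal sketch of how one might check that before and after installing a font package, assuming the fontconfig `fc-list` CLI is on `PATH`. The helper name `cjk_font_coverage` is hypothetical and not part of LDR, and a `False` result on a host without `fc-list` only means the check could not run, not that WeasyPrint has no fonts available.

```python
import shutil
import subprocess

# Language tags passed to fontconfig; zh/ja/ko mirror the FAQ heading.
CJK_LANGS = ("zh", "ja", "ko")


def cjk_font_coverage(langs=CJK_LANGS, timeout=5.0):
    """Map each language tag to True if fontconfig lists a font covering it.

    Hypothetical diagnostic helper. Returns False for every language when
    the fc-list binary is not available.
    """
    if shutil.which("fc-list") is None:
        return {lang: False for lang in langs}

    coverage = {}
    for lang in langs:
        try:
            # "fc-list :lang=XX family" prints one family per matching font.
            result = subprocess.run(
                ["fc-list", f":lang={lang}", "family"],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            coverage[lang] = False
            continue
        coverage[lang] = bool(result.stdout.strip())
    return coverage


if __name__ == "__main__":
    for lang, ok in cjk_font_coverage().items():
        print(f"{lang}: {'font found' if ok else 'no font reported'}")
```

Checking the three languages separately is a design choice of this sketch: a host can have a font that covers one script but not the others.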
@@ -46,6 +46,16 @@ pip install "local-deep-research[mcp]"

 > **Windows PDF Export:** PDF export requires Pango/Cairo system libraries. See the [WeasyPrint installation guide](https://doc.courtbouillon.org/weasyprint/stable/first_steps.html) for setup instructions.

+> **CJK characters in PDF exports:** WeasyPrint resolves glyphs through the host's installed fonts. If your research results contain Chinese, Japanese, or Korean characters and they disappear from the downloaded PDF, install a CJK font package:
+>
+> - **Debian/Ubuntu:** `sudo apt install fonts-noto-cjk && fc-cache -fv`
+> - **Fedora/RHEL:** `sudo dnf install google-noto-sans-cjk-fonts && fc-cache -fv`
+> - **Alpine:** `apk add font-noto-cjk`
+> - **macOS:** ships with PingFang / Hiragino — no install needed.
+> - **Windows:** ships with Microsoft YaHei / SimSun — no install needed.
+>
+> Docker users on the official image do not need to do anything; `fonts-noto-cjk` is bundled.
+
 ## Development from Source

 For contributing or running from the latest code, see the [Development Guide](developing.md).
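The note above only matters when a report actually contains Chinese, Japanese, or Korean characters. A rough sketch of how such content could be detected ahead of an export, for example to show a hint about fonts. The helper `contains_cjk` is hypothetical and not part of LDR, and the code point ranges are a deliberately non-exhaustive selection of Unicode blocks (CJK punctuation and the rarer extension blocks are left out).

```python
# Inclusive (start, end) code point ranges; a non-exhaustive selection.
_CJK_RANGES = (
    (0x1100, 0x11FF),    # Hangul Jamo
    (0x3040, 0x30FF),    # Hiragana and Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def contains_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the ranges above."""
    return any(
        start <= ord(ch) <= end
        for ch in text
        for start, end in _CJK_RANGES
    )
```

With this in place, `contains_cjk("# 中文标题")` is true while `contains_cjk("Hello, world")` is false, so a caller could skip any font warning for Latin-only reports.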
@@ -99,6 +99,12 @@ class PDFService:

     def __init__(self):
         """Initialize PDF service with minimal CSS for readability."""
+        # CJK families are listed as fallbacks so WeasyPrint substitutes a
+        # glyph-bearing font when the primary stack lacks coverage. Without
+        # this, Chinese/Japanese/Korean text disappears silently from the
+        # PDF even though it renders fine in the HTML view (issue #4055).
+        # Glyphs still require the corresponding system font (e.g.
+        # fonts-noto-cjk) to actually be installed.
         self.minimal_css = CSS(
             string="""
             @page {
@@ -107,7 +113,12 @@ class PDFService:
             }

             body {
-                font-family: Arial, sans-serif;
+                font-family: Arial, "Noto Sans CJK SC", "Noto Sans CJK TC",
+                    "Noto Sans CJK JP", "Noto Sans CJK KR", "Noto Sans SC",
+                    "PingFang SC", "PingFang TC", "Hiragino Sans",
+                    "Hiragino Kaku Gothic ProN", "Apple SD Gothic Neo",
+                    "Microsoft YaHei", "Microsoft JhengHei",
+                    "Yu Gothic", "Malgun Gothic", "SimSun", sans-serif;
                 font-size: 10pt;
                 line-height: 1.4;
             }
@@ -135,14 +146,20 @@ class PDFService:
             h5 { font-size: 10pt; margin: 0.5em 0; font-weight: bold; }
             h6 { font-size: 10pt; margin: 0.5em 0; }

-            code {
-                font-family: monospace;
+            code, pre {
+                font-family: monospace, "Noto Sans Mono CJK SC",
+                    "Noto Sans Mono CJK TC", "Noto Sans Mono CJK JP",
+                    "Noto Sans Mono CJK KR", "Noto Sans CJK SC",
+                    "PingFang SC", "Hiragino Sans", "Apple SD Gothic Neo",
+                    "Microsoft YaHei", "SimSun";
                 background-color: #f5f5f5;
+            }
+
+            code {
                 padding: 1px 3px;
             }

             pre {
-                background-color: #f5f5f5;
                 padding: 8px;
                 overflow-x: auto;
             }
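The stylesheet change above hand-writes two long `font-family` lists. As an illustration only, a list like that could also be generated from Python data, which makes it easier to share families between the body and monospace stacks. The helper `font_family_declaration` and the `GENERIC_FAMILIES` set are hypothetical names, not part of `PDFService`. This sketch quotes every named family and moves generic keywords to the end, which is a common convention and an assumption of the sketch rather than what the commit does.

```python
# CSS generic family keywords must stay unquoted to keep their meaning.
GENERIC_FAMILIES = {
    "serif", "sans-serif", "monospace", "cursive", "fantasy", "system-ui",
}


def font_family_declaration(families):
    """Build a CSS font-family declaration from an ordered list of names.

    Named families are double-quoted, duplicates are dropped
    (case-insensitively), and generic keywords are emitted last.
    """
    seen = set()
    named = []
    generic = []
    for name in families:
        if '"' in name or "\\" in name:
            raise ValueError(f"unsupported character in family name: {name!r}")
        key = name.strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        if key in GENERIC_FAMILIES:
            generic.append(key)
        else:
            named.append(f'"{name.strip()}"')
    return "font-family: " + ", ".join(named + generic) + ";"


if __name__ == "__main__":
    print(font_family_declaration(
        ["Arial", "Noto Sans CJK SC", "PingFang SC", "SimSun", "sans-serif"]
    ))
```

The resulting string could be interpolated into the CSS passed to WeasyPrint. The fallback order still has to be chosen by hand, since the first installed family that has the glyph is the one used.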
@@ -3,10 +3,33 @@ Comprehensive tests for PDFService.
 Tests PDF generation, markdown conversion, CSS handling, and metadata.
 """

+import io
+import subprocess
+
 import pytest
 from unittest.mock import patch


+def _host_has_cjk_fonts() -> bool:
+    """True if fontconfig reports any CJK-capable family on the host.
+
+    Used to gate the strong glyph-embedding assertion in
+    test_handles_cjk_content_embeds_glyphs: Docker CI (where this PR
+    installs fonts-noto-cjk) and properly-configured dev hosts run the
+    assertion; bare pip/macOS/Windows installs without Noto CJK skip it.
+    """
+    try:
+        result = subprocess.run(
+            ["fc-list", ":lang=zh"],
+            capture_output=True,
+            text=True,
+            timeout=5,
+        )
+    except (FileNotFoundError, subprocess.TimeoutExpired):
+        return False
+    return bool(result.stdout.strip())
+
+
 class TestPDFServiceInit:
     """Tests for PDFService initialization."""
@@ -219,6 +242,72 @@ class TestPDFServiceMarkdownToPdf:

         assert result.startswith(b"%PDF")

+    def test_handles_cjk_content(self, service):
+        """Regression guard for #4055 — CJK text must not crash export.
+
+        Glyph visibility depends on the host having CJK fonts installed
+        (e.g. fonts-noto-cjk). This test pins the no-crash contract; the
+        glyph-embedding contract is checked separately in
+        test_handles_cjk_content_embeds_glyphs.
+        """
+        cjk_markdown = (
+            "# 中文标题\n\n这是一个测试。\n\n"
+            "## 日本語の見出し\n\nテスト本文。\n\n"
+            "## 한국어 제목\n\n본문 테스트.\n"
+        )
+
+        result = service.markdown_to_pdf(cjk_markdown)
+
+        assert result.startswith(b"%PDF")
+        assert len(result) > 1000
+
+    @pytest.mark.skipif(
+        not _host_has_cjk_fonts(),
+        reason="no CJK fonts on host; install fonts-noto-cjk to run",
+    )
+    def test_handles_cjk_content_embeds_glyphs(self, service):
+        """End-to-end glyph check for #4055.
+
+        Round-trips CJK text through markdown → PDF → text extraction.
+        If no CJK-capable font is available to WeasyPrint, glyphs are
+        dropped and pypdf's extract_text returns the rendered text
+        without them.
+
+        What this catches: removal of fonts-noto-cjk from the runtime
+        Docker image (the load-bearing half of the fix) — pytest-tests
+        runs inside that image, so the test fails when fonts are gone.
+
+        What this does NOT catch: removal of the CSS CJK fallback list.
+        On Linux, fontconfig auto-substitutes a glyph-bearing family
+        even for a Latin-only CSS stack, so the assertion still passes.
+        The CSS fallbacks matter on Windows/macOS, which CI doesn't run.
+        """
+        import pypdf
+
+        cjk_markdown = (
+            "# 中文标题\n\n这是一个测试。\n\n"
+            "## 日本語の見出し\n\nテスト本文。\n\n"
+            "## 한국어 제목\n\n본문 테스트.\n"
+        )
+
+        pdf_bytes = service.markdown_to_pdf(cjk_markdown)
+        reader = pypdf.PdfReader(io.BytesIO(pdf_bytes))
+        extracted = "".join(page.extract_text() for page in reader.pages)
+
+        for phrase in (
+            "中文标题",
+            "这是一个测试",
+            "日本語",
+            "テスト",
+            "한국어",
+            "본문",
+        ):
+            assert phrase in extracted, (
+                f"CJK glyphs missing from PDF: {phrase!r} not found in "
+                f"extracted text — WeasyPrint could not match a glyph "
+                f"(check that fonts-noto-cjk is installed on the host)"
+            )
+
     def test_logs_pdf_size(self, service, simple_markdown):
         """Test that PDF size is logged."""
         with patch(
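The glyph test above checks each phrase with a plain substring match on the extracted text. A possible variant, sketched here under the assumption that text extraction may insert spaces or line breaks between runs of glyphs, compares the strings with all whitespace removed and reports every missing phrase at once. The helper `missing_phrases` is hypothetical and not part of the test suite in this commit.

```python
def missing_phrases(extracted, phrases):
    """Return the phrases that do not occur in the extracted text.

    Whitespace is removed from both sides before comparing, so a phrase
    split across lines by the extractor still counts as present.
    """
    compact = "".join(extracted.split())
    return [
        phrase
        for phrase in phrases
        if "".join(phrase.split()) not in compact
    ]


if __name__ == "__main__":
    sample = "中文标题\n这是一个\n测试。\n日本語の見出し"
    print(missing_phrases(sample, ["中文标题", "这是一个测试", "한국어"]))
```

Collecting all failures into one list would let a single assertion message name every script that lost its glyphs, instead of stopping at the first missing phrase.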