disclosure-bureau/scripts/synthesize/40_reading_version.py
Luiz Gustavo 55cac8a395
W0+W1+W1.2: security hardening, observability, autocomplete, glitchtip, forgejo CI
W0 — security hardening (5 fixes verified live on disclosure.top)
- middleware: gate /api/admin/* same as /admin/* (F1)
- imgproxy: tighten LOCAL_FILESYSTEM_ROOT from / to /var/lib/storage (F2)
- studio: real basic-auth label (bcrypt hash, middleware reference) (F3)
- relations: ENABLE ROW LEVEL SECURITY + public SELECT policy (F4)
- migration 0003: fold is_searchable + hybrid_search update into canonical (TD#2)

W1 — observability + resilience + autocomplete
- studio: HOSTNAME=0.0.0.0 so Next.js binds on all interfaces (loopback included) for healthcheck
- compose: PG_POOL_MAX=20, CLAUDE_CODE_OAUTH_TOKEN gated by separate env
- claude-code.ts: subprocess timeout configurable (CLAUDE_CODE_TIMEOUT_MS)
- openrouter.ts: retry with exponential backoff + Retry-After + in-memory
  circuit breaker (promotes FALLBACK after CB_THRESHOLD failures)
- lib/logger.ts: pino logger (NDJSON prod / pretty dev) + withRequest helper
- middleware: mints correlation_id, stamps x-correlation-id response header,
  emits structured http_request log per /api/* call
- messages/route.ts: switch to structured logger
- 60_meili_index.py: push documents + chunks into Meilisearch
- /api/search/autocomplete: parallel meili search (docs + chunks), 5-8ms p50
- search-autocomplete.tsx: debounced dropdown wired into search-panel

W1.2 — Glitchtip + Forgejo self-hosted
- compose: glitchtip-redis + glitchtip-web + glitchtip-worker (v4.2)
- compose: forgejo + forgejo-runner (server v9, runner v6) with group_add=988
- @sentry/nextjs SDK wired (instrumentation.ts + sentry.{client,server}.config.ts)
- /api/admin/throw smoke endpoint (gated by W0-F1 middleware)
- Synthetic event ingestion verified at glitchtip.disclosure.top
- forgejo.disclosure.top up, repo discadmin/disclosure-bureau created,
  runner registered (labels: ubuntu-latest, docker)
- .forgejo/workflows/ci.yml: typecheck + lint + build + npm audit + python
  syntax + compose validation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:18:42 -03:00

203 lines
8.3 KiB
Python

#!/usr/bin/env python3
"""
40_reading_version.py: generate a clean, publication-quality reading version
of a scanned document from its already-extracted chunks.

The scanned PDFs are messy: duplicate transcriptions (typed + handwritten),
OCR noise, repeated headers/banners, two classification variants of the same
narrative. This pass uses an LLM to produce ONE clean, deduplicated,
well-structured bilingual Markdown document for reading. It stays faithful to
the content (invents nothing) while merging duplicate versions and dropping
page furniture.

Output: raw/<doc>--subagent/reading.md (frontmatter + EN + PT-BR sections)

The raw chunks and per-page scan view stay untouched ("ver scan original").

Run:
    python3 scripts/synthesize/40_reading_version.py <doc-id>
"""
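from __future__ import annotations

import os
import subprocess
import sys
from pathlib import Path

# Project root, hardcoded to an absolute path on the author's machine.
UFO = Path("/Users/guto/ufo")
RAW = UFO / "raw"
# Helper script that prints a document's extracted chunks as text.
BUILD_DOC = UFO / "scripts" / "reextract" / "build_doc_text.py"
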
PROMPT = """You are a meticulous archivist-typographer for The Disclosure Bureau, an
investigative wiki of declassified UAP/UFO documents. You receive the raw
machine-extracted text of ONE scanned document (chunk by chunk, with page
markers). The scan is messy: it often contains the SAME content twice (e.g. a
typed transcript followed by a handwritten re-transcription, or a SECRET//NOFORN
narrative immediately followed by a near-identical SECRET//REL version), plus
OCR noise, repeated letterheads, classification banners, page numbers and
routing stamps.
Produce ONE clean, faithful, beautifully structured reading version in Markdown.
RULES (non-negotiable):
1. FAITHFUL — never invent facts, names, dates, codes, or quotes. Use only what
is in the text. If something is redacted/illegible, keep a marker like
[redacted] / [ilegível].
2. DEDUPLICATE — when the same content appears more than once (typed vs
handwritten, NOFORN vs REL), MERGE into a single best version. Prefer the
most complete/legible wording. Never drop unique details that appear in only
one version (e.g. a line spoken by a different person).
3. DROP PAGE FURNITURE — remove repeated letterheads, classification banners,
bare page numbers, routing stamps, "DISPATCHED" stamps, distribution lists,
and OCR garbage. Keep ONE classification line at the top if present.
4. STRUCTURE — use clear Markdown: a top H1 title, short intro line, logical
headings (## sections), and for transcripts use a clean dialogue format
(**SPEAKER:** line). Preserve chronological/communication order.
5. BILINGUAL — output BOTH languages. First the full English reading version
under "## Reading (EN)", then the full Brazilian-Portuguese version under
"## Leitura (PT-BR)". PT-BR must be natural Brazilian Portuguese with correct
accents.
6. PRESERVE INVESTIGATIVE SUBSTANCE — every sighting detail, coordinate, time,
witness name, object description, and quote that matters to an investigation
must survive the cleanup.
Return ONLY the Markdown body (no code fence, no preamble). Start directly with
the H1 title line.
DOCUMENT (doc_id: {doc_id}) — raw extracted chunks follow:
{doc_text}
"""

DISALLOWED = (
    "AskUserQuestion,Bash,Edit,Write,Read,Task,Glob,Grep,"
    "TaskCreate,TaskUpdate,TaskList,TaskGet,TaskStop,TaskOutput,"
    "Skill,ScheduleWakeup,Monitor,WebSearch,WebFetch,NotebookEdit,"
    "EnterPlanMode,ExitPlanMode,EnterWorktree,ExitWorktree,"
    "CronCreate,CronDelete,CronList,RemoteTrigger,ToolSearch,"
    "PushNotification,ListMcpResourcesTool,ReadMcpResourceTool,"
    "ShareOnboardingGuide"
)


def build_doc_text(doc_id: str) -> str:
    """Run build_doc_text.py for doc_id and return what it prints on stdout."""
    r = subprocess.run(["python3", str(BUILD_DOC), doc_id],
                       capture_output=True, text=True, encoding="utf-8")
    if r.returncode != 0:
        sys.exit(f"build_doc_text failed: {r.stderr}")
    return r.stdout


def call_llm(prompt: str) -> str:
    """Pipe prompt to the claude CLI on stdin and return its text output."""
    import tempfile
    env = {**os.environ, "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32000"}
    # Reserve a temp file path; the CLI's stdout is written there, not to a pipe.
    with tempfile.NamedTemporaryFile(mode="w+", suffix=".txt", delete=False, encoding="utf-8") as t:
        tmp = t.name
    try:
        with open(tmp, "wb") as out:
            r = subprocess.run(
                ["claude", "-p", "--model", "sonnet", "--output-format", "text",
                 "--disallowed-tools", DISALLOWED],
                input=prompt.encode("utf-8"), stdout=out, stderr=subprocess.PIPE, env=env,
                timeout=1200,
            )
        if r.returncode != 0:
            sys.exit(f"claude failed rc={r.returncode}: {r.stderr.decode('utf-8','replace')[:500]}")
        return Path(tmp).read_text(encoding="utf-8")
    finally:
        # Always remove the temp file, including on the sys.exit path above.
        try:
            os.unlink(tmp)
        except OSError:
            pass


# Above this size, the reading version won't fit one Sonnet call (32k-token
# output ceiling + timeout), so we segment by page blocks and concatenate.
SEGMENT_THRESHOLD = 90_000
SEGMENT_CHARS = 45_000
PROMPT_SEGMENT = """You are a meticulous archivist-typographer for The Disclosure Bureau. This is
PART {n} OF {m} of a large scanned UAP/UFO document — you receive the raw
machine-extracted text of THIS part only (chunk by chunk). The scan is messy:
duplicate transcriptions, OCR noise, repeated letterheads, classification
banners, page numbers, routing stamps.
Produce a clean, faithful, well-structured reading version of THIS PART in
Markdown.
RULES:
1. FAITHFUL — never invent. Keep [redacted]/[ilegível] markers.
2. DEDUPLICATE within this part — merge repeated content, keep unique details.
3. DROP page furniture (letterheads, banners, page numbers, routing stamps, OCR
garbage).
4. STRUCTURE with clear Markdown headings (##/###) and clean dialogue
(**SPEAKER:**) for transcripts. Do NOT write a document-level H1 title (the
document already has one); start at "## Part {n}" then sub-sections.
5. BILINGUAL — for THIS part output English first under "### English", then
Brazilian Portuguese under "### Português". Natural pt-br with correct accents.
6. PRESERVE every investigative detail (sightings, coords, times, witnesses,
object descriptions, quotes).
Return ONLY the Markdown for this part (no code fence, no preamble). Start with
"## Part {n}".
DOCUMENT (doc_id: {doc_id}) — PART {n} OF {m}, raw chunks follow:
{doc_text}
"""


def segment_text(text: str) -> list[str]:
    """Split doc text into blocks at [chunk ...] markers near SEGMENT_CHARS."""
    import re as _re
    if len(text) <= SEGMENT_CHARS:
        return [text]
    # Offsets of every line that opens a chunk, e.g. "[chunk c0001".
    starts = [m.start() for m in _re.finditer(r"^\[chunk c\d+", text, _re.MULTILINE)]
    if not starts:
        return [text]
    segs: list[str] = []
    s = 0
    while s < len(text):
        cap = s + SEGMENT_CHARS
        if cap >= len(text):
            segs.append(text[s:])
            break
        # Cut at the last chunk marker inside the window; hard-cut at cap if none.
        cands = [p for p in starts if s < p < cap]
        e = cands[-1] if cands else cap
        segs.append(text[s:e])
        s = e
    return segs


def main() -> int:
    if len(sys.argv) < 2:
        sys.exit("usage: 40_reading_version.py <doc-id>")
    doc_id = sys.argv[1]
    out_path = RAW / f"{doc_id}--subagent" / "reading.md"

    print(f"[1/3] building doc text for {doc_id} ...")
    doc_text = build_doc_text(doc_id)
    print(f"      {len(doc_text)} chars (~{len(doc_text)//4} tokens)")

    print("[2/3] generating reading version (Sonnet) ...")
    if len(doc_text) > SEGMENT_THRESHOLD:
        segs = segment_text(doc_text)
        print(f"      large doc → {len(segs)} segments")
        parts: list[str] = []
        for i, seg in enumerate(segs, 1):
            print(f"      segment {i}/{len(segs)} ({len(seg)} chars) ...")
            p = call_llm(PROMPT_SEGMENT.format(n=i, m=len(segs), doc_id=doc_id, doc_text=seg)).strip()
            # Strip code-fence lines if the model wrapped its answer in one.
            if p.startswith("```"):
                p = "\n".join(line for line in p.splitlines() if not line.startswith("```")).strip()
            parts.append(p)
        md = "\n\n---\n\n".join(parts)
    else:
        md = call_llm(PROMPT.format(doc_id=doc_id, doc_text=doc_text)).strip()
        if md.startswith("```"):
            md = "\n".join(line for line in md.splitlines() if not line.startswith("```")).strip()

    front = (
        f"---\nschema_version: \"0.1.0\"\ntype: reading\ndoc_id: {doc_id}\n"
        f"generator: sonnet-reading-v1\n---\n\n"
    )
    out_path.write_text(front + md + "\n", encoding="utf-8")
    print(f"[3/3] saved {out_path} ({len(md)} chars)")
    return 0


if __name__ == "__main__":
    sys.exit(main())
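
The segmentation step in segment_text is easier to follow on a toy input. Below is a minimal standalone sketch of the same windowing logic. It assumes the chunk markers look like "[chunk c0001]" at the start of a line, which is inferred only from the regex in the script. The name split_at_chunk_markers, the max_chars parameter, and the sample text are hypothetical and exist only for this illustration.

```python
import re


def split_at_chunk_markers(text: str, max_chars: int) -> list[str]:
    """Cut text into windows of at most max_chars, preferring to cut at the
    last "[chunk cN" marker that falls inside each window."""
    if len(text) <= max_chars:
        return [text]
    # Offsets of every line that opens a chunk.
    starts = [m.start() for m in re.finditer(r"^\[chunk c\d+", text, re.MULTILINE)]
    if not starts:
        # No markers at all: the text comes back as one piece, however long.
        return [text]
    segments: list[str] = []
    pos = 0
    while pos < len(text):
        cap = pos + max_chars
        if cap >= len(text):
            segments.append(text[pos:])
            break
        # Markers strictly inside the current window.
        candidates = [p for p in starts if pos < p < cap]
        # Cut at the last marker, or hard-cut at the cap if the window has none.
        end = candidates[-1] if candidates else cap
        segments.append(text[pos:end])
        pos = end
    return segments


# Hypothetical input: four chunks of 20 characters each (80 in total).
sample = "".join(f"[chunk c{i:04d}]\n" + "x" * 5 + "\n" for i in range(4))
parts = split_at_chunk_markers(sample, max_chars=50)
for part in parts:
    print(len(part), repr(part[:13]))
```

With a 50-character window the cut lands on the marker at offset 40 instead of in the middle of the third chunk, so every segment begins at a chunk boundary and joining the segments reproduces the input exactly. Note that a text without any markers is returned as one piece regardless of its length.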