W0 — security hardening (5 fixes verified live on disclosure.top)
- middleware: gate /api/admin/* same as /admin/* (F1)
- imgproxy: tighten LOCAL_FILESYSTEM_ROOT from / to /var/lib/storage (F2)
- studio: real basic-auth label (bcrypt hash, middleware reference) (F3)
- relations: ENABLE ROW LEVEL SECURITY + public SELECT policy (F4)
- migration 0003: fold is_searchable + hybrid_search update into canonical (TD#2)
W1 — observability + resilience + autocomplete
- studio: HOSTNAME=0.0.0.0 so Next.js binds on all interfaces (loopback included) for healthcheck
- compose: PG_POOL_MAX=20, CLAUDE_CODE_OAUTH_TOKEN gated by separate env
- claude-code.ts: subprocess timeout configurable (CLAUDE_CODE_TIMEOUT_MS)
- openrouter.ts: retry with exponential backoff + Retry-After + in-memory
  circuit breaker (promotes FALLBACK after CB_THRESHOLD failures)
- lib/logger.ts: pino logger (NDJSON prod / pretty dev) + withRequest helper
- middleware: mints correlation_id, stamps x-correlation-id response header,
  emits structured http_request log per /api/* call
- messages/route.ts: switch to structured logger
- 60_meili_index.py: push documents + chunks into Meilisearch
- /api/search/autocomplete: parallel meili search (docs + chunks), 5-8ms p50
- search-autocomplete.tsx: debounced dropdown wired into search-panel
W1.2 — Glitchtip + Forgejo self-hosted
- compose: glitchtip-redis + glitchtip-web + glitchtip-worker (v4.2)
- compose: forgejo + forgejo-runner (server v9, runner v6) with group_add=988
- @sentry/nextjs SDK wired (instrumentation.ts + sentry.{client,server}.config.ts)
- /api/admin/throw smoke endpoint (gated by W0-F1 middleware)
- Synthetic event ingestion verified at glitchtip.disclosure.top
- forgejo.disclosure.top up, repo discadmin/disclosure-bureau created,
  runner registered (labels: ubuntu-latest, docker)
- .forgejo/workflows/ci.yml: typecheck + lint + build + npm audit + python
  syntax + compose validation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
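
The openrouter.ts item above combines three ideas: exponential backoff, honoring a Retry-After hint, and a circuit breaker that switches to a fallback after repeated failures. The following is a minimal Python sketch of that combination, written in Python only to match the script in this commit. It is not the code in openrouter.ts: `RetryableError`, `CircuitBreaker`, `call_with_retry`, and all default values are hypothetical names and numbers chosen for illustration, and the breaker here simply counts consecutive failed attempts.

```python
import time
from typing import Callable, Optional


class RetryableError(Exception):
    """Hypothetical error carrying an optional Retry-After hint in seconds."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after


class CircuitBreaker:
    """Opens after `threshold` consecutive failures (assumed semantics)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


def call_with_retry(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    breaker: CircuitBreaker,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    # An open breaker skips the primary entirely.
    if breaker.is_open:
        return fallback()
    for attempt in range(max_attempts):
        try:
            result = primary()
        except RetryableError as exc:
            breaker.record_failure()
            if breaker.is_open or attempt == max_attempts - 1:
                return fallback()
            # Prefer the server's hint; otherwise back off exponentially.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
        else:
            breaker.record_success()
            return result
    return fallback()
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production version would likely also add jitter and a cool-down that closes the breaker again.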
203 lines
8.3 KiB
Python
#!/usr/bin/env python3
"""
40_reading_version.py — Generate a clean, webdesigner-grade reading version of a
scanned document from its already-extracted chunks.

The scanned PDFs are messy: duplicate transcriptions (typed + handwritten),
OCR noise, repeated headers/banners, two classification variants of the same
narrative. This pass uses an LLM to produce ONE clean, deduplicated, well-
structured bilingual Markdown document for reading — faithful to the content
(invents nothing) but merging duplicate versions and dropping page furniture.

Output: raw/<doc>--subagent/reading.md (frontmatter + EN + PT-BR sections)

The raw chunks and per-page scan view stay untouched ("ver scan original").

Run:
    python3 scripts/synthesize/40_reading_version.py <doc-id>
"""
from __future__ import annotations
import os
import subprocess
import sys
from pathlib import Path

UFO = Path("/Users/guto/ufo")
RAW = UFO / "raw"
BUILD_DOC = UFO / "scripts" / "reextract" / "build_doc_text.py"

PROMPT = """You are a meticulous archivist-typographer for The Disclosure Bureau, an
investigative wiki of declassified UAP/UFO documents. You receive the raw
machine-extracted text of ONE scanned document (chunk by chunk, with page
markers). The scan is messy: it often contains the SAME content twice (e.g. a
typed transcript followed by a handwritten re-transcription, or a SECRET//NOFORN
narrative immediately followed by a near-identical SECRET//REL version), plus
OCR noise, repeated letterheads, classification banners, page numbers and
routing stamps.

Produce ONE clean, faithful, beautifully structured reading version in Markdown.

RULES (non-negotiable):
1. FAITHFUL — never invent facts, names, dates, codes, or quotes. Use only what
   is in the text. If something is redacted/illegible, keep a marker like
   [redacted] / [ilegível].
2. DEDUPLICATE — when the same content appears more than once (typed vs
   handwritten, NOFORN vs REL), MERGE into a single best version. Prefer the
   most complete/legible wording. Never drop unique details that appear in only
   one version (e.g. a line spoken by a different person).
3. DROP PAGE FURNITURE — remove repeated letterheads, classification banners,
   bare page numbers, routing stamps, "DISPATCHED" stamps, distribution lists,
   and OCR garbage. Keep ONE classification line at the top if present.
4. STRUCTURE — use clear Markdown: a top H1 title, short intro line, logical
   headings (## sections), and for transcripts use a clean dialogue format
   (**SPEAKER:** line). Preserve chronological/communication order.
5. BILINGUAL — output BOTH languages. First the full English reading version
   under "## Reading (EN)", then the full Brazilian-Portuguese version under
   "## Leitura (PT-BR)". PT-BR must be natural Brazilian Portuguese with correct
   accents.
6. PRESERVE INVESTIGATIVE SUBSTANCE — every sighting detail, coordinate, time,
   witness name, object description, and quote that matters to an investigation
   must survive the cleanup.

Return ONLY the Markdown body (no code fence, no preamble). Start directly with
the H1 title line.

DOCUMENT (doc_id: {doc_id}) — raw extracted chunks follow:

{doc_text}
"""

DISALLOWED = (
    "AskUserQuestion,Bash,Edit,Write,Read,Task,Glob,Grep,"
    "TaskCreate,TaskUpdate,TaskList,TaskGet,TaskStop,TaskOutput,"
    "Skill,ScheduleWakeup,Monitor,WebSearch,WebFetch,NotebookEdit,"
    "EnterPlanMode,ExitPlanMode,EnterWorktree,ExitWorktree,"
    "CronCreate,CronDelete,CronList,RemoteTrigger,ToolSearch,"
    "PushNotification,ListMcpResourcesTool,ReadMcpResourceTool,"
    "ShareOnboardingGuide"
)


def build_doc_text(doc_id: str) -> str:
    """Run the build_doc_text.py helper for doc_id and return its stdout."""
    r = subprocess.run(["python3", str(BUILD_DOC), doc_id],
                       capture_output=True, text=True, encoding="utf-8")
    if r.returncode != 0:
        sys.exit(f"build_doc_text failed: {r.stderr}")
    return r.stdout


def call_llm(prompt: str) -> str:
    """Pipe the prompt to the `claude` CLI and return its text output."""
    import tempfile
    env = {**os.environ, "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32000"}
    # Reserve a temp file path; the CLI's stdout is written there, then read back.
    with tempfile.NamedTemporaryFile(mode="w+", suffix=".txt", delete=False, encoding="utf-8") as t:
        tmp = t.name
    try:
        with open(tmp, "wb") as out:
            r = subprocess.run(
                ["claude", "-p", "--model", "sonnet", "--output-format", "text",
                 "--disallowed-tools", DISALLOWED],
                input=prompt.encode("utf-8"), stdout=out, stderr=subprocess.PIPE, env=env,
                timeout=1200,
            )
        if r.returncode != 0:
            sys.exit(f"claude failed rc={r.returncode}: {r.stderr.decode('utf-8','replace')[:500]}")
        return Path(tmp).read_text(encoding="utf-8")
    finally:
        # Always remove the temp file, including on failure or timeout.
        try:
            os.unlink(tmp)
        except OSError:
            pass


# Above this size, the reading version won't fit one Sonnet call (32k-token
# output ceiling + timeout), so we segment by page blocks and concatenate.
SEGMENT_THRESHOLD = 90_000
SEGMENT_CHARS = 45_000

PROMPT_SEGMENT = """You are a meticulous archivist-typographer for The Disclosure Bureau. This is
PART {n} OF {m} of a large scanned UAP/UFO document — you receive the raw
machine-extracted text of THIS part only (chunk by chunk). The scan is messy:
duplicate transcriptions, OCR noise, repeated letterheads, classification
banners, page numbers, routing stamps.

Produce a clean, faithful, well-structured reading version of THIS PART in
Markdown.

RULES:
1. FAITHFUL — never invent. Keep [redacted]/[ilegível] markers.
2. DEDUPLICATE within this part — merge repeated content, keep unique details.
3. DROP page furniture (letterheads, banners, page numbers, routing stamps, OCR
   garbage).
4. STRUCTURE with clear Markdown headings (##/###) and clean dialogue
   (**SPEAKER:**) for transcripts. Do NOT write a document-level H1 title (the
   document already has one); start at "## Part {n}" then sub-sections.
5. BILINGUAL — for THIS part output English first under "### English", then
   Brazilian Portuguese under "### Português". Natural pt-br with correct accents.
6. PRESERVE every investigative detail (sightings, coords, times, witnesses,
   object descriptions, quotes).

Return ONLY the Markdown for this part (no code fence, no preamble). Start with
"## Part {n}".

DOCUMENT (doc_id: {doc_id}) — PART {n} OF {m}, raw chunks follow:

{doc_text}
"""


def segment_text(text: str) -> list[str]:
    """Split doc text into blocks at [chunk ...] markers near SEGMENT_CHARS."""
    import re as _re
    if len(text) <= SEGMENT_CHARS:
        return [text]
    starts = [m.start() for m in _re.finditer(r"^\[chunk c\d+", text, _re.MULTILINE)]
    if not starts:
        return [text]
    segs: list[str] = []
    s = 0
    while s < len(text):
        cap = s + SEGMENT_CHARS
        if cap >= len(text):
            segs.append(text[s:])
            break
        cands = [p for p in starts if s < p < cap]
        e = cands[-1] if cands else cap
        segs.append(text[s:e])
        s = e
    return segs


def main() -> int:
    if len(sys.argv) < 2:
        sys.exit("usage: 40_reading_version.py <doc-id>")
    doc_id = sys.argv[1]
    out_path = RAW / f"{doc_id}--subagent" / "reading.md"

    print(f"[1/3] building doc text for {doc_id} ...")
    doc_text = build_doc_text(doc_id)
    print(f" {len(doc_text)} chars (~{len(doc_text)//4} tokens)")

    print("[2/3] generating reading version (Sonnet) ...")
    if len(doc_text) > SEGMENT_THRESHOLD:
        segs = segment_text(doc_text)
        print(f" large doc → {len(segs)} segments")
        parts: list[str] = []
        for i, seg in enumerate(segs, 1):
            print(f" segment {i}/{len(segs)} ({len(seg)} chars) ...")
            p = call_llm(PROMPT_SEGMENT.format(n=i, m=len(segs), doc_id=doc_id, doc_text=seg)).strip()
            if p.startswith("```"):
                p = "\n".join(line for line in p.splitlines() if not line.startswith("```")).strip()
            parts.append(p)
        md = "\n\n---\n\n".join(parts)
    else:
        md = call_llm(PROMPT.format(doc_id=doc_id, doc_text=doc_text)).strip()
        if md.startswith("```"):
            md = "\n".join(line for line in md.splitlines() if not line.startswith("```")).strip()

    front = (
        f"---\nschema_version: \"0.1.0\"\ntype: reading\ndoc_id: {doc_id}\n"
        f"generator: sonnet-reading-v1\n---\n\n"
    )
    out_path.write_text(front + md + "\n", encoding="utf-8")
    print(f"[3/3] saved {out_path} ({len(md)} chars)")
    return 0


if __name__ == "__main__":
    sys.exit(main())
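
The segmentation step in segment_text can be checked in isolation on synthetic input. The sketch below reimplements the same splitting logic with the size limit as a parameter, so it runs without the rest of the script. The name `split_at_markers`, the sample text, and the exact marker layout (`[chunk c0001 p1]`) are assumptions for illustration; only the `[chunk c<digits>` prefix is taken from the regex in the script.

```python
import re


def split_at_markers(text: str, max_chars: int) -> list[str]:
    """Split text into pieces of at most max_chars, cutting at chunk markers."""
    if len(text) <= max_chars:
        return [text]
    # Offsets of every line that begins a chunk, e.g. "[chunk c0001 ...".
    starts = [m.start() for m in re.finditer(r"^\[chunk c\d+", text, re.MULTILINE)]
    if not starts:
        return [text]
    segments: list[str] = []
    pos = 0
    while pos < len(text):
        cap = pos + max_chars
        if cap >= len(text):
            segments.append(text[pos:])
            break
        # Cut at the last marker inside the window; hard-cut if there is none.
        candidates = [p for p in starts if pos < p < cap]
        end = candidates[-1] if candidates else cap
        segments.append(text[pos:end])
        pos = end
    return segments


# Six synthetic chunks of 58 characters each (hypothetical marker layout).
sample = "".join(f"[chunk c{i:04d} p{i}]\n" + "x" * 40 + "\n" for i in range(1, 7))
parts = split_at_markers(sample, max_chars=120)

print(len(parts))                 # 3
print("".join(parts) == sample)   # True: nothing is lost or duplicated
```

With a 120-character limit, each window holds two whole 58-character chunks, so the cut lands on a marker boundary every time. A single chunk longer than the limit would instead be hard-cut at `cap`, which is worth keeping in mind for very long pages.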