disclosure-bureau

Luiz Gustavo e75ca5eda2 add clean LLM reading version of documents (the core goal)		2026-05-21 17:23:36 -03:00

Scanned docs are messy: duplicate transcriptions (typed + handwritten), two classification variants of the same narrative, OCR noise, repeated banners. The doc page showed raw chunks, so everything appeared twice.

40_reading_version.py generates ONE clean, deduplicated, well-structured bilingual Markdown reading version per doc (Sonnet): merges duplicate versions without losing unique lines, drops page furniture, formats transcripts as dialogue. Faithful: invents nothing; redactions kept as markers.

/d/[docId] now defaults to a "📖 leitura" tab rendering this clean version, with "🔍 trechos · scan original" preserving the faithful per-chunk + per-page scan view. reading.md lives in raw/<doc>--subagent/ alongside the chunks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
01_anchor_events.py	phase-0: kill stubs, ship 20 curated anchor events, configure SMTP	2026-05-18 00:44:17 -03:00
20_entity_summary.py	ship: synthesize 158 entities, AG-UI artifacts, chat persistence, auth flow	2026-05-18 03:52:59 -03:00
30_rebuild_wiki_from_reextract.py	rebuild entity layer from Sonnet-vision reextract pipeline	2026-05-21 12:20:24 -03:00
31_aggregate_pages_from_chunks.py	rebuild entity layer from Sonnet-vision reextract pipeline	2026-05-21 12:20:24 -03:00
32_reprocess_missing_pages.py	fix: keep _index.json total_pages in sync after recovering pages	2026-05-21 14:32:55 -03:00
40_reading_version.py	add clean LLM reading version of documents (the core goal)	2026-05-21 17:23:36 -03:00
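
The commit message for 40_reading_version.py describes merging duplicate transcriptions "without losing unique lines" and dropping page furniture. According to that message the script delegates this to an LLM (Sonnet), so the block below is not its implementation. It is only a minimal deterministic sketch of the same merge rule, assuming two line-based variants of one page. The names `normalize`, `drop_furniture`, and `merge_variants`, and the sample lines, are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical lines compare equal."""
    return re.sub(r"[^\w]+", " ", line.lower()).strip()


def drop_furniture(lines: list[str], furniture: set[str]) -> list[str]:
    """Remove blank lines and known page furniture (banners, page counters)."""
    furniture_keys = {normalize(f) for f in furniture}
    return [ln for ln in lines if normalize(ln) and normalize(ln) not in furniture_keys]


def merge_variants(a: list[str], b: list[str]) -> list[str]:
    """Merge two transcriptions in reading order.

    Lines that match after normalization are emitted once (variant `a` wins);
    lines present in only one variant are kept, so nothing unique is lost.
    """
    keys_a = [normalize(x) for x in a]
    keys_b = [normalize(x) for x in b]
    matcher = SequenceMatcher(a=keys_a, b=keys_b, autojunk=False)
    merged: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        else:
            # "delete", "insert", or "replace": keep whatever either side has here.
            merged.extend(a[i1:i2])
            merged.extend(b[j1:j2])
    return merged


# Hypothetical sample: a typed and a handwritten transcription of the same page.
furniture = {"TOP SECRET", "Page 1 of 3"}
typed = ["TOP SECRET", "Witness: I saw a light.", "Officer: What time?", "Page 1 of 3"]
handwritten = ["TOP SECRET", "witness: i saw a light", "Note in margin: checked", "Officer: What time?"]

merged = merge_variants(drop_furniture(typed, furniture), drop_furniture(handwritten, furniture))
print("\n".join(merged))
```

Aligning the variants with `difflib.SequenceMatcher` keeps a line found in only one variant (here the margin note) at its original position instead of appending it at the end. Exact matching after normalization will miss OCR-level differences, which is presumably one reason the real pipeline relies on a model for this step.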