disclosure-bureau

Author	SHA1	Message	Date
Luiz Gustavo	f2b7b116ce	W5.3 (Phase 3A): entity summaries — sub-pages get magazine-grade prose Today /sightings, /witnesses, /objects, /locations and /operations show a name + mention count and nothing else. After this each row carries a 60-100 word bilingual narrative summary written from the chunks where the entity actually appears. Migration 0008 (apply as supabase_admin): public.entities +summary_en TEXT +summary_pt_br TEXT +summary_generated_at TIMESTAMPTZ +summary_model TEXT +summary_status TEXT CHECK ('pending'|'ai_generated'|'curated'|'refused') + index on summary_status + GRANT UPDATE (summary_) ON entities TO investigator + new policy entities_investigator_update_summary (RLS UPDATE for investigator role) Enrichment script (investigator-runtime/scripts/enrich_entity_summaries.ts): - Per-class config (chunk_k, min_mentions, max_per_class) - Path A: entity_mentions JOIN chunks (high-precision linker) - Path B (fallback): hybridSearch on canonical_name + aliases when entity_mentions returns zero. This is what unlocked Kenneth Arnold and similar entities — their wiki YAML has high total_mentions counted from frontmatter mentioned_in[], but the entity_mentions extractor was silent because the matches came from the wiki text, not the OCR chunks. - Sonnet 4.6 via OAuth Max, ~$0.04 per entity, ~$10 for the full 260-entity bulk run. - INSUFFICIENT skip when chunks can't sustain a 60-word summary — refused entries get summary_status='refused' so they're not retried. UI uplift: - lib/retrieval/entity-pages.ts: getEntityCore now prefers the DB summary (ai_generated or curated) over wiki YAML narrative. 
- components/entity-list-page.tsx: SELECT now pulls summary_en, summary_pt_br, summary_status * Sorted with summary-enriched rows first (so the magazine grid lands on quality content immediately) * MagazineGrid: 4-line summary preview replaces aliases line * CompactGrid: enriched rows render as full editorial cards, bare rows fall back to a compact table below Smoke results: - Kenneth Arnold sighting: "On June 24, 1947, pilot Kenneth Arnold reported sighting unidentified objects over the Pacific Northwest, and the account spread worldwide. It set off a run of similar reports: County Commissioner Crankes saw comparable objects after Arnold's account reached the press, and United Airlines pilot Emil H. Smith spotted flying discs on July 4 during a routine flight out of Boise, Idaho..." - Roswell Incident: includes Colonel Corso's 1997 book + the 1995 GAO finding that radio messages from Oct 46–Feb 47 were destroyed + Senator Strom Thurmond's foreword. Real magazine-grade content. Background bulk run kicked off across all 5 classes (event, uap_object, person, location, organization) — populating live as the homepage rebuilds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 15:37:01 -03:00
Luiz Gustavo	eaf282c535	W2: rerank opt-in, analyze_image_region tool, RAG eval, graph cleanup, ADRs - TD#8 hybrid.ts: rerank_strategy {always|when_top_k_gt|never} + threshold (default skips rerank for top_k ≤ 15; chat tool uses threshold 10) - O11 vision.ts + tools.ts: analyze_image_region tool — sharp-crops the bbox, claude CLI reads the temp PNG via Read tool, Sonnet vision answers - TD#12 /graph: SigmaGraph replaces ForceGraphCanvas; react-force-graph-2d uninstalled (-37 transitive deps); force-graph-canvas.tsx deleted - TD#27 messages/route.ts gatherContext slice sizes via CTX_* env vars - TD#22 tests/rag/: golden.yaml (15 queries) + run.py (Recall@k + MRR + negative-pass rate) + baseline.json + CI job in .forgejo/workflows/ci.yml - docs/adrs/: ADR-001..005 published from systems-atelier deliverables Verified live on disclosure.top: top_k=5 path skips rerank (6.7s embed-only, was 12-15s with rerank); rerank=always still available on demand. First RAG baseline: Recall@5 = 0.2083, MRR = 0.25, Negative pass = 1.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 19:20:09 -03:00
Luiz Gustavo	504b20fa5c	search: gate dense recall by cosine-distance threshold in the RPC Root-cause fix for "search returns garbage for absent terms". The hybrid RPC's dense branch always returned its k nearest vectors regardless of distance, so a query for a term not in the corpus (e.g. "varginha") surfaced unrelated chunks. The cross-encoder reranker would filter these but costs 18-62s on CPU — unusable for interactive search. Add max_dense_dist (default 0.40) to hybrid_search_chunks: dense neighbours beyond that cosine distance are dropped server-side. Calibrated from measured distances — strong semantic match ~0.12-0.20, no real match ~0.46-0.53. BM25 full-text still matches literal terms; the reranker becomes opt-in refinement. Verified live: varginha/abducao → 0, disco voador/roswell → relevant, all <1s. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 16:36:56 -03:00
Luiz Gustavo	4865f974b6	fix search: rerank-gate results so absent terms return nothing The hybrid_search RPC always returns up to recall_k dense neighbours, so a query for a term absent from the corpus (e.g. "varginha") returned its 12 nearest vectors — irrelevant chunks like PAGE_NUMBER "1". Two bugs: the reranker was skipped whenever results <= top_k, and there was no relevance floor. Now always run the cross-encoder reranker (BGE-reranker-v2-m3, normalized sigmoid) and drop hits below 0.02. Verified: "varginha" → 0 results; "roswell"/"tic tac"/"disco voador" → relevant hits on top (reranker cleanly separates 0.0001 garbage from 0.03-0.27 matches). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 14:46:49 -03:00
Luiz Gustavo	a7e9dce6d2	rebuild entity layer from Sonnet-vision reextract pipeline Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page extraction. Add synthesize scripts to regenerate wiki/entities from the 116 _reextract.json (30), aggregate missing page.md from chunks (31), and reprocess 805 pages the doc-rebuilder agent dropped on context overflow (32). Add maintain scripts 43-56 for chunk-page sync, dedup, generic-entity marking, and typed relation extraction. Web: wire relations API + entity-relations component; entity/timeline/doc pages consume the rebuilt layer. Note: raw/, processing/, wiki/ remain gitignored (bulk data managed separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on disk only. The 27 curated anchor events under wiki/entities/events/ are preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 12:20:24 -03:00
guto	291748df63	sanitize entities: single YAML source of truth, signal_strength badge The corpus had two parallel reverse-reference signals: the wiki/pages entities_extracted blocks (Haiku page-level) and public.entity_mentions (Sonnet chunk-level, ILIKE-matched). The entity page only consulted the DB, so it showed "0 menções" ("0 mentions") for thousands of entities that were anchored in pages or in cross-entity links the DB never indexed. Resolved by collapsing all signals into the YAML frontmatter, which is now the single runtime source for entity metadata. scripts/maintain/42_sync_entity_stats.py walks every entity and writes: mentioned_in: [...] # consolidated page refs total_mentions: max(db, pages) documents_count: max(db_docs, distinct page docs) signal_sources: db_chunks: int page_refs: int cross_refs: int signal_strength: strong | weak | orphan | unverified referenced_by: [[class/id]] # cross-entity backlinks Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the entity's own cross_refs so anchored-but-not-mentioned entities don't register as orphan. OBJ canonical names like "7m long, 1.3m high, two rocket motors, smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)" are rewritten to "Peyerl shot down UAP" derived from observed_in_event, preserving the full description as an alias. --fix-obj-names did this for every OBJ-* with >80 char canonical_name. Default behaviour is conservative: --archive-only-junk archives only single/double-char names and pure-numeric noise. Everything else stays on disk with signal_strength marked, so the user can review later. web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first. The /e/[cls]/[id] page now reads counts straight from YAML and renders a "força do sinal" ("signal strength") badge with the per-source breakdown. Orphan entities get a banner explaining they have no cross-references. 
DB is still queried for ONE thing: the chunk text for preview cards on the entity page, so we don't re-parse 21k markdown files on every render. First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink" ("fraca" = weak) in the live UI.	2026-05-18 19:49:31 -03:00
guto	19d0678e55	baseline: Disclosure Bureau pipeline + Next.js UI + Supabase stack	2026-05-17 22:44:36 -03:00

7 commits
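
Commit eaf282c535 describes a rerank_strategy setting with the values always, when_top_k_gt and never, plus a threshold (default 15, with the chat tool using 10). The decision it implies can be sketched as below. This is an illustrative Python sketch only: the real logic lives in hybrid.ts, and the function name should_rerank and its signature are assumptions, not the repository's API.

```python
def should_rerank(strategy: str, top_k: int, threshold: int = 15) -> bool:
    """Decide whether to run the cross-encoder reranker for a query.

    Hypothetical helper mirroring the three strategies named in the
    commit message; the default threshold of 15 is taken from it.
    """
    if strategy == "always":
        return True
    if strategy == "never":
        return False
    if strategy == "when_top_k_gt":
        # Rerank only when the caller asks for more results than the
        # threshold, so small interactive queries skip the slow path.
        return top_k > threshold
    raise ValueError(f"unknown rerank strategy: {strategy!r}")


if __name__ == "__main__":
    print(should_rerank("when_top_k_gt", top_k=5))                 # default threshold
    print(should_rerank("when_top_k_gt", top_k=12, threshold=10))  # chat-tool threshold
```

With the default threshold a top_k=5 query skips the reranker, which matches the "top_k=5 path skips rerank" observation in the commit message.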
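
Commit eaf282c535 also introduces a golden-set evaluation that reports Recall@k and MRR. The following is a minimal sketch of how those two metrics are conventionally computed, assuming each query comes with a ranked list of result ids and a set of relevant ids. The function names and the data layout are hypothetical and are not taken from tests/rag/run.py.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear among the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Average both metrics over all (ranked_results, relevant_ids) pairs."""
    runs = list(runs)
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    demo = [(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})]
    print(evaluate(demo, k=5))
```

Averaging per-query values keeps every golden query equally weighted, which is the usual convention for a small set such as the 15 queries mentioned in the commit.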
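
Commit 504b20fa5c gates dense recall by a cosine-distance threshold (max_dense_dist, default 0.40) inside the hybrid_search_chunks RPC. The actual change is server-side in the database; the sketch below only illustrates the same idea in Python, assuming cosine distance is defined as 1 minus cosine similarity. The helper names and the in-memory candidate list are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; zero-length vectors are treated as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def gate_dense(
    query_vec: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
    k: int,
    max_dense_dist: float = 0.40,
) -> List[Tuple[str, float]]:
    """Return up to k nearest candidates, dropping any beyond max_dense_dist."""
    scored = sorted(
        ((cosine_distance(query_vec, vec), chunk_id) for chunk_id, vec in candidates),
        key=lambda pair: pair[0],
    )
    return [(chunk_id, dist) for dist, chunk_id in scored[:k] if dist <= max_dense_dist]


if __name__ == "__main__":
    chunks = [("near", [1.0, 0.1]), ("far", [0.0, 1.0])]
    print(gate_dense([1.0, 0.0], chunks, k=12))
```

Without the gate, a query with no real match would still receive its k nearest vectors; with it, the dense branch can return an empty list and leave literal matching to BM25, as the commit message describes.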
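
Commit 4865f974b6 applies a relevance floor of 0.02 to sigmoid-normalized cross-encoder scores. A minimal sketch of that filtering step follows, assuming the reranker emits one raw logit per hit. The function names and the (chunk_id, logit) input shape are assumptions for illustration, and the example logits are made up rather than measured.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def apply_relevance_floor(
    hits: Sequence[Tuple[str, float]], floor: float = 0.02
) -> List[Tuple[str, float]]:
    """Normalize raw reranker logits and keep hits scoring at least `floor`.

    Results are returned best-first.
    """
    scored = [(chunk_id, sigmoid(logit)) for chunk_id, logit in hits]
    kept = [(chunk_id, score) for chunk_id, score in scored if score >= floor]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


if __name__ == "__main__":
    raw = [("page_number", -9.2), ("weak", -3.5), ("match", -1.0)]
    for chunk_id, score in apply_relevance_floor(raw):
        print(chunk_id, round(score, 4))
```

A fixed floor works here because, per the commit message, the reranker separates irrelevant hits (around 0.0001) from real matches (0.03 to 0.27) by a wide margin.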