Today /sightings, /witnesses, /objects, /locations and /operations show
a name + mention count and nothing else. After this change, each row
carries a 60-100 word bilingual (EN / PT-BR) narrative summary written
from the chunks where the entity actually appears.

Migration 0008 (apply as supabase_admin):
  public.entities
    + summary_en            TEXT
    + summary_pt_br         TEXT
    + summary_generated_at  TIMESTAMPTZ
    + summary_model         TEXT
    + summary_status        TEXT
        CHECK ('pending'|'ai_generated'|'curated'|'refused')
  + index on summary_status
  + GRANT UPDATE (summary_*) ON entities TO investigator
  + new policy entities_investigator_update_summary (RLS UPDATE for
    investigator role)

Enrichment script (investigator-runtime/scripts/enrich_entity_summaries.ts):
  - Per-class config (chunk_k, min_mentions, max_per_class)
  - Path A: entity_mentions JOIN chunks (high-precision linker)
  - Path B (fallback): hybridSearch on canonical_name + aliases when
    entity_mentions returns zero rows. This is what unlocked Kenneth
    Arnold and similar entities: their wiki YAML has high
    total_mentions counted from frontmatter mentioned_in[], but the
    entity_mentions extractor was silent because the matches came
    from the wiki text, not the OCR chunks.
  - Sonnet 4.6 via OAuth Max, ~$0.04 per entity, ~$10 for the full
    260-entity bulk run.
  - INSUFFICIENT skip when the chunks can't sustain a 60-word summary.
    Refused entries get summary_status='refused' so they're not
    retried.
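
The Path A / Path B fallback and the refusal handling above can be
sketched roughly as follows. This is a Python illustration only (the
real script is TypeScript); gather_chunks, resolve_status, and the two
injected lookup callables are hypothetical names, and the assumption
that the model signals a skip with a reply starting "INSUFFICIENT" is
not confirmed by the commit.

```python
from typing import Callable, Optional


def gather_chunks(entity: dict,
                  mentions_lookup: Callable[[str], list],
                  hybrid_search: Callable[[str], list],
                  chunk_k: int) -> tuple:
    """Return (chunks, path_used). Path B runs only when Path A is empty."""
    # Path A: chunks linked through entity_mentions (high precision).
    chunks = mentions_lookup(entity["id"])[:chunk_k]
    if chunks:
        return chunks, "entity_mentions"

    # Path B: search on canonical_name + aliases, de-duplicated by chunk id.
    terms = [entity["canonical_name"], *(entity.get("aliases") or [])]
    seen, merged = set(), []
    for term in terms:
        for chunk in hybrid_search(term):
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged[:chunk_k], "hybrid_search"


def resolve_status(model_reply: Optional[str]) -> str:
    """Map a model reply to summary_status (assumed INSUFFICIENT marker)."""
    if model_reply is None or model_reply.strip().upper().startswith("INSUFFICIENT"):
        return "refused"  # stored so the entity is not retried
    return "ai_generated"
```

Passing the two lookups in as callables keeps the sketch independent
of the database and search clients, which are not shown here.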

UI uplift:
  - lib/retrieval/entity-pages.ts: getEntityCore now prefers the DB
    summary (ai_generated or curated) over wiki YAML narrative.
  - components/entity-list-page.tsx:
    * SELECT now pulls summary_en, summary_pt_br, summary_status
    * Sorted with summary-enriched rows first (so the magazine grid
      lands on quality content immediately)
    * MagazineGrid: 4-line summary preview replaces aliases line
    * CompactGrid: enriched rows render as full editorial cards,
      bare rows fall back to a compact table below
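
The "enriched rows first" ordering can be illustrated with a stable
sort. This is a Python sketch rather than the component's TSX; the
function name is hypothetical, and it assumes "enriched" means a
summary_status of ai_generated or curated, matching what getEntityCore
prefers.

```python
ENRICHED_STATUSES = {"ai_generated", "curated"}


def sort_enriched_first(rows: list) -> list:
    """Put summary-enriched rows first, keeping the prior order otherwise.

    sorted() is stable, so rows within each group keep whatever order
    the query produced.
    """
    return sorted(rows, key=lambda r: r.get("summary_status") not in ENRICHED_STATUSES)
```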

Smoke results:
  - Kenneth Arnold sighting: "On June 24, 1947, pilot Kenneth Arnold
    reported sighting unidentified objects over the Pacific Northwest,
    and the account spread worldwide. It set off a run of similar
    reports: County Commissioner Crankes saw comparable objects after
    Arnold's account reached the press, and United Airlines pilot
    Emil H. Smith spotted flying discs on July 4 during a routine
    flight out of Boise, Idaho..."
  - Roswell Incident: includes Colonel Corso's 1997 book + the 1995
    GAO finding that radio messages from Oct 46–Feb 47 were destroyed
    + Senator Strom Thurmond's foreword. Real magazine-grade content.

Background bulk run kicked off across all 5 classes (event,
uap_object, person, location, organization); summaries populate live
as the homepage rebuilds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root-cause fix for "search returns garbage for absent terms". The hybrid RPC's
dense branch always returned its k nearest vectors regardless of distance, so a
query for a term not in the corpus (e.g. "varginha") surfaced unrelated chunks.
The cross-encoder reranker would filter these out, but it takes 18-62s on CPU,
which is unusable for interactive search.
Add max_dense_dist (default 0.40) to hybrid_search_chunks: dense neighbours
beyond that cosine distance are dropped server-side. Calibrated from measured
distances: a strong semantic match sits at ~0.12-0.20, no real match at
~0.46-0.53. BM25 full-text still matches literal terms; the reranker becomes an
opt-in refinement.
Verified live: varginha/abducao → 0 results; disco voador/roswell → relevant
hits; all queries under 1s.
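
The distance cutoff can be sketched in Python as below. The real
filter lives in the hybrid_search_chunks RPC and runs server-side;
dense_branch and the in-memory chunk dicts here are hypothetical
stand-ins, and only the max_dense_dist parameter and its 0.40 default
come from this change.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """1 - cosine similarity; 0.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dense_branch(query_vec: list, chunks: list, k: int,
                 max_dense_dist: float = 0.40) -> list:
    """k nearest chunks, minus any neighbour beyond max_dense_dist."""
    scored = [(cosine_distance(query_vec, c["embedding"]), c["id"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0])
    # Without this filter, a query with no real match still returns k rows.
    return [(dist, cid) for dist, cid in scored[:k] if dist <= max_dense_dist]
```

With the measured ranges above, a cutoff of 0.40 sits between the
strong-match band and the no-match band, so an absent term yields an
empty dense branch and only BM25 can contribute hits.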
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hybrid_search RPC always returns up to recall_k dense neighbours, so a
query for a term absent from the corpus (e.g. "varginha") returned its 12
nearest vectors, all irrelevant chunks such as PAGE_NUMBER "1". Two bugs
caused this: the reranker was skipped whenever results <= top_k, and there was
no relevance floor.
Now always run the cross-encoder reranker (BGE-reranker-v2-m3, normalized
sigmoid) and drop hits below 0.02. Verified: "varginha" → 0 results;
"roswell"/"tic tac"/"disco voador" → relevant hits on top (reranker cleanly
separates 0.0001 garbage from 0.03-0.27 matches).
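
A minimal Python sketch of the "always rerank, then apply a floor"
step. score_fn is a hypothetical stand-in for the BGE-reranker-v2-m3
call and is assumed to return a raw logit; treating "normalized
sigmoid" as a plain logistic squash of that logit is also an
assumption. Only the 0.02 floor comes from this change.

```python
import math
from typing import Callable


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rerank_and_filter(hits: list, score_fn: Callable[[dict], float],
                      top_k: int, min_score: float = 0.02) -> list:
    """Score every hit (even when len(hits) <= top_k), drop those under the floor."""
    scored = [(sigmoid(score_fn(hit)), hit) for hit in hits]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```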
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity
JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page
extraction. Add synthesize scripts to regenerate wiki/entities from the 116
_reextract.json files (script 30), aggregate missing page.md from chunks
(script 31), and reprocess 805 pages the doc-rebuilder agent dropped on
context overflow (script 32). Add maintain scripts 43-56 for chunk-page sync,
dedup, generic-entity marking, and typed relation extraction.
Web: wire relations API + entity-relations component; entity/timeline/doc
pages consume the rebuilt layer.
Note: raw/, processing/, wiki/ remain gitignored (bulk data managed
separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on
disk only. The 27 curated anchor events under wiki/entities/events/ are
preserved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" (0 mentions) for thousands of entities that
were anchored in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
  mentioned_in: [...]              # consolidated page refs
  total_mentions: max(db, pages)
  documents_count: max(db_docs, distinct page docs)
  signal_sources:
    db_chunks: int
    page_refs: int
    cross_refs: int
  signal_strength: strong | weak | orphan | unverified
  referenced_by: [[class/id]]      # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
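
The count consolidation can be sketched as below. The function name
and the input shapes (page refs as (doc_id, page_no) tuples, links as
"class/id" strings) are assumptions for illustration, and
signal_strength is left out because the commit does not state the
thresholds that separate strong, weak, and orphan.

```python
def merge_entity_stats(db_mentions: int, db_docs: int, page_refs: list,
                       incoming_links: list, outgoing_links: list) -> dict:
    """Collapse DB, page, and cross-entity signals into frontmatter fields."""
    unique_pages = sorted(set(page_refs))
    page_docs = {doc_id for doc_id, _page_no in unique_pages}
    return {
        "mentioned_in": unique_pages,
        # Take the larger of the two signals rather than summing them,
        # since the same mention may be seen by both extractors.
        "total_mentions": max(db_mentions, len(unique_pages)),
        "documents_count": max(db_docs, len(page_docs)),
        "signal_sources": {
            "db_chunks": db_mentions,
            "page_refs": len(unique_pages),
            # Outgoing wikilinks count too, so an anchored entity that
            # nobody links back to still has a non-zero cross_refs.
            "cross_refs": len(set(incoming_links)) + len(set(outgoing_links)),
        },
        "referenced_by": sorted(set(incoming_links)),
    }
```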
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP", derived from observed_in_event,
with the full description preserved as an alias. --fix-obj-names applies
this to every OBJ-* whose canonical_name is longer than 80 characters.
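
A rough sketch of the rename rule. How the short name is derived from
observed_in_event is not spelled out here, so this assumes a
hypothetical event_names lookup (event id to short label) and a
"<label> UAP" pattern inferred from the single example above;
fix_obj_name itself is also an illustrative name.

```python
MAX_NAME_LEN = 80  # names longer than this are treated as descriptions


def fix_obj_name(entity: dict, event_names: dict) -> dict:
    """Return a copy with a short canonical_name; the old name becomes an alias."""
    name = entity["canonical_name"]
    event_id = entity.get("observed_in_event")
    if (not entity["id"].startswith("OBJ-")
            or len(name) <= MAX_NAME_LEN
            or event_id not in event_names):
        return entity  # nothing to fix, or no event to derive a name from

    aliases = list(entity.get("aliases") or [])
    if name not in aliases:
        aliases.append(name)  # keep the full description searchable
    return {**entity,
            "canonical_name": f"{event_names[event_id]} UAP",
            "aliases": aliases}
```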
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
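
The conservative archive rule can be expressed as a small predicate.
This is a sketch with a hypothetical function name; it reads
"pure-numeric" as "digits only after trimming whitespace", which may
be narrower than what the script actually treats as numeric noise.

```python
def is_junk_name(canonical_name: str) -> bool:
    """True for single/double-char names and digit-only names."""
    stripped = canonical_name.strip()
    return len(stripped) <= 2 or stripped.isdigit()


def split_for_archive(names: list) -> tuple:
    """Partition names into (archived, kept) under --archive-only-junk."""
    archived = [n for n in names if is_junk_name(n)]
    kept = [n for n in names if not is_junk_name(n)]
    return archived, kept
```

Everything in kept stays on disk with its signal_strength marked, so a
false negative costs only a later manual review.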
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" (signal strength) badge with the per-source
breakdown. Orphan entities get a banner explaining they have no
cross-references.
The DB is still queried for ONE thing: the chunk text for preview cards
on the entity page, so we don't re-parse 21k markdown files on every
render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan.
OBJ-EV1945-PEYERLSHOTDOWN-01 now reads
"Peyerl shot down UAP · fraca · 1 backlink" ("fraca" = weak) in the
live UI.