disclosure-bureau/web/lib/retrieval
Luiz Gustavo 504b20fa5c search: gate dense recall by cosine-distance threshold in the RPC
Root-cause fix for "search returns garbage for absent terms". The hybrid RPC's
dense branch always returned its k nearest vectors regardless of distance, so a
query for a term not in the corpus (e.g. "varginha") surfaced unrelated chunks.
The cross-encoder reranker would filter these out, but it costs 18-62s on CPU,
which is unusable for interactive search.

Add max_dense_dist (default 0.40) to hybrid_search_chunks: dense neighbours
beyond that cosine distance are dropped server-side. Calibrated from measured
distances — strong semantic match ~0.12-0.20, no real match ~0.46-0.53. BM25
full-text still matches literal terms; the reranker becomes opt-in refinement.
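
The gating rule described above can be sketched in TypeScript. This is only an illustration of the threshold logic, not the RPC itself: the commit applies the cut server-side inside hybrid_search_chunks, and the `DenseHit` shape, the `gateDenseHits` helper, and the sample distances below are assumptions chosen to fall inside the ranges quoted in this message.

```typescript
// Hypothetical shape of one dense-branch hit; field names are assumptions.
interface DenseHit {
  chunkId: string;
  cosineDistance: number; // 0 = same direction, larger = less similar
}

// Default value taken from the commit message (max_dense_dist = 0.40).
const DEFAULT_MAX_DENSE_DIST = 0.40;

// Keep only neighbours at or within the distance threshold, preserving order.
// Neighbours beyond the threshold are dropped, even if that leaves no hits.
function gateDenseHits(
  hits: DenseHit[],
  maxDenseDist: number = DEFAULT_MAX_DENSE_DIST,
): DenseHit[] {
  return hits.filter((hit) => hit.cosineDistance <= maxDenseDist);
}

// Illustrative distances only, picked from the ranges in the commit message:
// strong semantic match ~0.12-0.20, no real match ~0.46-0.53.
const queryWithMatches: DenseHit[] = [
  { chunkId: "a", cosineDistance: 0.12 },
  { chunkId: "b", cosineDistance: 0.20 },
  { chunkId: "c", cosineDistance: 0.47 },
];
const queryForAbsentTerm: DenseHit[] = [
  { chunkId: "x", cosineDistance: 0.46 },
  { chunkId: "y", cosineDistance: 0.53 },
];

const kept = gateDenseHits(queryWithMatches);
const keptForAbsentTerm = gateDenseHits(queryForAbsentTerm);
console.log(kept.map((hit) => hit.chunkId), keptForAbsentTerm.length);
```

With an ungated k-nearest lookup, the second query would still return both of its neighbours. With the threshold, the dense branch contributes nothing for it, and only BM25 literal matches (if any) remain.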

Verified live: varginha/abducao → 0 results, disco voador/roswell → relevant results, all under 1s.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 16:36:56 -03:00
db.ts baseline: Disclosure Bureau pipeline + Next.js UI + Supabase stack 2026-05-17 22:44:36 -03:00
embed.ts baseline: Disclosure Bureau pipeline + Next.js UI + Supabase stack 2026-05-17 22:44:36 -03:00
entity-pages.ts rebuild entity layer from Sonnet-vision reextract pipeline 2026-05-21 12:20:24 -03:00
graph.ts rebuild entity layer from Sonnet-vision reextract pipeline 2026-05-21 12:20:24 -03:00
hybrid.ts search: gate dense recall by cosine-distance threshold in the RPC 2026-05-21 16:36:56 -03:00