Root-cause fix for "search returns garbage for absent terms". The hybrid RPC's
dense branch always returned its k nearest vectors regardless of distance, so a
query for a term not in the corpus (e.g. "varginha") surfaced unrelated chunks.
The cross-encoder reranker would filter these but costs 18-62s on CPU —
unusable for interactive search.
Add max_dense_dist (default 0.40) to hybrid_search_chunks: dense neighbours
beyond that cosine distance are dropped server-side. Calibrated from measured
distances — strong semantic match ~0.12-0.20, no real match ~0.46-0.53. BM25
full-text still matches literal terms; the reranker becomes opt-in refinement.
Verified live: varginha/abducao → 0, disco voador/roswell → relevant, all <1s.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hybrid_search RPC always returns up to recall_k dense neighbours, so a
query for a term absent from the corpus (e.g. "varginha") returned its 12
nearest vectors — irrelevant chunks like PAGE_NUMBER "1". Two bugs:
the reranker was skipped whenever results <= top_k, and there was no relevance
floor.
Now always run the cross-encoder reranker (BGE-reranker-v2-m3, normalized
sigmoid) and drop hits below 0.02. Verified: "varginha" → 0 results;
"roswell"/"tic tac"/"disco voador" → relevant hits on top (reranker cleanly
separates 0.0001 garbage from 0.03-0.27 matches).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>