# Retrieval Pipeline — disclosure.top chunks layer

Hybrid retrieval over the agentic chunks (`raw/<doc-id>--subagent/`) using:

- **BGE-M3** dense embeddings (1024-dim, multilingual, self-hosted, free)
- **pgvector HNSW** index (Postgres 15.8.1 in disclosure-stack; the supabase image ships pgvector)
- **Postgres tsvector** lexical search, the "BM25" leg (`pt_unaccent` + `en_unaccent` configs)
- **BGE-Reranker-v2-M3** cross-encoder rerank (self-hosted)
- **RRF fusion** of BM25 + dense → reranker → final top-k

Cost: ~$0/month, with no setup cost. The LLM stays on OpenRouter (`deepseek-v4-flash:free` or a paid model of choice).
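
The RRF fusion step listed above can be sketched in a few lines. This is a minimal illustration, not the code inside `public.hybrid_search_chunks`: the function name `rrf_fuse`, the constant `k=60` (the conventional default for Reciprocal Rank Fusion), and the sample chunk ids are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default; the real RPC may use another value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical result lists from the lexical and dense legs.
bm25_hits = ["c0042", "c0007", "c0013"]
dense_hits = ["c0007", "c0099", "c0042"]

fused = rrf_fuse([bm25_hits, dense_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}  {score:.5f}")
```

Chunks that appear in both lists (`c0007`, `c0042`) rise to the top, and that fused list is what would then be passed to the reranker.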
## Components
| Path | Purpose |
|---|---|
| `infra/embed-service/` | Python FastAPI on CPU — BGE-M3 + reranker |
| `infra/supabase/migrations/0002_chunks_retrieval.sql` | pgvector + tsvector + chunks/documents/entities tables + `public.hybrid_search_chunks` RPC |
| `scripts/30-index-chunks-to-db.py` | Reads `raw/<doc>--subagent/_index.json` and `chunks/c*.md`, embeds via embed-service, UPSERTs to Postgres |
| `web/lib/retrieval/` | TS client (`db.ts`, `embed.ts`, `hybrid.ts`) |
| `web/lib/chat/tools.ts` | `hybrid_search`, `read_chunk`, `get_page_chunks`, `list_anomalies` tools |
| `web/app/api/crop/` | On-demand bbox crop service (sharp) — used by chunk views + chat citations |
| `web/app/d/[docId]/v2/` | Rich render using chunks (inline images + tables + cite anchors) |
## End-to-end flow
```
PDFs (raw/*.pdf)
  │  pdftoppm 72 DPI + pdftotext
  ▼
processing/png/<doc>/p-NNN.png + processing/ocr/<doc>/p-NNN.txt
  │  Sonnet 4.6 via Claude Code subagents (scripts/28-batch-rebuild-all.py)
  ▼
raw/<doc>--subagent/
  ├── _index.json
  ├── chunks/c0001.md ... c<NNNN>.md   (bilingual EN+PT, bbox, anomaly flags)
  ├── images/IMG-c<NNNN>.png           (crops with image-analyst description)
  └── tables/TBL-NNN.csv               (stitched multi-page tables)
  │  scripts/30-index-chunks-to-db.py + embed-service
  ▼
Postgres
  ├── public.documents        (1 row per doc)
  ├── public.chunks           (1 row per chunk; embedding vector(1024))
  ├── public.entities         (1 row per canonical entity)
  └── public.entity_mentions  (chunk ↔ entity link, materialized by lint)
  │  web/lib/retrieval/hybrid.ts → public.hybrid_search_chunks RPC + /rerank
  ▼
Chat agent (OpenRouter) calls `hybrid_search` tool → cites [[doc/p007#c0042]]
  │  /api/crop returns the bbox region
  ▼
Frontend renders inline crop + bilingual text + link to original page
```
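
The citation anchors in the flow above (`[[doc/p007#c0042]]`) are easy to parse back into a document id, page, and chunk number. The sketch below is illustrative only: the `Citation` type, the regex, and the helper names are assumptions, and it assumes document ids contain no `/` or `]`.

```python
import re
from typing import NamedTuple


class Citation(NamedTuple):
    doc_id: str
    page: int
    chunk: int


# Matches anchors shaped like [[<doc-id>/p<page>#c<chunk>]].
CITATION_RE = re.compile(r"\[\[(?P<doc>[^\]/]+)/p(?P<page>\d+)#c(?P<chunk>\d+)\]\]")


def parse_citations(text: str) -> list[Citation]:
    """Return every citation anchor found in an LLM answer, in order."""
    return [
        Citation(m["doc"], int(m["page"]), int(m["chunk"]))
        for m in CITATION_RE.finditer(text)
    ]


def chunk_filename(citation: Citation) -> str:
    """Map a citation to its chunk file, following the c0001.md naming above."""
    return f"chunks/c{citation.chunk:04d}.md"


answer = "The sighting is logged in [[doc/p007#c0042]] and [[abc-1/p12#c0003]]."
for cite in parse_citations(answer):
    print(cite, "->", chunk_filename(cite))
```

A frontend could use the parsed page and chunk numbers to look up the chunk's bbox and request the matching crop.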
## Deploy
```bash
# 1. Build & ship embed-service image to VPS
cd infra/disclosure-stack
./scripts/bootstrap.sh   # picks up new embed service + 0002 migration

# 2. Apply the chunks-retrieval migration inside the DB container on the VPS
ssh vps "cd /data/disclosure && docker exec -i disclosure-db psql -U postgres \
  < migrations/02-chunks-retrieval.sql"

# 3. Run indexer after the batch rebuild completes (or incrementally);
#    on VPS, after embed-service is healthy
cd /Users/guto/ufo   # repo checkout; adjust to your path
DATABASE_URL='postgres://...' EMBED_SERVICE_URL='http://localhost:8000' \
  python3 scripts/30-index-chunks-to-db.py --skip-existing
```
## Performance budget (CPU-only VPS, 16GB RAM)
- BGE-M3 cold load: ~5-8 s; warm embed (single text): ~150-300 ms
- Embed batch of 16 chunks: ~800-1500 ms
- Indexing 100 chunks/doc × 115 docs = ~11,500 chunks → ~15-25 min total
- pgvector HNSW top-100 candidate retrieval over 150k chunks: <30 ms
- BGE-Reranker on 100 candidates: 5-8 s
- End-to-end chat query (recall + rerank + LLM): ~6-12 s

Tune later: switch the reranker to a batch of 50 if latency feels slow; run BGE-M3 in fp16 if a GPU is available.
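
As a sanity check on the indexing estimate, the arithmetic can be worked through from the batch figures in the list above. The helper name `embed_time_minutes` is hypothetical, and the calculation covers embedding time only.

```python
import math


def embed_time_minutes(total_chunks: int, batch_size: int, batch_seconds: float) -> float:
    """Embedding time in minutes when chunks are sent in fixed-size batches."""
    batches = math.ceil(total_chunks / batch_size)
    return batches * batch_seconds / 60


total_chunks = 100 * 115  # 100 chunks/doc x 115 docs
low = embed_time_minutes(total_chunks, batch_size=16, batch_seconds=0.8)
high = embed_time_minutes(total_chunks, batch_size=16, batch_seconds=1.5)

print(f"{total_chunks} chunks -> {low:.1f}-{high:.1f} min of embedding")
# prints: 11500 chunks -> 9.6-18.0 min of embedding
```

Embedding alone therefore accounts for roughly 10-18 min of the ~15-25 min total. The remainder is presumably file reads and Postgres UPSERTs, though that split is an assumption rather than a measurement.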