# Retrieval Pipeline — disclosure.top chunks layer

Hybrid retrieval over the agentic chunks (`raw/<doc-id>--subagent/`) using:

- **BGE-M3** dense embeddings (1024-dim, multilingual, self-hosted, free)
- **pgvector HNSW** index (Postgres 15.8.1 in disclosure-stack; the Supabase image ships with pgvector)
- **Postgres tsvector** BM25 (`pt_unaccent` + `en_unaccent` configs)
- **BGE-Reranker-v2-M3** cross-encoder rerank (self-hosted)
- **RRF fusion** of BM25 + dense → reranker → final top-k

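
The fusion step in the list above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the body of `public.hybrid_search_chunks` (which this document does not show). The function name `rrf_fuse`, the constant `k=60` (the value commonly used for RRF), and the sample chunk ids are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 100) -> List[str]:
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    k=60 is an assumed default; the real RPC may use a different constant.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical results from the lexical and dense legs of the search.
bm25_hits = ["c0042", "c0007", "c0101"]
dense_hits = ["c0042", "c0300", "c0007"]

candidates = rrf_fuse([bm25_hits, dense_hits])
print(candidates)
```

Chunks that appear in both lists accumulate two contributions, so they rise above chunks found by only one leg. The fused list is what would then be handed to the reranker.
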
Cost: ~$0/month, with $0 initial setup. The LLM stays on OpenRouter (`deepseek-v4-flash:free` or a paid model of your choice).

## Components

| Path | Purpose |
|---|---|
| `infra/embed-service/` | Python FastAPI on CPU: BGE-M3 + reranker |
| `infra/supabase/migrations/0002_chunks_retrieval.sql` | pgvector + tsvector + chunks/documents/entities tables + `public.hybrid_search_chunks` RPC |
| `scripts/30-index-chunks-to-db.py` | Reads `raw/<doc>--subagent/_index.json` + `chunks/c*.md`, embeds via embed-service, UPSERTs to Postgres |
| `web/lib/retrieval/` | TS client (`db.ts`, `embed.ts`, `hybrid.ts`) |
| `web/lib/chat/tools.ts` | `hybrid_search`, `read_chunk`, `get_page_chunks`, `list_anomalies` tools |
| `web/app/api/crop/` | On-demand bbox crop service (sharp), used by chunk views + chat citations |
| `web/app/d/[docId]/v2/` | Rich render using chunks (inline images + tables + cite anchors) |

## End-to-end flow

```
PDFs (raw/*.pdf)
   │ pdftoppm 72 DPI + pdftotext
   ▼
processing/png/<doc>/p-NNN.png + processing/ocr/<doc>/p-NNN.txt
   │ Sonnet 4.6 via Claude Code subagents (scripts/28-batch-rebuild-all.py)
   ▼
raw/<doc>--subagent/
   ├── _index.json
   ├── chunks/c0001.md ... c<NNNN>.md   (bilingual EN+PT, bbox, anomaly flags)
   ├── images/IMG-c<NNNN>.png           (crops with image-analyst description)
   └── tables/TBL-NNN.csv               (stitched multi-page tables)
   │ scripts/30-index-chunks-to-db.py + embed-service
   ▼
Postgres
   ├── public.documents        (1 row per doc)
   ├── public.chunks           (1 row per chunk; embedding vector(1024))
   ├── public.entities         (1 row per canonical entity)
   └── public.entity_mentions  (chunk ↔ entity link, materialized by lint)
   │ web/lib/retrieval/hybrid.ts → public.hybrid_search_chunks RPC + /rerank
   ▼
Chat agent (OpenRouter) calls `hybrid_search` tool → cites [[doc/p007#c0042]]
   │ /api/crop returns the bbox region
   ▼
Frontend renders inline crop + bilingual text + link to original page
```

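
The citation anchor in the last stage of the flow, `[[doc/p007#c0042]]`, can be split into its parts with a small parser. The sketch below assumes the shape `[[<doc-id>/p<page>#c<chunk>]]` inferred from that single example. The names `Citation` and `parse_citations` and the sample doc id `doc-17` are hypothetical, and the real frontend may parse citations differently.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    doc_id: str
    page: int
    chunk_id: str


# Assumed format: [[<doc-id>/p<page>#c<chunk>]], e.g. [[doc/p007#c0042]].
CITATION_RE = re.compile(r"\[\[([^\]/]+)/p(\d+)#(c\d+)\]\]")


def parse_citations(text: str) -> List[Citation]:
    """Return every citation anchor found in a chat answer, in order."""
    return [
        Citation(doc_id=m.group(1), page=int(m.group(2)), chunk_id=m.group(3))
        for m in CITATION_RE.finditer(text)
    ]


answer = "The radar log is on [[doc-17/p007#c0042]]; the table follows in [[doc-17/p012#c0101]]."
citations = parse_citations(answer)
for c in citations:
    print(c.doc_id, c.page, c.chunk_id)
```

Each parsed citation carries enough information to look up the chunk's bbox and request the matching region from `/api/crop`.
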
## Deploy

```bash
# 1. Build & ship embed-service image to VPS
cd infra/disclosure-stack
./scripts/bootstrap.sh   # picks up new embed service + 0002 migration

# 2. Apply the chunks-retrieval migration on the VPS (prerequisite for indexing)
ssh vps "cd /data/disclosure && docker exec -i disclosure-db psql -U postgres \
  < migrations/02-chunks-retrieval.sql"

# 3. Run indexer after the batch rebuild completes (or incrementally), once embed-service is healthy
# NOTE: /Users/guto/ufo is a local checkout path; use the repo path on the machine
# that can reach both the database and embed-service.
cd /Users/guto/ufo
DATABASE_URL='postgres://...' EMBED_SERVICE_URL='http://localhost:8000' \
  python3 scripts/30-index-chunks-to-db.py --skip-existing
```

## Performance budget (CPU-only VPS, 16GB RAM)

- BGE-M3 cold load: ~5-8 s; warm embed (single text): ~150-300 ms
- Embed batch of 16 chunks: ~800-1500 ms
- Indexing: ~100 chunks/doc × 115 docs ≈ 11,500 chunks → ~15-25 min total
- pgvector HNSW top-100 retrieval from 150k chunks: <30 ms
- BGE-Reranker on 100 candidates: ~5-8 s
- End-to-end chat query (recall + rerank + LLM): ~6-12 s

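
The indexing figure can be sanity-checked from the batch numbers above. The sketch below estimates embedding time only, assuming sequential batches of 16. The helper name `embed_minutes` is hypothetical. The estimate comes out at roughly 10 to 18 minutes, so the ~15-25 min total presumably also covers file reads and Postgres UPSERTs (an assumption, not a measurement).

```python
import math


def embed_minutes(n_chunks: int, batch_size: int, secs_per_batch: float) -> float:
    """Minutes spent embedding, assuming batches run one after another."""
    batches = math.ceil(n_chunks / batch_size)
    return batches * secs_per_batch / 60


n_chunks = 100 * 115  # ~100 chunks/doc x 115 docs
low = embed_minutes(n_chunks, batch_size=16, secs_per_batch=0.8)
high = embed_minutes(n_chunks, batch_size=16, secs_per_batch=1.5)
print(f"{n_chunks} chunks: {low:.1f}-{high:.1f} min of embedding")
```
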
Tune later: rerank a batch of 50 candidates instead of 100 if latency feels slow; use BGE-M3 in fp16 if a GPU is available.