Retrieval Pipeline — disclosure.top chunks layer

Hybrid retrieval over the agentic chunks (raw/<doc>--subagent/), using:

  • BGE-M3 dense embeddings (1024-dim, multilingual, self-hosted, free)
  • pgvector HNSW index (Postgres 15.8.1 in disclosure-stack, supabase image ships pgvector)
  • Postgres tsvector BM25 (pt_unaccent + en_unaccent configs)
  • BGE-Reranker-v2-M3 cross-encoder rerank (self-hosted)
  • RRF fusion of BM25 + dense → reranker → final top-k
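
The fusion step in the last bullet can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not the code behind public.hybrid_search_chunks (that lives in the migration and in web/lib/retrieval/hybrid.ts); the function name `rrf_fuse`, the constant `k=60`, and the sample chunk ids are assumptions chosen for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse several best-first lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per chunk id. k=60 is a commonly
    used default, assumed here; the real pipeline may use another value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical result lists from the lexical and dense legs.
bm25_hits = ["c0042", "c0007", "c0013"]
dense_hits = ["c0007", "c0099", "c0042"]
fused = rrf_fuse([bm25_hits, dense_hits])
print(fused)
```

Chunks that appear in both lists rise to the top because their reciprocal-rank contributions add up; the fused list is then what gets handed to the reranker.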

Cost: ~$0/month, with no setup cost. The LLM stays on OpenRouter (deepseek-v4-flash:free, or a paid model of choice).

Components

| Path | Purpose |
|---|---|
| infra/embed-service/ | Python FastAPI on CPU: BGE-M3 + reranker |
| infra/supabase/migrations/0002_chunks_retrieval.sql | pgvector + tsvector + chunks/documents/entities tables + public.hybrid_search_chunks RPC |
| scripts/30-index-chunks-to-db.py | Reads raw/<doc>--subagent/_index.json + chunks/c*.md, embeds via embed-service, UPSERTs to Postgres |
| web/lib/retrieval/ | TS client (db.ts, embed.ts, hybrid.ts) |
| web/lib/chat/tools.ts | hybrid_search, read_chunk, get_page_chunks, list_anomalies tools |
| web/app/api/crop/ | On-demand bbox crop service (sharp), used by chunk views + chat citations |
| web/app/d/[docId]/v2/ | Rich render using chunks (inline images + tables + cite anchors) |

End-to-end flow

PDFs (raw/*.pdf)
   │ pdftoppm 72 DPI + pdftotext
   ▼
processing/png/<doc>/p-NNN.png + processing/ocr/<doc>/p-NNN.txt
   │ Sonnet 4.6 via Claude Code subagents (scripts/28-batch-rebuild-all.py)
   ▼
raw/<doc>--subagent/
   ├── _index.json
   ├── chunks/c0001.md ... c<NNNN>.md  (bilingual EN+PT, bbox, anomaly flags)
   ├── images/IMG-c<NNNN>.png          (crops with image-analyst description)
   └── tables/TBL-NNN.csv              (stitched multi-page tables)
   │ scripts/30-index-chunks-to-db.py + embed-service
   ▼
Postgres
   ├── public.documents      (1 row per doc)
   ├── public.chunks         (1 row per chunk; embedding vector(1024))
   ├── public.entities       (1 row per canonical entity)
   └── public.entity_mentions (chunk ↔ entity link, materialized by lint)
   │ web/lib/retrieval/hybrid.ts → public.hybrid_search_chunks RPC + /rerank
   ▼
Chat agent (OpenRouter) calls `hybrid_search` tool → cites [[doc/p007#c0042]]
   │ /api/crop returns the bbox region
   ▼
Frontend renders inline crop + bilingual text + link to original page
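
The citation anchors in the flow above have to be parsed back into a document, page, and chunk before a crop can be requested. A minimal sketch follows, assuming the anchor format is exactly what the single example `[[doc/p007#c0042]]` suggests; the regex, the `Citation` type, and `parse_citations` are hypothetical names, not taken from the web client.

```python
import re
from typing import List, NamedTuple

# Assumed anchor shape, inferred from the one example [[doc/p007#c0042]]:
# [[<doc id>/p<page number>#c<chunk number>]]
CITATION_RE = re.compile(
    r"\[\[(?P<doc>[^/\]]+)/p(?P<page>\d+)#(?P<chunk>c\d+)\]\]"
)


class Citation(NamedTuple):
    doc_id: str
    page: int
    chunk_id: str


def parse_citations(text: str) -> List[Citation]:
    """Return every citation anchor found in an LLM answer, in order."""
    return [
        Citation(m["doc"], int(m["page"]), m["chunk"])
        for m in CITATION_RE.finditer(text)
    ]


# Hypothetical answer text; the second document id is made up.
answer = "See [[doc/p007#c0042]] and [[memo-1952/p012#c0003]]."
citations = parse_citations(answer)
print(citations)
```

Each parsed citation carries enough information to look up the chunk's bbox and ask /api/crop for the matching page region.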

Deploy

# 1. Build & ship embed-service image to VPS
cd infra/disclosure-stack
./scripts/bootstrap.sh         # picks up new embed service + 0002 migration

# 2. Apply the retrieval migration manually (only needed if bootstrap did not apply it)
ssh vps "cd /data/disclosure && docker exec -i disclosure-db psql -U postgres \
   < migrations/02-chunks-retrieval.sql"

# 3. Run indexer (on VPS, after embed-service is healthy); rerun after each batch rebuild, or incrementally
cd /Users/guto/ufo
DATABASE_URL='postgres://...' EMBED_SERVICE_URL='http://localhost:8000' \
  python3 scripts/30-index-chunks-to-db.py --skip-existing

Performance budget (CPU-only VPS, 16GB RAM)

  • BGE-M3 cold load: ~5-8 s; warm embed (single text): ~150-300 ms
  • Embed batch of 16 chunks: ~800-1500 ms
  • Indexing 100 chunks/doc × 115 docs = ~11,500 chunks → ~15-25 min total
  • pgvector HNSW top-100 candidate retrieval over 150k chunks: <30 ms
  • BGE-Reranker on 100 candidates: 5-8 s
  • End-to-end chat query (recall + rerank + LLM): ~6-12 s
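
As a sanity check on the indexing figure above, the embedding time can be derived from the batch numbers in the same list. This is a back-of-the-envelope sketch, not a benchmark; the helper `estimate_indexing_minutes` is a hypothetical name and it counts embedding time only.

```python
import math


def estimate_indexing_minutes(docs, chunks_per_doc, batch_size,
                              batch_ms_low, batch_ms_high):
    """Estimate embedding-only time for a full index run, in minutes."""
    total_chunks = docs * chunks_per_doc
    batches = math.ceil(total_chunks / batch_size)
    low = batches * batch_ms_low / 60_000
    high = batches * batch_ms_high / 60_000
    return total_chunks, batches, low, high


# Figures from the budget: 115 docs x 100 chunks, batches of 16 at 800-1500 ms.
total, batches, low, high = estimate_indexing_minutes(115, 100, 16, 800, 1500)
print(f"{total} chunks, {batches} batches, {low:.1f}-{high:.1f} min embedding")
```

Embedding alone comes to roughly 10-18 minutes, so the quoted ~15-25 min total leaves headroom for reading chunk files and the Postgres UPSERTs.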

Tune later: if latency is too high, cut the reranker candidate set from 100 to 50; run BGE-M3 in fp16 if a GPU becomes available.
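
The candidate-count knob can be sketched as a small rerank stage in which the pool size is a parameter. This is an illustration only: `rerank_top_k`, `max_candidates`, `top_k`, and the toy `overlap_score` are hypothetical. In the real pipeline the scores would come from the embed-service /rerank endpoint, whose request shape is not documented here, so the scorer is passed in as a plain callable.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    candidates: Sequence[Tuple[str, str]],  # (chunk_id, text), best-first from fusion
    score_fn: Callable[[str, List[str]], List[float]],
    max_candidates: int = 100,
    top_k: int = 8,
) -> List[str]:
    """Score at most max_candidates chunks and return the top_k chunk ids."""
    pool = list(candidates[:max_candidates])  # smaller pool = fewer cross-encoder passes
    scores = score_fn(query, [text for _, text in pool])
    ranked = sorted(zip(pool, scores), key=lambda pair: -pair[1])
    return [chunk_id for (chunk_id, _), _ in ranked[:top_k]]


def overlap_score(query: str, texts: List[str]) -> List[float]:
    """Toy stand-in for the cross-encoder: counts shared lowercase words."""
    q = set(query.lower().split())
    return [float(len(q & set(t.lower().split()))) for t in texts]


# Hypothetical fused candidates.
candidates = [
    ("c0001", "weather balloon report"),
    ("c0002", "radar contact over base"),
    ("c0003", "radar operator log"),
]
best = rerank_top_k("radar contact", candidates, overlap_score, top_k=2)
print(best)
```

Because cross-encoder cost grows with the number of (query, chunk) pairs scored, lowering `max_candidates` trades some recall for latency: a relevant chunk ranked below the cutoff by fusion is never seen by the reranker.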