disclosure-bureau/infra/embed-service
2026-05-17 22:44:36 -03:00

embed-service — BGE-M3 + BGE-Reranker-v2-M3 microservice

Self-hosted on the VPS, CPU-only, using about 2.5 GB of RAM in steady state. Powers hybrid retrieval over the chunks corpus.

Endpoints

  • POST /embed: batch dense embeddings (1024 dimensions, normalized for cosine similarity)
  • POST /rerank: cross-encoder reranking of a candidate list
  • GET /health, GET /info
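If "normalized" here means that /embed returns unit-length vectors (an assumption, since the response format is not documented in this README), then cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch with toy 3-dimensional vectors standing in for the 1024-dimensional output; the helper names are hypothetical:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_of_normalized(a, b):
    """For unit-length vectors, cosine similarity equals the dot product."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


# Toy vectors; real embeddings from the service would have 1024 components.
query = l2_normalize([3.0, 4.0, 0.0])
doc = l2_normalize([4.0, 3.0, 0.0])
score = cosine_of_normalized(query, doc)
```

Skipping the division by the norms at query time is the practical benefit: similarity search over stored vectors only needs inner products.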

Resource expectations

Op                           Cold         Warm         Notes
Embed 1 chunk (~400 tokens)  ~5 s (load)  100-200 ms   first request loads model
Embed batch of 16            -            800-1500 ms  use during indexing
Rerank 100 candidates        -            5-8 s        called per query post-recall
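The batch-of-16 figure above suggests grouping chunks before calling /embed during indexing. The following is a rough sketch of that grouping plus a back-of-the-envelope time estimate based on the warm 800-1500 ms range; `batches` and `estimate_indexing_seconds` are hypothetical helpers, not part of the service:

```python
import math
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 16  # batch size taken from the table above


def batches(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def estimate_indexing_seconds(
    n_chunks: int,
    size: int = BATCH_SIZE,
    per_batch_ms: Tuple[int, int] = (800, 1500),  # warm range from the table
) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds for embedding n_chunks."""
    n_batches = math.ceil(n_chunks / size)
    low_ms, high_ms = per_batch_ms
    return n_batches * low_ms / 1000, n_batches * high_ms / 1000
```

For example, `estimate_indexing_seconds(1600)` covers 100 batches. The estimate ignores the cold-start model load and any network overhead.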

Add to disclosure-stack

embed:
  build: ../embed-service
  restart: unless-stopped
  networks: [internal]
  environment:
    DEVICE: cpu
    EMBED_MODEL: BAAI/bge-m3
    RERANK_MODEL: BAAI/bge-reranker-v2-m3
  volumes:
    - hf-cache:/cache
  deploy:
    resources:
      limits:
        memory: 3g
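The environment block above implies the service reads DEVICE, EMBED_MODEL, and RERANK_MODEL at startup. How app.py actually does this is not shown here, so the following is only an illustrative sketch; the `Settings` class, `load_settings` function, and the fallback defaults (which simply mirror the compose values) are assumptions:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    device: str
    embed_model: str
    rerank_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read service settings from environment variables.

    The defaults are assumed for illustration; they mirror the compose file.
    """
    env = os.environ if env is None else env
    return Settings(
        device=env.get("DEVICE", "cpu"),
        embed_model=env.get("EMBED_MODEL", "BAAI/bge-m3"),
        rerank_model=env.get("RERANK_MODEL", "BAAI/bge-reranker-v2-m3"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the process environment.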

Test locally

docker build -t embed-service .
docker run --rm -p 8000:8000 -v hf-cache:/cache embed-service
curl -s http://localhost:8000/health
curl -s -X POST http://localhost:8000/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts":["UAP avistado sobre Olathe, Kansas em 6 de janeiro de 1950"]}'

Note: The first request downloads the model weights (~2.3 GB total); subsequent requests hit the cache. Mount hf-cache as a named volume so the weights persist across restarts.
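The curl call to /embed can also be made from Python using only the standard library. This is a sketch under stated assumptions: the request body follows the `{"texts": [...]}` shape from the curl example, the response schema is not documented here so the parsed JSON is returned as-is, and the 600-second timeout is an arbitrary generous value chosen to leave room for the first-request weight download.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the docker run port mapping above


def build_embed_request(texts, base_url=BASE_URL):
    """Build the POST /embed request without sending it."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url=BASE_URL, timeout=600.0):
    """Send the request and return the decoded JSON response unchanged."""
    req = build_embed_request(texts, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `embed(["some text"])` would return whatever JSON the service produces. Inspect that structure before relying on specific keys.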