# embed-service: BGE-M3 + BGE-Reranker-v2-M3 microservice

Self-hosted on the VPS, CPU-only. Uses ~2.5 GB RAM in steady state. Powers hybrid retrieval over the chunks corpus.

## Endpoints

- `POST /embed`: batch dense embedding (1024 dimensions, L2-normalized for cosine similarity)
- `POST /rerank`: cross-encoder rerank for candidate lists
- `GET /health`, `GET /info`

## Resource expectations

| Op | Cold | Warm | Notes |
|---|---|---|---|
| Embed 1 chunk (~400 tokens) | ~5 s (load) | 100-200 ms | first request loads the model |
| Embed batch of 16 | n/a | 800-1500 ms | use during indexing |
| Rerank 100 candidates | n/a | 5-8 s | called per query, after recall |

## Add to disclosure-stack

```yaml
embed:
  build: ../embed-service
  restart: unless-stopped
  networks: [internal]
  environment:
    DEVICE: cpu
    EMBED_MODEL: BAAI/bge-m3
    RERANK_MODEL: BAAI/bge-reranker-v2-m3
  volumes:
    - hf-cache:/cache
  deploy:
    resources:
      limits:
        memory: 3g
```

## Test locally

```bash
docker build -t embed-service .
docker run --rm -p 8000:8000 -v hf-cache:/cache embed-service

# In a second terminal, once the container is up:
curl -s http://localhost:8000/health
curl -s -X POST http://localhost:8000/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts":["UAP avistado sobre Olathe, Kansas em 6 de janeiro de 1950"]}'
```

Note: The first request downloads the model weights (~2.3 GB total). Subsequent requests hit the cache. Mount `hf-cache` as a named volume so the weights persist across restarts.
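
For indexing, the resource table suggests sending chunks in batches of 16. The sketch below shows one way a client might do that with only the Python standard library. The `{"texts": [...]}` request body matches the curl example above; the `"embeddings"` response key, the 60-second timeout, and the helper names (`batched`, `post_embed`, `embed_corpus`, `cosine`) are assumptions made for illustration, so check them against what the service actually returns.

```python
import json
import urllib.request
from typing import Callable, Iterator, List, Sequence

EMBED_URL = "http://localhost:8000/embed"  # port from the local test above
BATCH_SIZE = 16  # batch size listed in the resource table


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def post_embed(texts: List[str], url: str = EMBED_URL) -> List[List[float]]:
    """POST one batch to /embed and return one vector per input text.

    ASSUMPTION: the response is JSON with an "embeddings" list.
    Adjust the key to match the service's real response schema.
    """
    body = json.dumps({"texts": texts}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Hypothetical timeout; the very first call may need longer
    # while the model weights download.
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def embed_corpus(
    chunks: Sequence[str],
    send: Callable[[List[str]], List[List[float]]] = post_embed,
) -> List[List[float]]:
    """Embed all chunks, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks):
        vectors.extend(send(batch))
    return vectors


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity only for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

With this shape, a 40-chunk corpus goes out as three requests (16, 16, and 8 texts). Because the vectors are normalized, a plain dot product is enough to compare them. The `send` parameter lets the HTTP call be swapped for a stub when testing without a running container.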