# embed-service — BGE-M3 + BGE-Reranker-v2-M3 microservice

Self-hosted on the VPS, CPU-only. ~2.5 GB RAM in steady state. Powers hybrid retrieval over the chunks corpus.

## Endpoints

- `POST /embed` — batch dense embedding (1024-dimensional vectors, L2-normalized for cosine similarity)
- `POST /rerank` — cross-encoder rerank for candidate lists
- `GET /health`, `GET /info`
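
Assuming "normalized" means the vectors returned by `/embed` are L2-normalized (unit length), cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch in pure Python, using toy 3-dimensional vectors in place of real 1024-dimensional embeddings; the helper names are illustrative and not part of the service:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Full cosine similarity, for vectors that are not yet normalized."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for 1024-dimensional embeddings.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])
similarity = dot(a, b)  # same value as cosine(a, b), since both are unit length
```

For vectors that are already unit length, skipping the norm computation saves work when scoring many candidates against one query.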

## Resource expectations

| Op | Cold | Warm | Notes |
|---|---|---|---|
| Embed 1 chunk (~400 tokens) | ~5 s (load) | 100-200 ms | first request loads model |
| Embed batch of 16 | — | 800-1500 ms | use during indexing |
| Rerank 100 candidates | — | 5-8 s | called per query post-recall |
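
The batch row suggests sending chunks to `/embed` in groups of 16 during indexing. A minimal sketch of the batching, plus a rough time estimate built from the warm figures in the table. `batched` and `estimate_seconds` are hypothetical helper names, and the estimate assumes batches are sent one after another with no network or queueing overhead:

```python
import math

BATCH_SIZE = 16                  # batch size from the table above
WARM_BATCH_SECONDS = (0.8, 1.5)  # warm latency range for one batch of 16

def batched(texts, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` texts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

def estimate_seconds(num_chunks, size=BATCH_SIZE, per_batch=WARM_BATCH_SECONDS):
    """Return a (low, high) estimate in seconds for sequential embedding.

    A partial final batch is counted as a full one, so the estimate
    leans slightly pessimistic.
    """
    batches = math.ceil(num_chunks / size)
    return batches * per_batch[0], batches * per_batch[1]

chunks = [f"chunk {i}" for i in range(40)]
# One JSON body per POST /embed call.
payloads = [{"texts": batch} for batch in batched(chunks)]
low, high = estimate_seconds(len(chunks))
```

With 40 chunks this produces three request bodies (16, 16, and 8 texts) and an estimate of roughly 2.4 to 4.5 seconds.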

## Add to disclosure-stack

```yaml
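# Service entry: add under `services:` in the disclosure-stack compose file.
# The `internal` network and the `hf-cache` named volume must also be
# declared under the top-level `networks:` and `volumes:` keys.
embed: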
  build: ../embed-service
  restart: unless-stopped
  networks: [internal]
  environment:
    DEVICE: cpu
    EMBED_MODEL: BAAI/bge-m3
    RERANK_MODEL: BAAI/bge-reranker-v2-m3
  volumes:
    - hf-cache:/cache
  deploy:
    resources:
      limits:
        memory: 3g
```

## Test locally

```bash
docker build -t embed-service .
docker run --rm -p 8000:8000 -v hf-cache:/cache embed-service
curl -s http://localhost:8000/health
curl -s -X POST http://localhost:8000/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts":["UAP avistado sobre Olathe, Kansas em 6 de janeiro de 1950"]}'
```
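
The same `/embed` call can be made from Python with only the standard library. This is a sketch, assuming the service is reachable at `http://localhost:8000` as in the commands above. `build_embed_request` and `embed` are hypothetical helper names, and the response is returned as parsed JSON without assuming any field names, since the response schema is not documented here:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the port mapping in `docker run`

def build_embed_request(texts, base_url=BASE_URL):
    """Build a POST /embed request with the same JSON body as the curl call."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(texts, base_url=BASE_URL, timeout=600):
    """Send the request and return the parsed JSON response unchanged.

    The timeout is an arbitrary, generous value: the first request may
    have to download and load the model weights.
    """
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not touch the network; call embed() to send it.
request = build_embed_request(
    ["UAP avistado sobre Olathe, Kansas em 6 de janeiro de 1950"]
)
```

With the container running, `embed(["some text"])` sends the request and returns whatever JSON the service responds with.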

Note: The first request downloads the model weights (~2.3 GB total); later requests
use the cached copy. Mount `hf-cache` as a named volume so the cache persists across container restarts.