# embed-service — BGE-M3 + BGE-Reranker-v2-M3 microservice

Self-hosted on the VPS, CPU-only. ~2.5 GB RAM in steady state. Powers hybrid retrieval over the chunks corpus.

## Endpoints

- `POST /embed`: batch dense embedding (1024 dimensions, normalized for cosine similarity)
- `POST /rerank`: cross-encoder reranking of candidate lists
- `GET /health`, `GET /info`

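Since the dense vectors are normalized for cosine similarity, comparing two embeddings reduces to a dot product. The following is a minimal sketch of that idea, assuming "normalized" means unit L2 norm. The helper names (`l2_normalize`, `dot`, `rank_by_similarity`) are illustrative and not part of the service, and the example vectors are toy stand-ins rather than real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit norm."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, chunk_vecs):
    """Return chunk indices sorted from most to least similar to the query."""
    scores = [dot(query_vec, v) for v in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
```

In a real pipeline the vectors would come from `/embed`, and the top results of this ranking would be the candidate list passed on to `/rerank`.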
## Resource expectations

| Op | Cold | Warm | Notes |
|---|---|---|---|
| Embed 1 chunk (~400 tokens) | ~5 s (model load) | 100-200 ms | First request loads the model |
| Embed batch of 16 | n/a | 800-1500 ms | Use during indexing |
| Rerank 100 candidates | n/a | 5-8 s | Called once per query, after recall |

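Taking the warm figures at face value, a batch of 16 costs roughly 50-94 ms per chunk (800/16 to 1500/16), against 100-200 ms for single-chunk requests, which is why batching is suggested for indexing. A minimal sketch of splitting a corpus into batches of 16 follows. The `{"texts": [...]}` body shape is taken from the curl example in this README; the function names are hypothetical.

```python
import json

BATCH_SIZE = 16  # matches the "Embed batch of 16" row in the table


def batched(items, size=BATCH_SIZE):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_payloads(chunks, size=BATCH_SIZE):
    """Build one JSON request body per batch for POST /embed."""
    return [
        json.dumps({"texts": batch}, ensure_ascii=False)
        for batch in batched(chunks, size)
    ]
```

For 40 chunks this produces three request bodies (16, 16, and 8 texts), so the final partial batch is still sent rather than dropped.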
## Add to disclosure-stack

```yaml
embed:
  build: ../embed-service
  restart: unless-stopped
  networks: [internal]
  environment:
    DEVICE: cpu
    EMBED_MODEL: BAAI/bge-m3
    RERANK_MODEL: BAAI/bge-reranker-v2-m3
  volumes:
    - hf-cache:/cache
  deploy:
    resources:
      limits:
        memory: 3g
```

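The three environment variables are the service's configuration surface. The service code itself is not shown here, so the following is only a sketch of how a service like this might read them; `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the values in the compose fragment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    device: str
    embed_model: str
    rerank_model: str


def load_settings(env=None):
    """Read DEVICE, EMBED_MODEL and RERANK_MODEL, falling back to the compose values."""
    env = os.environ if env is None else env
    return Settings(
        device=env.get("DEVICE", "cpu"),
        embed_model=env.get("EMBED_MODEL", "BAAI/bge-m3"),
        rerank_model=env.get("RERANK_MODEL", "BAAI/bge-reranker-v2-m3"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the process environment.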
## Test locally

```bash
# Build and run the service (docker run stays in the foreground)
docker build -t embed-service .
docker run --rm -p 8000:8000 -v hf-cache:/cache embed-service

# From a second terminal: health check, then a sample embed call
curl -s http://localhost:8000/health
curl -s -X POST http://localhost:8000/embed \
  -H 'Content-Type: application/json' \
  -d '{"texts":["UAP avistado sobre Olathe, Kansas em 6 de janeiro de 1950"]}'
```

Note: The first request downloads the model weights (~2.3 GB total); subsequent requests hit the cache. Mount `hf-cache` as a named volume so the download persists across container restarts.
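The same `/embed` call can be made from Python using only the standard library. This is a sketch, not an official client: `build_embed_request` and `embed` are hypothetical names, the base URL matches the `docker run` port mapping above, and the response is returned as decoded JSON without assuming any particular schema, since the response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the docker run example


def build_embed_request(texts, base_url=BASE_URL):
    """Build the POST /embed request with a {"texts": [...]} JSON body."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url=BASE_URL, timeout=60.0):
    """Send the request and return the parsed JSON response unchanged.

    The first call after a cold start may need a much larger timeout,
    because the model weights are downloaded and loaded on demand.
    """
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running container.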