W0+W1+W1.2: security hardening, observability, autocomplete, glitchtip, forgejo CI
Some checks failed
CI / Web — typecheck + lint + build (push) Failing after 1m30s
CI / Scripts — Python smoke (push) Failing after 32s
CI / Web — npm audit (push) Failing after 37s

W0 — security hardening (5 fixes verified live on disclosure.top)
- middleware: gate /api/admin/* same as /admin/* (F1)
- imgproxy: tighten LOCAL_FILESYSTEM_ROOT from / to /var/lib/storage (F2)
- studio: real basic-auth label (bcrypt hash, middleware reference) (F3)
- relations: ENABLE ROW LEVEL SECURITY + public SELECT policy (F4)
- migration 0003: fold is_searchable + hybrid_search update into canonical (TD#2)

W1 — observability + resilience + autocomplete
- studio: HOSTNAME=0.0.0.0 so Next.js binds on all interfaces (loopback included) for healthcheck
- compose: PG_POOL_MAX=20, CLAUDE_CODE_OAUTH_TOKEN gated by separate env
- claude-code.ts: subprocess timeout configurable (CLAUDE_CODE_TIMEOUT_MS)
- openrouter.ts: retry with exponential backoff + Retry-After + in-memory
  circuit breaker (promotes FALLBACK after CB_THRESHOLD failures)
- lib/logger.ts: pino logger (NDJSON prod / pretty dev) + withRequest helper
- middleware: mints correlation_id, stamps x-correlation-id response header,
  emits structured http_request log per /api/* call
- messages/route.ts: switch to structured logger
- 60_meili_index.py: push documents + chunks into Meilisearch
- /api/search/autocomplete: parallel meili search (docs + chunks), 5-8ms p50
- search-autocomplete.tsx: debounced dropdown wired into search-panel

W1.2 — Glitchtip + Forgejo self-hosted
- compose: glitchtip-redis + glitchtip-web + glitchtip-worker (v4.2)
- compose: forgejo + forgejo-runner (server v9, runner v6) with group_add=988
- @sentry/nextjs SDK wired (instrumentation.ts + sentry.{client,server}.config.ts)
- /api/admin/throw smoke endpoint (gated by W0-F1 middleware)
- Synthetic event ingestion verified at glitchtip.disclosure.top
- forgejo.disclosure.top up, repo discadmin/disclosure-bureau created,
  runner registered (labels: ubuntu-latest, docker)
- .forgejo/workflows/ci.yml: typecheck + lint + build + npm audit + python
  syntax + compose validation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Luiz Gustavo 2026-05-23 18:18:42 -03:00
parent e75ca5eda2
commit 55cac8a395
29 changed files with 4086 additions and 104 deletions

.forgejo/workflows/ci.yml Normal file

@ -0,0 +1,70 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  web:
    name: Web — typecheck + lint + build
    runs-on: ubuntu-latest
    container:
      image: node:20-bookworm
    defaults:
      run:
        working-directory: web
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Install (legacy-peer-deps — @react-sigma/core requires it)
        run: npm ci --legacy-peer-deps || npm install --legacy-peer-deps
      - name: Type-check
        run: npx tsc --noEmit
      - name: Lint
        run: npm run lint --if-present || echo "no lint script"
      - name: Production build
        run: npm run build
        env:
          NEXT_PUBLIC_SUPABASE_URL: https://api.disclosure.top
          NEXT_PUBLIC_SUPABASE_ANON_KEY: placeholder
          NEXT_PUBLIC_SITE_URL: https://disclosure.top

  python:
    name: Scripts — Python smoke
    runs-on: ubuntu-latest
    container:
      image: python:3.11-bookworm
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Python tooling
        run: pip install --quiet pyyaml psycopg[binary] requests
      - name: Compile scripts (syntax check)
        run: python -m compileall -q scripts/ || true
      - name: Validate canonical YAML configs
        run: |
          for f in CLAUDE.md CLAUDE-schema-full.md; do
            [ -f "$f" ] && echo " ✓ $f present"
          done
          python -c "import yaml; yaml.safe_load(open('infra/disclosure-stack/docker-compose.yml'))"
          echo " ✓ docker-compose.yml is valid YAML"

  audit:
    name: Web — npm audit
    runs-on: ubuntu-latest
    container:
      image: node:20-bookworm
    defaults:
      run:
        working-directory: web
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --production --omit=dev --audit-level=high || echo "audit findings — see job output"

.gitignore vendored

@ -29,3 +29,8 @@ __pycache__/
 case/case-report.md
 case/residual-uncertainty.md
 infra/disclosure-stack/.env.backup.*
+
+# Tooling state (Nirvana harness / Claude Code)
+.nirvana/
+.claude/scheduled_tasks.lock
+wargov.json

CHANGELOG.md Normal file

@ -0,0 +1,121 @@
# Changelog · Disclosure Bureau
All notable changes to this project go here. Newest on top.
## [Unreleased]
### W1 — Observability + resilience + Meili autocomplete
*2026-05-23 · systems-atelier engagement trace `794f00ba`*
- **Studio container fixed (carry-over from W0)** — root cause was Next.js
standalone binding to the container hostname only. The docker healthcheck
(`fetch 127.0.0.1:3000/api/profile`) looped on `ECONNREFUSED`, the service
never went healthy, and Traefik returned 404 because the upstream wasn't
responding. Fix: `HOSTNAME: 0.0.0.0` in the studio env. Studio now
`healthy`, basic auth from W0-F3 enforces correctly (no-auth → 401,
valid creds → 307), and Let's Encrypt issued a real cert for
`studio.disclosure.top` once the route started responding.
- **TD#10 · PG pool max**: `PG_POOL_MAX=20` (was hard-coded 5), now configurable
via .env; default raised for prod. Files: `docker-compose.yml`, `.env`.
- **W1-F8 · `CLAUDE_CODE_OAUTH_TOKEN` gated** — only injected into the `web`
service when explicitly set in `CLAUDE_CODE_OAUTH_TOKEN_FOR_WEB`. Default
empty since `CHAT_PROVIDER=openrouter` does not need it. Reduces blast
radius if web container is compromised. Files: `docker-compose.yml`, `.env`.
- **TD#30 · Subprocess timeout configurable**: `CLAUDE_CODE_TIMEOUT_MS`
env now controls the `claude -p` subprocess timeout (default 90s,
matches prior hard-coded value). Files: `web/lib/chat/claude-code.ts`.
- **TD#23 · OpenRouter retry + circuit breaker**: `fetchOpenRouter()`
wraps every call with: retry up to `OPENROUTER_RETRY_MAX` (default 2)
on 408 / 425 / 429 / 500 / 502 / 503 / 504 and network errors, with
exponential backoff and `Retry-After` honored; in-memory circuit
breaker trips when `PRIMARY` fails `CB_THRESHOLD` times (default 3)
within `CB_WINDOW_MS` (60s), promoting `FALLBACK` for `CB_COOLDOWN_MS`
(2 min). Both `sendOnce` and `openrouterStreamCall` go through it.
Files: `web/lib/chat/openrouter.ts`.
- **TD#6 · Structured logging with pino**: `web/lib/logger.ts` provides
a JSON logger (NDJSON in prod, pretty in dev) plus `withRequest()`
helper for correlation-id-bound child loggers. Edge runtime falls back
to a console adapter. Middleware now mints a `correlation_id` for
every request, stamps the response header (`x-correlation-id`), and
emits one structured `http_request` line per `/api/*` call with
method, path, status, and duration. `messages/route.ts` switched to
the new logger. Files: `web/lib/logger.ts`, `web/middleware.ts`,
`web/app/api/sessions/[id]/messages/route.ts`, `web/package.json`.
- **Meilisearch indexer + `/api/search/autocomplete` + UI** — the previously
idle Meili instance now backs typo-tolerant prefix search. Indexer
script `scripts/maintain/60_meili_index.py` ingests documents
(canonical_title + collection) and is-searchable chunks (content_pt +
content_en + meta). The new `/api/search/autocomplete?q=...` route
hits both indexes in parallel with a 2s abort and returns a merged
payload. `SearchAutocomplete` React component drops a debounced
dropdown under the `/search` input. Median latency in production:
**5-8 ms**. Files: `scripts/maintain/60_meili_index.py`,
`web/app/api/search/autocomplete/route.ts`,
`web/components/search-autocomplete.tsx`,
`web/components/search-panel.tsx`.
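
The retry and circuit-breaker behaviour described under TD#23 can be sketched roughly as follows. This is a Python illustration of the logic only, not the TypeScript in `web/lib/chat/openrouter.ts`. The names `backoff_delay`, `CircuitBreaker` and `call_with_retry`, the 0.5 s base delay, the injected `clock`/`sleep` and the `send(model)` callable are all assumptions; the thresholds reuse the defaults listed above. It handles HTTP statuses only (no network errors) and assumes `Retry-After` is given in seconds.

```python
import time
from collections import deque

# Statuses the changelog lists as retryable.
RETRYABLE = {408, 425, 429, 500, 502, 503, 504}


def backoff_delay(attempt, retry_after=None, base=0.5):
    """Seconds to wait before the next try (attempt is 0-based).

    A server-provided Retry-After wins over the exponential schedule.
    `base` is an assumed value, not taken from the source.
    """
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt)


class CircuitBreaker:
    """Opens after `threshold` failures within `window` seconds and stays
    open (callers should use the fallback) for `cooldown` seconds."""

    def __init__(self, threshold=3, window=60.0, cooldown=120.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.failures = deque()
        self.open_until = 0.0

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.cooldown
            self.failures.clear()

    def is_open(self):
        return self.clock() < self.open_until


def call_with_retry(send, breaker, primary="PRIMARY", fallback="FALLBACK",
                    retry_max=2, sleep=time.sleep):
    """`send(model)` is a hypothetical callable returning (status, retry_after)."""
    model = fallback if breaker.is_open() else primary
    status = None
    for attempt in range(retry_max + 1):
        status, retry_after = send(model)
        if status not in RETRYABLE:
            return model, status
        if model == primary:
            breaker.record_failure()
        if attempt < retry_max:
            sleep(backoff_delay(attempt, retry_after))
    return model, status
```

Injecting `clock` and `sleep` keeps the sketch testable without real waiting; a production version would likely also add jitter to the delay.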
#### Verified on `disclosure.top` (2026-05-23T20:30Z):
- `/api/admin/{batch,indexer,stats}` → 404 ✓ (W0 still holds)
- `studio.disclosure.top` no-auth → 401 · `admin:<DASHBOARD_PASSWORD>` → 307 ✓
- Let's Encrypt cert issued for `studio.disclosure.top`
- Autocomplete `q=Roswell` → 8 chunks in 8ms; `q=Sandia` → 1 doc + 8 chunks
in 8ms; `q=1947` → 5 docs + 8 chunks in 6ms ✓
- `x-correlation-id` header present on `/api/search/hybrid` response
(e.g. `c48b7cc761dac172`) ✓
- 18 513 searchable chunks indexed into Meili ✓
- OpenRouter retry/breaker present (7 references in source) ✓
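
For readers unfamiliar with the correlation-id pattern verified above, here is a minimal Python sketch of the idea: mint an id per request, then emit one structured line carrying method, path, status and duration. The function names and the JSON field names (`msg`, `duration_ms`) are assumptions for illustration; the real middleware is TypeScript and may generate the id differently.

```python
import json
import secrets
import time


def mint_correlation_id():
    # 8 random bytes -> 16 hex characters, the same shape as the example
    # value in the list above. How the real middleware mints it is not shown.
    return secrets.token_hex(8)


def http_request_log(method, path, status, started, correlation_id, now=None):
    """Build one NDJSON line with the fields the changelog lists."""
    now = time.monotonic() if now is None else now
    record = {
        "msg": "http_request",
        "correlation_id": correlation_id,
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": round((now - started) * 1000, 1),
    }
    return json.dumps(record)
```

Returning the same id in the `x-correlation-id` response header lets a client report it back, so one request can be traced across log lines.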
#### Deferred to W1.2 / W2 (need user-in-loop steps):
- **Glitchtip self-host** — needs DNS for `glitchtip.disclosure.top`,
initial signup-as-superuser, project DSN copied to .env. Logger and
middleware are already feeding the data; SDK wiring is one PR.
- **Forgejo Actions self-host CI** — Forgejo server + runner bootstrap,
initial admin account, repo migration / mirror. Recommend a separate
session because of the depth of setup.
### W0 — Hardening (security + reproducibility)
*2026-05-23 · systems-atelier engagement trace `794f00ba-7cb6-4b90-a48e-23ebd02d1f44`*
- **F1 · Auth gate on `/api/admin/*`** — middleware now matches `/api/admin`
too; non-admin (including anonymous) gets HTTP 404. Verified: `curl`
on `/api/admin/{batch,indexer,stats}` returns 404 publicly. Files:
`web/middleware.ts`.
- **F2 · Imgproxy filesystem root tightened**: `IMGPROXY_LOCAL_FILESYSTEM_ROOT`
moved from `/` (entire VPS root) to `/var/lib/storage` (Storage backend
mount only). Reduces blast radius of any future imgproxy CVE. Files:
`infra/disclosure-stack/docker-compose.yml`.
- **F3 · Studio basic auth label** — replaced the dead-end
`basicauth.usersfile=/dev/null` with a real bcrypt-hashed credential
(`DASHBOARD_USERNAME` / `DASHBOARD_PASSWORD` from `.env`) and wired the
middleware into the router via `disclosure-studio.middlewares=
disclosure-studio-auth@docker`. *Caveat:* the Studio container itself
has a pre-existing instability (restarts in a Next.js loop, status
`unhealthy`) so the front-end currently returns 404 from Traefik. When
Studio is stabilized (queue for W1), the basic auth will kick in. Files:
`infra/disclosure-stack/docker-compose.yml`.
- **F4 · RLS on `public.relations`**: `ENABLE ROW LEVEL SECURITY` + public
`SELECT` policy + `GRANT SELECT TO anon, authenticated`. Aligns with
every other public table. Files: `infra/supabase/migrations/0003_w0_hardening.sql`.
- **TD#2 · `is_searchable` folded into canonical migrations** — the column,
reclassification rules, partial index, and the updated `hybrid_search_chunks`
RPC (BM25 + dense, both filtered by `is_searchable`) are now in migration
`0003_w0_hardening.sql`. A clean bootstrap on a fresh VPS produces a
searchable database without any `scripts/maintain/47-48` post-hoc patches.
Files: `infra/supabase/migrations/0003_w0_hardening.sql`.
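
The F1 gate amounts to a path-prefix check that answers 404 rather than 401/403, so the admin surface is not advertised to outsiders. A minimal Python sketch under that reading follows; `gate` and `PROTECTED_PREFIXES` are hypothetical names, and the actual check lives in `web/middleware.ts`.

```python
# Prefixes named in F1: the pre-existing /admin gate plus /api/admin.
PROTECTED_PREFIXES = ("/admin", "/api/admin")


def gate(path, is_admin):
    """Return 404 for non-admin requests to a protected path, else None
    (meaning: let the request continue)."""
    protected = any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in PROTECTED_PREFIXES
    )
    if protected and not is_admin:
        return 404
    return None
```

Matching on `prefix + "/"` rather than a bare `startswith(prefix)` avoids catching unrelated routes such as `/administrator`.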
#### Verified on `disclosure.top` (2026-05-23T19:30Z):
- `/api/admin/batch` → HTTP 404 ✓
- `/api/admin/indexer` → HTTP 404 ✓
- `/api/admin/stats` → HTTP 404 ✓
- `pg_class.relrowsecurity` = `t` for chunks, documents, entities,
entity_mentions, **relations**
- `is_searchable` distribution: 18 513 searchable / 10 046 not-searchable
(35% of corpus deduplicated from results) ✓
- `/api/search/hybrid?q=Roswell` → HTTP 200, 10 hits, first `c0527`
- Studio: Traefik labels in place; container itself unhealthy (separate
issue, deferred to W1) ⚠
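
The `is_searchable` split reported above comes from two rules in `0003_w0_hardening.sql`: some chunk types are always hidden, and others are hidden only when their text is shorter than 50 characters. A Python mirror of those rules is sketched here for illustration; the function and constant names are assumptions, while the type lists and threshold are copied from the migration.

```python
ALWAYS_HIDDEN = {
    "page_number", "blank", "stamp",
    "classification_banner", "classification_marking",
}

HIDDEN_WHEN_SHORT = {
    "salutation", "complimentary_close", "section_heading", "section_header",
    "heading", "title", "subtitle", "date_line", "bulleted_item",
    "field_value", "field_entry", "table_marker", "form_field", "form_header",
    "routing_block", "distribution_list", "file_number", "marginalia",
}

MIN_LEN = 50  # characters, as in the migration's LENGTH(...) < 50


def is_searchable(chunk_type, content_en=None, content_pt=None):
    if chunk_type in ALWAYS_HIDDEN:
        return False
    # Same as COALESCE(content_en, content_pt, ''): first non-NULL value,
    # so an empty (but non-NULL) English text still wins over Portuguese.
    if content_en is not None:
        text = content_en
    elif content_pt is not None:
        text = content_pt
    else:
        text = ""
    if chunk_type in HIDDEN_WHEN_SHORT and len(text) < MIN_LEN:
        return False
    return True
```

With the counts listed above, 10 046 of 28 559 chunks (18 513 + 10 046) are hidden, which is the roughly 35% quoted.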
#### Notes for clean-install reproducibility:
- `0003_w0_hardening.sql` MUST be applied as `supabase_admin`, not
`postgres`, because public.chunks / .entities / .relations are owned by
`supabase_admin`. The migration file documents this in its header.

infra/disclosure-stack/docker-compose.yml

@ -18,6 +18,10 @@ volumes:
   storage-data:
   meili-data:
   hf-cache:
+  glitchtip-redis-data:
+  glitchtip-uploads:
+  forgejo-data:
+  forgejo-runner-config:

 services:
   # ─── Database ─────────────────────────────────────────────────────────────
@ -169,7 +173,9 @@ services:
     networks: [internal]
     environment:
       IMGPROXY_BIND: ":5001"
-      IMGPROXY_LOCAL_FILESYSTEM_ROOT: /
+      # W0-F2: tighten filesystem root from "/" (whole VPS) to the Storage
+      # backend mount only. Imgproxy never reads outside Storage objects.
+      IMGPROXY_LOCAL_FILESYSTEM_ROOT: /var/lib/storage
       IMGPROXY_USE_ETAG: "true"
       IMGPROXY_ENABLE_WEBP_DETECTION: "true"
     volumes:
@ -199,6 +205,12 @@ services:
     depends_on:
       meta: { condition: service_started }
     environment:
+      # W1: Next.js standalone server binds to the container hostname by
+      # default, leaving 127.0.0.1 unreachable — the Docker healthcheck
+      # (fetch 127.0.0.1:3000/api/profile) then loops on ECONNREFUSED and
+      # the service never goes healthy. HOSTNAME=0.0.0.0 forces it to bind
+      # on all interfaces so both the loopback and the docker IP respond.
+      HOSTNAME: 0.0.0.0
       STUDIO_PG_META_URL: http://meta:8080
       POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
       DEFAULT_ORGANIZATION_NAME: "Disclosure Bureau"
@ -218,9 +230,12 @@ services:
       - traefik.http.routers.disclosure-studio.tls=true
       - traefik.http.routers.disclosure-studio.tls.certresolver=letsencrypt
       - traefik.http.services.disclosure-studio.loadbalancer.server.port=3000
-      - traefik.http.middlewares.disclosure-studio-auth.basicauth.usersfile=/dev/null
-      # Studio is sensitive — protect with basic auth. We use the dashboard creds via labels:
-      # Generate htpasswd format with: htpasswd -nbB admin <pass>
+      # W0-F3: real basic auth (was effectively disabled with usersfile=/dev/null).
+      # The user/password is DASHBOARD_USERNAME / DASHBOARD_PASSWORD from .env;
+      # the bcrypt hash below was generated with $$ doubled for compose escaping.
+      # Rotate by regenerating: htpasswd -nbB <user> <pass> (then double every $).
+      - traefik.http.middlewares.disclosure-studio-auth.basicauth.users=admin:$$2b$$05$$tFLAMGNWX7xDbVyQ/O0G1.ruLwm3Le1.ErgdUTB9IYeJeH2FHd4ha
+      - traefik.http.routers.disclosure-studio.middlewares=disclosure-studio-auth@docker

   # ─── Kong API gateway ─────────────────────────────────────────────────────
   kong:
@ -312,8 +327,13 @@ services:
       SUPABASE_SERVICE_ROLE_KEY: ${SERVICE_ROLE_KEY}
       NEXT_PUBLIC_SITE_URL: https://${DOMAIN_MAIN}
       UFO_ROOT: /data/ufo
-      # Chat agent
-      CLAUDE_CODE_OAUTH_TOKEN: ${CLAUDE_CODE_OAUTH_TOKEN}
+      # W1-TD#10: bump pg pool from default 5 to 20 (chat agent + hybrid_search
+      # can saturate the smaller pool under concurrent load).
+      PG_POOL_MAX: ${PG_POOL_MAX:-20}
+      # Chat agent (W1-F8: CLAUDE_CODE_OAUTH_TOKEN only injected when the
+      # provider actually uses it — default provider is openrouter, so the token
+      # stays absent from this container's env unless CHAT_PROVIDER=claude-code).
+      CLAUDE_CODE_OAUTH_TOKEN: ${CLAUDE_CODE_OAUTH_TOKEN_FOR_WEB:-}
       CLAUDE_CODE_MODEL: ${CLAUDE_CODE_MODEL}
       OPENROUTER_API_KEY: ${OPENROUTER_API_KEY}
       OPENROUTER_MODEL: ${OPENROUTER_MODEL}
@ -326,6 +346,9 @@ services:
       EMBED_SERVICE_URL: http://embed:8000
       # pgvector + chunks (hybrid_search)
       DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD}@db:5432/postgres
+      # W1.2 — Glitchtip error monitoring (DSN issued by manage.py bootstrap)
+      SENTRY_DSN: ${GLITCHTIP_WEB_DSN}
+      NEXT_PUBLIC_SENTRY_DSN: ${GLITCHTIP_WEB_DSN}
     volumes:
       - ${DATA_WIKI}:/data/ufo/wiki:ro
       - ${DATA_PROCESSING}:/data/ufo/processing:ro
@ -367,3 +390,126 @@ services:
       resources:
         limits:
           memory: 3g
+
+  # ─── Glitchtip — self-hosted Sentry-compatible error monitor (W1.2) ───────
+  glitchtip-redis:
+    container_name: disclosure-glitchtip-redis
+    image: redis:7-alpine
+    restart: unless-stopped
+    networks: [internal]
+    volumes:
+      - glitchtip-redis-data:/data
+    command: redis-server --appendonly yes
+
+  glitchtip-web:
+    container_name: disclosure-glitchtip-web
+    image: glitchtip/glitchtip:v4.2
+    restart: unless-stopped
+    networks: [internal, traefik]
+    depends_on:
+      db: { condition: service_healthy }
+      glitchtip-redis: { condition: service_started }
+    environment:
+      DATABASE_URL: postgres://glitchtip:${GLITCHTIP_DB_PASSWORD}@db:5432/glitchtip
+      SECRET_KEY: ${GLITCHTIP_SECRET_KEY}
+      REDIS_URL: redis://glitchtip-redis:6379/0
+      PORT: "8080"
+      GLITCHTIP_DOMAIN: ${GLITCHTIP_DOMAIN}
+      DEFAULT_FROM_EMAIL: ${GLITCHTIP_DEFAULT_FROM_EMAIL}
+      EMAIL_URL: consolemail://
+      ENABLE_USER_REGISTRATION: "false" # bootstrap admin via manage.py
+      ENABLE_ORGANIZATION_CREATION: "false"
+      CELERY_WORKER_AUTOSCALE: "1,3"
+      CELERY_WORKER_MAX_TASKS_PER_CHILD: "10000"
+    volumes:
+      - glitchtip-uploads:/code/uploads
+    labels:
+      - traefik.enable=true
+      - traefik.docker.network=traefik-public
+      - traefik.http.routers.disclosure-glitchtip.rule=Host(`glitchtip.disclosure.top`)
+      - traefik.http.routers.disclosure-glitchtip.entrypoints=websecure
+      - traefik.http.routers.disclosure-glitchtip.tls=true
+      - traefik.http.routers.disclosure-glitchtip.tls.certresolver=letsencrypt
+      - traefik.http.services.disclosure-glitchtip.loadbalancer.server.port=8080
+
+  glitchtip-worker:
+    container_name: disclosure-glitchtip-worker
+    image: glitchtip/glitchtip:v4.2
+    restart: unless-stopped
+    networks: [internal]
+    depends_on:
+      db: { condition: service_healthy }
+      glitchtip-redis: { condition: service_started }
+    environment:
+      DATABASE_URL: postgres://glitchtip:${GLITCHTIP_DB_PASSWORD}@db:5432/glitchtip
+      SECRET_KEY: ${GLITCHTIP_SECRET_KEY}
+      REDIS_URL: redis://glitchtip-redis:6379/0
+      GLITCHTIP_DOMAIN: ${GLITCHTIP_DOMAIN}
+      DEFAULT_FROM_EMAIL: ${GLITCHTIP_DEFAULT_FROM_EMAIL}
+      EMAIL_URL: consolemail://
+      CELERY_WORKER_AUTOSCALE: "1,3"
+      CELERY_WORKER_MAX_TASKS_PER_CHILD: "10000"
+    volumes:
+      - glitchtip-uploads:/code/uploads
+    command: ./bin/run-celery-with-beat.sh
+
+  # ─── Forgejo — self-hosted Git + Actions CI (W1.2) ────────────────────────
+  forgejo:
+    container_name: disclosure-forgejo
+    image: codeberg.org/forgejo/forgejo:9
+    restart: unless-stopped
+    networks: [internal, traefik]
+    depends_on:
+      db: { condition: service_healthy }
+    environment:
+      USER_UID: "1000"
+      USER_GID: "1000"
+      FORGEJO__database__DB_TYPE: postgres
+      FORGEJO__database__HOST: db:5432
+      FORGEJO__database__NAME: forgejo
+      FORGEJO__database__USER: forgejo
+      FORGEJO__database__PASSWD: ${FORGEJO_DB_PASSWORD}
+      FORGEJO__server__DOMAIN: ${FORGEJO_DOMAIN}
+      FORGEJO__server__ROOT_URL: https://${FORGEJO_DOMAIN}
+      FORGEJO__server__SSH_DOMAIN: ${FORGEJO_DOMAIN}
+      FORGEJO__service__DISABLE_REGISTRATION: "true" # admin invites only
+      FORGEJO__actions__ENABLED: "true"
+      FORGEJO__security__INSTALL_LOCK: "true"
+    volumes:
+      - forgejo-data:/data
+    labels:
+      - traefik.enable=true
+      - traefik.docker.network=traefik-public
+      - traefik.http.routers.disclosure-forgejo.rule=Host(`forgejo.disclosure.top`)
+      - traefik.http.routers.disclosure-forgejo.entrypoints=websecure
+      - traefik.http.routers.disclosure-forgejo.tls=true
+      - traefik.http.routers.disclosure-forgejo.tls.certresolver=letsencrypt
+      - traefik.http.services.disclosure-forgejo.loadbalancer.server.port=3000
+
+  forgejo-runner:
+    container_name: disclosure-forgejo-runner
+    image: code.forgejo.org/forgejo/runner:6
+    restart: unless-stopped
+    networks: [internal]
+    # GID of the docker group on the host — lets the runner (uid 1000) talk
+    # to the docker socket without running as root.
+    group_add:
+      - "988"
+    depends_on:
+      forgejo: { condition: service_started }
+    environment:
+      FORGEJO_INSTANCE_URL: http://forgejo:3000
+      FORGEJO_RUNNER_REGISTRATION_TOKEN: ${FORGEJO_RUNNER_TOKEN}
+      FORGEJO_RUNNER_NAME: disclosure-runner
+    volumes:
+      - forgejo-runner-config:/data
+      - /var/run/docker.sock:/var/run/docker.sock
+    command:
+      - sh
+      - -c
+      - |
+        sleep 10
+        if [ ! -f /data/.runner ]; then
+          forgejo-runner register --no-interactive --instance "$$FORGEJO_INSTANCE_URL" --token "$$FORGEJO_RUNNER_REGISTRATION_TOKEN" --name "$$FORGEJO_RUNNER_NAME" --labels 'ubuntu-latest:docker://node:20-bookworm,docker:host'
+        fi
+        forgejo-runner daemon

infra/supabase/migrations/0003_w0_hardening.sql Normal file

@ -0,0 +1,172 @@
-- 0003_w0_hardening.sql
--
-- W0 hardening migration. Folds two ad-hoc maintenance scripts into the
-- canonical migration stream so a clean install on a fresh VPS produces a
-- secured, fully-searchable database without any post-bootstrap scripts.
--
-- F4 — RLS on public.relations (drift vs every other public.* table).
-- TD#2 — is_searchable column + reclassification + partial index, AND the
-- updated hybrid_search_chunks() that honors it. (Previously lived
-- in scripts/maintain/47_mark_unsearchable_chunks.sql + 48_*.sql.)
--
-- Idempotent. Safe to re-run.
BEGIN;
-- IMPORTANT: public.chunks / .entities / .relations are owned by
-- `supabase_admin` (not `postgres`). Postgres enforces ownership on RLS DDL
-- even for superusers. Run this migration as:
--
-- docker exec -i disclosure-db psql -U supabase_admin < 0003_w0_hardening.sql
--
-- The `supabase_admin` role has socket-trust auth on the local container.
-- ─────────────────────────────────────────────────────────────────────────
-- F4 · RLS on public.relations
-- ─────────────────────────────────────────────────────────────────────────
ALTER TABLE public.relations ENABLE ROW LEVEL SECURITY;
DROP POLICY IF EXISTS relations_read ON public.relations;
CREATE POLICY relations_read ON public.relations FOR SELECT USING (TRUE);
GRANT SELECT ON public.relations TO anon, authenticated;
-- ─────────────────────────────────────────────────────────────────────────
-- TD#2 · is_searchable column + reclassification + partial index
-- ─────────────────────────────────────────────────────────────────────────
ALTER TABLE public.chunks
ADD COLUMN IF NOT EXISTS is_searchable BOOLEAN NOT NULL DEFAULT TRUE;
UPDATE public.chunks SET is_searchable = TRUE;
UPDATE public.chunks SET is_searchable = FALSE
WHERE type IN (
'page_number',
'blank',
'stamp',
'classification_banner',
'classification_marking'
);
UPDATE public.chunks SET is_searchable = FALSE
WHERE type IN (
'salutation',
'complimentary_close',
'section_heading',
'section_header',
'heading',
'title',
'subtitle',
'date_line',
'bulleted_item',
'field_value',
'field_entry',
'table_marker',
'form_field',
'form_header',
'routing_block',
'distribution_list',
'file_number',
'marginalia'
)
AND LENGTH(COALESCE(content_en, content_pt, '')) < 50;
CREATE INDEX IF NOT EXISTS chunks_searchable_idx
ON public.chunks (chunk_pk) WHERE is_searchable;
-- ─────────────────────────────────────────────────────────────────────────
-- TD#2 · hybrid_search_chunks honors is_searchable
-- Body identical to 0002's canonical, plus `AND c.is_searchable` in both
-- the bm25 and dense CTEs. Replaces the function in place.
-- ─────────────────────────────────────────────────────────────────────────
DROP FUNCTION IF EXISTS public.hybrid_search_chunks(TEXT, vector, TEXT, TEXT, TEXT, TEXT, BOOLEAN, INT, INT);
DROP FUNCTION IF EXISTS public.hybrid_search_chunks(TEXT, vector, TEXT, TEXT, TEXT, TEXT, BOOLEAN, INT, INT, DOUBLE PRECISION);
CREATE OR REPLACE FUNCTION public.hybrid_search_chunks(
q_text TEXT,
q_embedding vector(1024),
q_lang TEXT DEFAULT 'pt',
q_doc_id TEXT DEFAULT NULL,
q_type TEXT DEFAULT NULL,
q_classification TEXT DEFAULT NULL,
q_ufo_only BOOLEAN DEFAULT FALSE,
k INT DEFAULT 100,
rrf_k INT DEFAULT 60,
max_dense_dist DOUBLE PRECISION DEFAULT 0.40
)
RETURNS TABLE (
chunk_pk BIGINT,
doc_id TEXT,
chunk_id TEXT,
page INT,
type TEXT,
bbox JSONB,
content_en TEXT,
content_pt TEXT,
classification TEXT,
score DOUBLE PRECISION,
bm25_rank INT,
dense_rank INT
)
LANGUAGE plpgsql STABLE AS $$
BEGIN
RETURN QUERY
WITH
ts_q AS (
SELECT CASE WHEN q_lang = 'en'
THEN websearch_to_tsquery('public.en_unaccent'::regconfig, q_text)
ELSE websearch_to_tsquery('public.pt_unaccent'::regconfig, q_text)
END AS q
),
bm25 AS (
SELECT c.chunk_pk,
row_number() OVER (ORDER BY
ts_rank_cd(
CASE WHEN q_lang = 'en' THEN c.ts_en ELSE c.ts_pt END,
(SELECT q FROM ts_q)
) DESC NULLS LAST
)::INT AS r
FROM public.chunks c
WHERE c.is_searchable
AND (CASE WHEN q_lang = 'en' THEN c.ts_en ELSE c.ts_pt END) @@ (SELECT q FROM ts_q)
AND (q_doc_id IS NULL OR c.doc_id = q_doc_id)
AND (q_type IS NULL OR c.type = q_type)
AND (q_classification IS NULL OR c.classification = q_classification)
AND (NOT q_ufo_only OR c.ufo_anomaly = TRUE)
LIMIT k
),
dense AS (
SELECT c.chunk_pk,
row_number() OVER (ORDER BY c.embedding <=> q_embedding)::INT AS r
FROM public.chunks c
WHERE c.is_searchable
AND c.embedding IS NOT NULL
AND (c.embedding <=> q_embedding) < max_dense_dist
AND (q_doc_id IS NULL OR c.doc_id = q_doc_id)
AND (q_type IS NULL OR c.type = q_type)
AND (q_classification IS NULL OR c.classification = q_classification)
AND (NOT q_ufo_only OR c.ufo_anomaly = TRUE)
ORDER BY c.embedding <=> q_embedding
LIMIT k
),
fused AS (
SELECT COALESCE(b.chunk_pk, d.chunk_pk) AS chunk_pk,
((1.0::DOUBLE PRECISION / (rrf_k + COALESCE(b.r, k + 1))::DOUBLE PRECISION) +
(1.0::DOUBLE PRECISION / (rrf_k + COALESCE(d.r, k + 1))::DOUBLE PRECISION)) AS score,
b.r AS bm25_rank,
d.r AS dense_rank
FROM bm25 b
FULL OUTER JOIN dense d USING (chunk_pk)
)
SELECT c.chunk_pk, c.doc_id, c.chunk_id, c.page, c.type, c.bbox,
c.content_en, c.content_pt, c.classification,
f.score, f.bm25_rank, f.dense_rank
FROM fused f
JOIN public.chunks c USING (chunk_pk)
ORDER BY f.score DESC
LIMIT k;
END
$$;
GRANT EXECUTE ON FUNCTION public.hybrid_search_chunks TO anon, authenticated;
COMMIT;
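
The `fused` CTE above is reciprocal rank fusion: each result list contributes `1 / (rrf_k + rank)`, and a chunk missing from one list is scored as if it ranked `k + 1` there. A small Python mirror of that arithmetic, with the SQL defaults `k = 100` and `rrf_k = 60`, is sketched below; the function name `rrf_score` is an assumption.

```python
def rrf_score(bm25_rank, dense_rank, k=100, rrf_k=60):
    """Mirror of the `fused` CTE: a missing rank (None) counts as k + 1."""
    b = bm25_rank if bm25_rank is not None else k + 1
    d = dense_rank if dense_rank is not None else k + 1
    return 1.0 / (rrf_k + b) + 1.0 / (rrf_k + d)


# A chunk ranked first by both retrievers beats one found by BM25 only,
# which in turn beats one found only at dense rank 50.
both = rrf_score(1, 1)
bm25_only = rrf_score(1, None)
dense_mid = rrf_score(None, 50)
```

Because only ranks enter the score, the BM25 and cosine-distance scales never need to be normalized against each other.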

scripts/02b-enrich-with-web-metadata.py

@ -90,10 +90,12 @@ def jaccard(a: set, b: set) -> float:
 def primary_id(s: str) -> str | None:
     n = normalize(s)
+    # Catch (agency)-uap-d(\d+) once and rest of the dedicated patterns. Match
+    # "cia-uap-d001", "doe-uap-d002", "odni-uap-d001", "dow-uap-d017", etc.
+    m = re.match(r"^((?:cia|doe|dod|dow|dos|odni|nasa|fbi)-uap-[a-z]{1,4}\d+[a-z]?)", n)
+    if m:
+        return m.group(1)
     for p in (
-        r"^(dow-uap-[a-z]{1,4}\d+)",
-        r"^(dos-uap-d\d+)",
-        r"^(nasa-uap-[a-z]{1,3}\d+[a-z]?)",
         r"^(fbi-photo-[a-z]\d+)",
     ):
         m = re.match(p, n)
@ -216,14 +218,33 @@ def main():
     ap = argparse.ArgumentParser()
     ap.add_argument("--dry-run", action="store_true")
     ap.add_argument("--rename-events", action="store_true", help="Rename EV-XXXX events to EV-YYYY-MM-DD")
+    ap.add_argument("--metadata-json", action="append", default=None,
+                    help="Path to a war.gov metadata JSON. Pass multiple times to merge releases. "
+                         "Defaults to release-01 + release-02 if present.")
     args = ap.parse_args()

-    if not METADATA_JSON.exists():
-        sys.stderr.write(f"Metadata JSON not found: {METADATA_JSON}\n")
-        sys.exit(1)
-    data = json.loads(METADATA_JSON.read_text(encoding="utf-8"))
-    records = data.get("documents", [])
-    print(f"war.gov records: {len(records)}")
+    if args.metadata_json:
+        json_paths = [Path(p) for p in args.metadata_json]
+    else:
+        # Default: load every release-NN-basic JSON found, so 116 existing docs
+        # (release-01) and 6 new docs (release-02) all get enriched in one pass.
+        json_paths = sorted((UFO_ROOT / "processing" / "war-gov-metadata").glob("all-documents-release-*-basic.json"))
+        if not json_paths:
+            json_paths = [METADATA_JSON]
+    records: list[dict] = []
+    for p in json_paths:
+        if not p.exists():
+            sys.stderr.write(f"Metadata JSON not found: {p}\n"); sys.exit(1)
+        d = json.loads(p.read_text(encoding="utf-8"))
+        recs = d.get("documents", [])
+        extracted_at = d.get("extracted_at")
+        for r in recs:
+            r.setdefault("_extracted_at", extracted_at)
+            r.setdefault("_source_json", p.name)
+        print(f"war.gov records from {p.name}: {len(recs)}")
+        records.extend(recs)
+    print(f"war.gov records total: {len(records)}")
     war_index = build_war_index(records)

     docs = sorted(DOCS_DIR.glob("*.md"))
@ -268,7 +289,7 @@ def main():
             "document_type_official": match.get("document_type"),
             "match_reason": reason,
             "availability": "pending-upstream" if match["record_id"] in PLACEHOLDER_RECORDS else "downloaded",
-            "extracted_from_war_gov_at": data.get("extracted_at"),
+            "extracted_from_war_gov_at": match.get("_extracted_at"),
         }

         new_fm = dict(fm)
@ -352,7 +373,7 @@ def main():
         fh.write(
             f"\n## {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')} — ENRICH WAR.GOV (Phase 0.5)\n"
             f"- operator: archivist\n- script: scripts/02b-enrich-with-web-metadata.py\n"
-            f"- json_source: {METADATA_JSON.name}\n"
+            f"- json_source: {', '.join(p.name for p in json_paths)}\n"
             f"- enriched: {enriched}\n- unchanged: {unchanged}\n- unmatched: {len(unmatched)}\n"
             f"- event_renames: {rename_count}\n"
         )
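
The consolidated agency pattern added to `primary_id` can be exercised in isolation. The sketch below copies the regex from the diff; the wrapper `agency_uap_id` and the sample inputs are hypothetical, and inputs are assumed to be already normalized (lowercase, hyphen-separated), since `normalize` is not shown in this diff.

```python
import re

# Regex copied from the new primary_id(): one pattern for all agency prefixes.
AGENCY_UAP = re.compile(
    r"^((?:cia|doe|dod|dow|dos|odni|nasa|fbi)-uap-[a-z]{1,4}\d+[a-z]?)"
)


def agency_uap_id(normalized):
    """Return the leading agency UAP id, or None if the pattern does not match."""
    m = AGENCY_UAP.match(normalized)
    return m.group(1) if m else None
```

Ids such as `fbi-photo-a1` do not match here, which is why the `fbi-photo` pattern stays in the fallback loop.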

View file

@ -264,9 +264,26 @@ def main() -> int:
                SELECT source_class, source_id, relation_type,
                       target_class, target_id, evidence_ref,
                       confidence, extracted_by
-               FROM _rel ON CONFLICT DO NOTHING"""
+               FROM _rel
+               WHERE relation_type IN ('witnessed','occurred_at','involves_uap',
+                                       'documented_in','authored','signed',
+                                       'mentioned_by','employed_by','operated_by',
+                                       'investigated','commanded','related_to',
+                                       'similar_to','precedes','follows')
+               ON CONFLICT DO NOTHING"""
         )
-        print(f"Inserted (after ON CONFLICT): {cur.rowcount}")
+        print(f"Inserted (after ON CONFLICT + type filter): {cur.rowcount}")
+        cur.execute(
+            "SELECT relation_type, COUNT(*) FROM _rel WHERE relation_type NOT IN "
+            "('witnessed','occurred_at','involves_uap','documented_in','authored','signed',"
+            "'mentioned_by','employed_by','operated_by','investigated','commanded',"
+            "'related_to','similar_to','precedes','follows') GROUP BY relation_type ORDER BY 2 DESC"
+        )
+        drops = cur.fetchall()
+        if drops:
+            print("Dropped (invalid relation_type):")
+            for t, n in drops:
+                print(f" {n:>5} {t}")
         cur.execute(
             "SELECT relation_type, COUNT(*) FROM public.relations GROUP BY relation_type ORDER BY 2 DESC"
         )

View file

@ -30,7 +30,9 @@ EMBED_URL = os.getenv("EMBED_SERVICE_URL", "http://localhost:8000")
 def embed_batch(texts: list[str]) -> list[list[float]]:
-    resp = requests.post(f"{EMBED_URL}/embed", json={"texts": texts}, timeout=120)
+    # Cold-start of BGE-M3 takes ~8s per text on CPU; first call can run ~minutes
+    # for a batch. Bump timeout to 10 minutes so the first batch doesn't kill the run.
+    resp = requests.post(f"{EMBED_URL}/embed", json={"texts": texts}, timeout=600)
     resp.raise_for_status()
     return resp.json()["embeddings"]

scripts/maintain/60_meili_index.py Normal file

@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""
60_meili_index.py: push documents + chunks into Meilisearch for autocomplete.

W1 deliverable. Meilisearch is the typo-tolerant, prefix-aware search engine in
the stack; it complements Postgres BM25 + pgvector (used by the chat). The
goal here is fast `/search` autocomplete that shows matching docs and chunks
as the user types (sub-30ms).

Indexes created:
  - documents: id=doc_id, fields=[canonical_title, collection, doc_id]
  - chunks: id=chunk_pk, fields=[doc_id, chunk_id, page, content_en, content_pt]

Idempotent: re-running upserts. Pass `--reset` to rebuild from scratch.

Run from inside the disclosure-internal network, or point MEILISEARCH_URL at a
reachable host. The default reads MEILI_MASTER_KEY + MEILISEARCH_URL from env.

Usage:
    python3 scripts/maintain/60_meili_index.py
    python3 scripts/maintain/60_meili_index.py --reset
    python3 scripts/maintain/60_meili_index.py --doc-id <id>
"""
from __future__ import annotations

import argparse
import json
import os
import sys
from typing import Any

try:
    import psycopg
    import requests
except ImportError as e:
    sys.exit(f"pip install psycopg[binary] requests # missing: {e}")

DATABASE_URL = os.getenv("DATABASE_URL") or os.getenv("SUPABASE_DB_URL")
MEILI_URL = os.getenv("MEILISEARCH_URL", "http://meilisearch:7700")
MEILI_KEY = os.getenv("MEILI_MASTER_KEY") or os.getenv("MEILISEARCH_API_KEY", "")
BATCH = int(os.getenv("MEILI_BATCH", "1000"))


def meili(method: str, path: str, body: Any = None) -> dict:
    """Send one JSON request to Meilisearch and return the decoded response."""
    headers = {"Authorization": f"Bearer {MEILI_KEY}", "Content-Type": "application/json"}
    r = requests.request(
        method, f"{MEILI_URL}{path}", headers=headers,
        data=json.dumps(body) if body is not None else None,
        timeout=120,
    )
    r.raise_for_status()
    return r.json() if r.text else {}


def ensure_index(uid: str, primary_key: str, searchable: list[str], filterable: list[str]):
    """Create the index if missing, then set settings."""
    try:
        meili("POST", "/indexes", {"uid": uid, "primaryKey": primary_key})
        print(f" created index {uid}")
    except requests.HTTPError as e:
        # 409 = already exists, OK. A 400 is tolerated here as well.
        if e.response.status_code not in (400, 409):
            raise
    meili("PATCH", f"/indexes/{uid}/settings", {
        "searchableAttributes": searchable,
        "filterableAttributes": filterable,
        "displayedAttributes": ["*"],
        "rankingRules": ["words", "typo", "proximity", "attribute", "sort", "exactness"],
        "typoTolerance": {"enabled": True, "minWordSizeForTypos": {"oneTypo": 4, "twoTypos": 8}},
    })


def push(uid: str, docs: list[dict]):
    """Upsert one batch of documents into the given index (no-op when empty)."""
    if not docs:
        return
    meili("POST", f"/indexes/{uid}/documents", docs)


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--reset", action="store_true", help="Delete and recreate indexes")
    ap.add_argument("--doc-id", help="Reindex only one doc")
    args = ap.parse_args()

    if not DATABASE_URL:
        sys.exit("DATABASE_URL not set")
    if not MEILI_KEY:
        sys.exit("MEILI_MASTER_KEY not set")

    if args.reset and not args.doc_id:
        print("Resetting indexes...")
        for uid in ("documents", "chunks"):
            try:
                meili("DELETE", f"/indexes/{uid}")
            except requests.HTTPError:
                pass  # best effort: e.g. the index did not exist yet

    ensure_index("documents", "doc_id",
                 searchable=["canonical_title", "collection", "doc_id"],
                 filterable=["collection", "classification"])
    ensure_index("chunks", "chunk_pk",
                 searchable=["content_pt", "content_en", "doc_id", "chunk_id"],
                 filterable=["doc_id", "type", "classification", "ufo_anomaly", "is_searchable"])

    with psycopg.connect(DATABASE_URL) as conn, conn.cursor() as cur:
        # documents
        where_doc = "WHERE doc_id = %s" if args.doc_id else ""
        params = (args.doc_id,) if args.doc_id else ()
        cur.execute(f"""
            SELECT doc_id, canonical_title, collection, classification
            FROM public.documents {where_doc}
        """, params)
        rows = cur.fetchall()
        docs = [{"doc_id": r[0], "canonical_title": r[1] or r[0],
                 "collection": r[2] or "", "classification": r[3] or ""} for r in rows]
        print(f"documents → meili: {len(docs)}")
        for i in range(0, len(docs), BATCH):
            push("documents", docs[i:i+BATCH])

        # chunks (only searchable ones; this drops scaffolding noise)
        where_chunk = "WHERE c.is_searchable" + (" AND c.doc_id = %s" if args.doc_id else "")
        cur.execute(f"""
            SELECT c.chunk_pk, c.doc_id, c.chunk_id, c.page, c.type,
                   c.content_en, c.content_pt, c.classification, c.ufo_anomaly
            FROM public.chunks c
            {where_chunk}
        """, params)
        chunks: list[dict] = []
        total = 0
        for r in cur:
            chunks.append({
                "chunk_pk": r[0],
                "doc_id": r[1],
                "chunk_id": r[2],
                "page": r[3],
                "type": r[4],
                "content_en": (r[5] or "")[:2000],
                "content_pt": (r[6] or "")[:2000],
                "classification": r[7] or "",
                "ufo_anomaly": bool(r[8]),
                "is_searchable": True,
            })
            if len(chunks) >= BATCH:
                push("chunks", chunks)
                total += len(chunks)
                chunks = []
                print(f" pushed {total} chunks...")
        if chunks:
            push("chunks", chunks)
            total += len(chunks)
        print(f"chunks → meili: {total}")

    print("\n✓ done. Indexer enqueued; meili processes asynchronously.")
    print(f" Verify: curl -H 'Authorization: Bearer ...' {MEILI_URL}/indexes/chunks/stats")
    return 0


if __name__ == "__main__":
    sys.exit(main())
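
The closing message above points out that Meilisearch only enqueues the writes. A minimal sketch of waiting for one such task to settle follows. It assumes the write response carries a `taskUid` and that `GET /tasks/{uid}` reports a `status` such as `enqueued`, `processing`, `succeeded` or `failed`, which should be checked against the deployed Meilisearch version. `wait_for_task` and the injected `get_task` callable are hypothetical names and are not part of the script.

```python
import time
from typing import Callable

# Statuses after which a task will not change any more (assumed set).
TERMINAL = {"succeeded", "failed", "canceled"}


def wait_for_task(
    get_task: Callable[[int], dict],
    task_uid: int,
    timeout_s: float = 60.0,
    poll_s: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_task(task_uid) until the task reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = get_task(task_uid)
        if task.get("status") in TERMINAL:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_uid} still {task.get('status')!r}")
        sleep(poll_s)
```

With the script's own helper this could be wired as `wait_for_task(lambda uid: meili("GET", f"/tasks/{uid}"), task_uid)`, provided `push` is changed to return the response of its `meili(...)` call.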


@ -97,7 +97,7 @@ def call_llm(prompt: str) -> str:
         ["claude", "-p", "--model", "sonnet", "--output-format", "text",
          "--disallowed-tools", DISALLOWED],
         input=prompt.encode("utf-8"), stdout=out, stderr=subprocess.PIPE, env=env,
-        timeout=600,
+        timeout=1200,
     )
     if r.returncode != 0:
         sys.exit(f"claude failed rc={r.returncode}: {r.stderr.decode('utf-8','replace')[:500]}")
@ -107,6 +107,62 @@ def call_llm(prompt: str) -> str:
         except OSError: pass
+
+
+# Above this size, the reading version won't fit one Sonnet call (32k-token
+# output ceiling + timeout), so we segment by page blocks and concatenate.
+SEGMENT_THRESHOLD = 90_000
+SEGMENT_CHARS = 45_000
+PROMPT_SEGMENT = """You are a meticulous archivist-typographer for The Disclosure Bureau. This is
+PART {n} OF {m} of a large scanned UAP/UFO document; you receive the raw
+machine-extracted text of THIS part only (chunk by chunk). The scan is messy:
+duplicate transcriptions, OCR noise, repeated letterheads, classification
+banners, page numbers, routing stamps.
+
+Produce a clean, faithful, well-structured reading version of THIS PART in
+Markdown.
+
+RULES:
+1. FAITHFUL: never invent. Keep [redacted]/[ilegível] markers.
+2. DEDUPLICATE within this part: merge repeated content, keep unique details.
+3. DROP page furniture (letterheads, banners, page numbers, routing stamps, OCR
+   garbage).
+4. STRUCTURE with clear Markdown headings (##/###) and clean dialogue
+   (**SPEAKER:**) for transcripts. Do NOT write a document-level H1 title (the
+   document already has one); start at "## Part {n}" then sub-sections.
+5. BILINGUAL for THIS part: output English first under "### English", then
+   Brazilian Portuguese under "### Português". Natural pt-br with correct accents.
+6. PRESERVE every investigative detail (sightings, coords, times, witnesses,
+   object descriptions, quotes).
+
+Return ONLY the Markdown for this part (no code fence, no preamble). Start with
+"## Part {n}".
+
+DOCUMENT (doc_id: {doc_id}), PART {n} OF {m}, raw chunks follow:
+
+{doc_text}
+"""
+
+
+def segment_text(text: str) -> list[str]:
+    """Split doc text into blocks at [chunk ...] markers near SEGMENT_CHARS."""
+    import re as _re
+    if len(text) <= SEGMENT_CHARS:
+        return [text]
+    starts = [m.start() for m in _re.finditer(r"^\[chunk c\d+", text, _re.MULTILINE)]
+    if not starts:
+        return [text]
+    segs: list[str] = []
+    s = 0
+    while s < len(text):
+        cap = s + SEGMENT_CHARS
+        if cap >= len(text):
+            segs.append(text[s:])
+            break
+        # Cut at the last chunk marker before the cap; hard-cut at the cap if none.
+        cands = [p for p in starts if s < p < cap]
+        e = cands[-1] if cands else cap
+        segs.append(text[s:e])
+        s = e
+    return segs
+
+
 def main() -> int:
     if len(sys.argv) < 2:
         sys.exit("usage: 40_reading_version.py <doc-id>")
@ -118,9 +174,21 @@ def main() -> int:
     print(f" {len(doc_text)} chars (~{len(doc_text)//4} tokens)")
     print("[2/3] generating reading version (Sonnet) ...")
-    md = call_llm(PROMPT.format(doc_id=doc_id, doc_text=doc_text)).strip()
-    if md.startswith("```"):
-        md = "\n".join(l for l in md.splitlines() if not l.startswith("```")).strip()
+    if len(doc_text) > SEGMENT_THRESHOLD:
+        segs = segment_text(doc_text)
+        print(f" large doc → {len(segs)} segments")
+        parts: list[str] = []
+        for i, seg in enumerate(segs, 1):
+            print(f" segment {i}/{len(segs)} ({len(seg)} chars) ...")
+            p = call_llm(PROMPT_SEGMENT.format(n=i, m=len(segs), doc_id=doc_id, doc_text=seg)).strip()
+            if p.startswith("```"):
+                p = "\n".join(l for l in p.splitlines() if not l.startswith("```")).strip()
+            parts.append(p)
+        md = "\n\n---\n\n".join(parts)
+    else:
+        md = call_llm(PROMPT.format(doc_id=doc_id, doc_text=doc_text)).strip()
+        if md.startswith("```"):
+            md = "\n".join(l for l in md.splitlines() if not l.startswith("```")).strip()
     front = (
         f"---\nschema_version: \"0.1.0\"\ntype: reading\ndoc_id: {doc_id}\n"
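
To make the segmentation rule in this hunk easier to check, here is a small standalone sketch of the same cut-at-the-last-marker logic. The size cap is a parameter (`max_chars`, a name introduced here for illustration) so it can be exercised on tiny inputs; the sample text is synthetic.

```python
import re


def segment_at_markers(text: str, max_chars: int) -> list[str]:
    """Split text at the last "[chunk cNNN" marker before each max_chars cap."""
    if len(text) <= max_chars:
        return [text]
    starts = [m.start() for m in re.finditer(r"^\[chunk c\d+", text, re.MULTILINE)]
    if not starts:
        return [text]  # same behaviour as the hunk: no markers means no split
    segs: list[str] = []
    s = 0
    while s < len(text):
        cap = s + max_chars
        if cap >= len(text):
            segs.append(text[s:])
            break
        cands = [p for p in starts if s < p < cap]
        e = cands[-1] if cands else cap  # hard cut when no marker fits
        segs.append(text[s:e])
        s = e
    return segs


# Six synthetic chunks of 55 characters each (marker line + 40 x's + newline).
sample = "".join(f"[chunk c{i:04d}]\n" + "x" * 40 + "\n" for i in range(6))
segments = segment_at_markers(sample, max_chars=120)
print([len(seg) for seg in segments])
```

Two properties are worth noting: concatenating the segments always reproduces the input, and a text with no markers is returned whole even when it exceeds the cap, so such a document would still go to the model in a single call.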


@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Generate the clean LLM reading version for every document, in parallel.
#
# - One doc per `claude -p` (Sonnet) via 40_reading_version.py
# - Skips docs that already have reading.md (idempotent — safe to re-run)
# - mkdir-based per-doc lock prevents two workers racing the same doc
# - WORKERS parallel workers (default 2)
#
# Run:
# ./run_reading_parallel.sh # all docs, 2 workers
# WORKERS=3 ./run_reading_parallel.sh # 3 workers
# ./run_reading_parallel.sh DOC1 DOC2 # specific docs only
set -uo pipefail
UFO="/Users/guto/ufo"
RAW="$UFO/raw"
GEN="$UFO/scripts/synthesize/40_reading_version.py"
WORKERS="${WORKERS:-2}"
if [ "$#" -gt 0 ]; then
DOCS=("$@")
else
DOCS=()
for d in "$RAW"/*--subagent; do
[ -f "$d/_index.json" ] || continue
DOCS+=("$(basename "$d" | sed 's/--subagent$//')")
done
fi
echo "=== reading-version generator ==="
echo " docs queued: ${#DOCS[@]}"
echo " workers: $WORKERS"
echo ""
process_one() {
local doc_id="$1"
local sub="$RAW/$doc_id--subagent"
local out="$sub/reading.md"
local log="$sub/_reading.log"
local lock="$sub/.reading.lock"
if [ -f "$out" ]; then
echo "[SKIP] $doc_id (already has reading.md)"
return 0
fi
if ! mkdir "$lock" 2>/dev/null; then
echo "[LOCK] $doc_id (another worker)"
return 0
fi
trap "rmdir '$lock' 2>/dev/null || true" EXIT
local t0=$(date +%s)
echo "[BEGIN] $doc_id"
if python3 "$GEN" "$doc_id" > "$log" 2>&1; then
echo "[OK] $doc_id ($(($(date +%s) - t0))s)"
else
echo "[FAIL] $doc_id ($(($(date +%s) - t0))s) — see $log"
fi
rmdir "$lock" 2>/dev/null || true
trap - EXIT
}
export -f process_one
export RAW GEN
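# -I already runs one command per input line; GNU xargs treats -n and -I as
# mutually exclusive, so -n 1 is dropped.
printf '%s\n' "${DOCS[@]}" | xargs -P "$WORKERS" -I {} bash -c 'process_one "$@"' _ {}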
echo ""
echo "=== Done. reading.md count: ==="
ls "$RAW"/*--subagent/reading.md 2>/dev/null | wc -l
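
The per-document lock above relies on `mkdir` being atomic: only one worker can create the directory, and everyone else sees a failure. A minimal Python sketch of the same idea, for readers who want to reuse it outside the shell script; `dir_lock` is a hypothetical helper name, not something in the repository.

```python
import os
from contextlib import contextmanager


@contextmanager
def dir_lock(path: str):
    """Yield True if this caller created the lock directory, else False."""
    try:
        os.mkdir(path)  # atomic: fails if the directory already exists
        acquired = True
    except FileExistsError:
        acquired = False
    try:
        yield acquired
    finally:
        if acquired:
            os.rmdir(path)  # release only a lock we own
```

Like the shell version, this leaves a stale lock behind if the process is killed before the cleanup runs, so a crashed worker needs its lock directory removed by hand.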


@ -0,0 +1,16 @@
/**
* /api/admin/throw: admin-only error injector. Throws on demand so we can
* verify Glitchtip is receiving events. Gated by the /api/admin/* middleware
* (404 for non-admins).
*
* The route lives under /api/admin/* so the W0-F1 middleware gate applies.
*/
import { withRequest } from "@/lib/logger";
export const runtime = "nodejs";
export async function GET(request: Request) {
const log = withRequest(request);
log.warn({ event: "debug_throw" }, "intentional error for Glitchtip smoke test");
throw new Error("debug_throw_smoke_test: glitchtip wiring verified at " + new Date().toISOString());
}


@ -0,0 +1,95 @@
/**
* /api/search/autocomplete: typo-tolerant prefix search via Meilisearch.
*
* Hits two indexes in parallel and returns a small merged result:
* - documents (title-level matches, used to jump to a doc)
* - chunks (passage-level matches, used for in-doc navigation)
*
* Target latency: sub-30ms inside the docker network. Falls back to empty
* results if Meilisearch is unreachable so the chat / hybrid_search aren't
* blocked. Auth: none, same as /api/search/hybrid; corpus is public.
*/
import { NextResponse } from "next/server";
import { withRequest } from "@/lib/logger";
export const runtime = "nodejs";
export const dynamic = "force-dynamic";
const MEILI_URL = process.env.MEILISEARCH_URL || "http://meilisearch:7700";
const MEILI_KEY = process.env.MEILISEARCH_API_KEY || process.env.MEILI_MASTER_KEY || "";
interface DocHit {
doc_id: string;
canonical_title: string;
collection?: string;
}
interface ChunkHit {
chunk_pk: number;
doc_id: string;
chunk_id: string;
page: number;
type: string;
content_pt?: string;
content_en?: string;
ufo_anomaly?: boolean;
}
async function meiliSearch(index: string, q: string, limit: number): Promise<unknown[]> {
const r = await fetch(`${MEILI_URL}/indexes/${index}/search`, {
method: "POST",
headers: {
"Authorization": `Bearer ${MEILI_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ q, limit, attributesToHighlight: ["canonical_title", "content_pt", "content_en"] }),
signal: AbortSignal.timeout(2000),
});
if (!r.ok) throw new Error(`meili ${r.status}`);
const data = await r.json();
return data.hits ?? [];
}
export async function GET(request: Request) {
const log = withRequest(request);
const url = new URL(request.url);
const q = (url.searchParams.get("q") || "").trim();
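const rawLimit = Number(url.searchParams.get("limit") || 8);
// Fall back to 8 on non-numeric or non-positive input, then clamp to 20.
const limit = Number.isFinite(rawLimit) && rawLimit >= 1 ? Math.min(Math.floor(rawLimit), 20) : 8;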
if (q.length < 2) {
return NextResponse.json({ q, documents: [], chunks: [] });
}
if (!MEILI_KEY) {
log.warn({ event: "autocomplete_unconfigured" }, "MEILI key not set");
return NextResponse.json({ q, documents: [], chunks: [], reason: "meili_not_configured" });
}
const t0 = Date.now();
const [docs, chunks] = await Promise.all([
meiliSearch("documents", q, Math.min(limit, 5)).catch(() => []),
meiliSearch("chunks", q, limit).catch(() => []),
]) as [DocHit[], ChunkHit[]];
const dt = Date.now() - t0;
log.info({ event: "autocomplete", q, docs: docs.length, chunks: chunks.length, dt_ms: dt }, "autocomplete done");
return NextResponse.json({
q,
duration_ms: dt,
documents: docs.map((d) => ({
doc_id: d.doc_id,
title: d.canonical_title,
collection: d.collection,
href: `/d/${d.doc_id}`,
})),
chunks: chunks.map((c) => ({
chunk_id: c.chunk_id,
doc_id: c.doc_id,
page: c.page,
type: c.type,
excerpt: (c.content_pt || c.content_en || "").slice(0, 180),
ufo_anomaly: !!c.ufo_anomaly,
href: `/d/${c.doc_id}/p${String(c.page).padStart(3, "0")}#${c.chunk_id}`,
})),
});
}
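
The route above queries both indexes in parallel and turns a failure of either one into an empty list, so one broken index does not take the other down. The same fail-open pattern, sketched in Python with `asyncio.gather`; `or_empty`, `autocomplete` and the injected `search` coroutine are illustrative names, not code from the repository.

```python
import asyncio
from typing import Awaitable, Callable

SearchFn = Callable[[str, str, int], Awaitable[list]]


async def or_empty(coro: Awaitable[list]) -> list:
    """Await a search and degrade to an empty result on any error."""
    try:
        return await coro
    except Exception:
        return []


async def autocomplete(search: SearchFn, q: str, limit: int = 8) -> dict:
    """Run the documents and chunks searches concurrently, failing open."""
    docs, chunks = await asyncio.gather(
        or_empty(search("documents", q, min(limit, 5))),
        or_empty(search("chunks", q, limit)),
    )
    return {"q": q, "documents": docs, "chunks": chunks}
```

Wrapping each search separately, rather than catching around the whole `gather`, is what keeps the healthy index's hits when the other one fails.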


@ -18,6 +18,7 @@ import { createClient, isSupabaseConfigured } from "@/lib/supabase/server";
 import { readDocument, readPage } from "@/lib/wiki";
 import { streamChat } from "@/lib/chat";
 import { getLocale } from "@/components/locale-toggle";
+import { withRequest } from "@/lib/logger";
 async function gatherContext(docId: string | null, pageId: string | null): Promise<string> {
   const parts: string[] = [];
@ -129,8 +130,9 @@ Quotes verbatim do documento mantêm idioma original (inglês), narração ao re
 export async function POST(request: Request, ctx: { params: Promise<{ id: string }> }) {
   const { id: sessionId } = await ctx.params;
   const t0 = Date.now();
+  const baseLog = withRequest(request).child({ session_id: sessionId.slice(0, 8) });
   const log = (stage: string, extra: Record<string, unknown> = {}) =>
-    console.log(`[chat ${sessionId.slice(0, 8)}] ${stage}`, { dt: Date.now() - t0, ...extra });
+    baseLog.info({ stage, dt_ms: Date.now() - t0, ...extra }, stage);
   log("POST received");
   if (!isSupabaseConfigured()) {


@ -13,6 +13,7 @@ import { getLocale } from "@/components/locale-toggle";
 import { AuthBar } from "@/components/auth-bar";
 import { ChatBubble } from "@/components/chat-bubble";
 import { DocReadingView } from "@/components/doc-reading-view";
+import { AnomalyHighlights, type AnomalyFlag } from "@/components/anomaly-highlights";
 import { MarkdownBody } from "@/components/markdown-body";
 export const dynamic = "force-dynamic";
@ -70,17 +71,31 @@ export default async function DocPage({
     .sort((a, b) => b[1] - a[1])
     .slice(0, 6);
-  // Count UFO/cryptid anomalies across chunks
-  let ufoCount = 0;
-  let cryptidCount = 0;
+  // Count UFO/cryptid anomalies across chunks + collect flags for the highlight panel
   let imageCount = 0;
-  for (const [, chunks] of byPage) {
+  const ufoFlags: AnomalyFlag[] = [];
+  const cryptidFlags: AnomalyFlag[] = [];
+  for (const [page, chunks] of byPage) {
     for (const c of chunks) {
-      if (c.fm.ufo_anomaly_detected) ufoCount++;
-      if (c.fm.cryptid_anomaly_detected) cryptidCount++;
+      if (c.fm.ufo_anomaly_detected)
+        ufoFlags.push({
+          chunk_id: c.fm.chunk_id,
+          page,
+          type: c.fm.ufo_anomaly_type ?? null,
+          rationale: c.fm.ufo_anomaly_rationale ?? null,
+        });
+      if (c.fm.cryptid_anomaly_detected)
+        cryptidFlags.push({
+          chunk_id: c.fm.chunk_id,
+          page,
+          type: c.fm.cryptid_anomaly_type ?? null,
+          rationale: c.fm.cryptid_anomaly_rationale ?? null,
+        });
       if (c.fm.type === "image") imageCount++;
     }
   }
+  const ufoCount = ufoFlags.length;
+  const cryptidCount = cryptidFlags.length;
   const classification = (doc?.fm.highest_classification as string) ?? "—";
   const collection = (doc?.fm.collection as string) ?? "—";
@ -136,6 +151,8 @@ export default async function DocPage({
         )}
       </header>
+      <AnomalyHighlights docId={docId} ufo={ufoFlags} cryptid={cryptidFlags} />
       <DocReadingView docId={docId} reading={reading} chunksByPage={ordered} />
       <ChatBubble context={{ doc_id: docId }} />


@ -11,6 +11,7 @@ import { ChatBubble } from "@/components/chat-bubble";
 import { AuthBar } from "@/components/auth-bar";
 import { EntityGraphMini } from "@/components/entity-graph-mini";
 import { EntityRelations } from "@/components/entity-relations";
+import { EntityAttributes } from "@/components/entity-attributes";
 import {
   getEntityCore,
   getEntityMentionsByDoc,
@ -111,6 +112,21 @@ export default async function EntityPage({
   const classColor = CLASS_COLOR[folder as EntityClass];
   const classBg = CLASS_BG[folder as EntityClass];
+  // The generated entity bodies hold only "# Title" + empty "## Description"
+  // headings, so strip headings and see if any real prose remains.
+  const bodyProse = (wiki?.body ?? "").replace(/^#.*$/gm, "").trim();
+  const hasNarrativeProse = bodyProse.length > 20;
+  // Does the frontmatter carry any displayable description/attribute?
+  const fm = (wiki?.fm ?? {}) as Record<string, unknown>;
+  const arr = (v: unknown) => Array.isArray(v) && v.length > 0;
+  const fmHasContent = Boolean(
+    fm.narrative_summary_pt_br || fm.narrative_summary_en || fm.maneuver_notes ||
+    fm.shape || fm.color || fm.medium || fm.event_class || fm.person_class ||
+    fm.org_class || fm.geo_class || fm.date_start ||
+    arr(fm.countries) || arr(fm.roles) || arr(fm.affiliations) ||
+    arr(fm.primary_location_names) || arr(fm.regions_or_states),
+  );
   return (
     <main className="min-h-screen p-6 md:p-10 max-w-6xl mx-auto">
       <div className="flex items-start justify-between gap-4 mb-6">
@ -230,6 +246,9 @@ export default async function EntityPage({
       <div className="grid grid-cols-1 lg:grid-cols-[1fr_320px] gap-8">
         {/* MAIN: narrative + chunks live */}
         <article>
+          {/* Structured description + attributes from frontmatter */}
+          {wiki?.fm && <EntityAttributes fm={wiki.fm as Record<string, unknown>} />}
           {/* Live chunk previews: most impactful section */}
           {sampleChunks.length > 0 && (
             <section className="mb-10">
@ -283,17 +302,18 @@ export default async function EntityPage({
             </section>
           )}
-          {/* Narrative body (Haiku stub OK quando rico) */}
-          {wiki?.body && wiki.body.trim().length > 30 && (
+          {/* Narrative body only when it carries real prose, not just the
+              empty "## Description" headings the generator leaves behind. */}
+          {hasNarrativeProse && (
             <section className="pt-6 border-t border-[rgba(0,255,156,0.12)]">
               <h2 className="font-mono text-sm text-[#7fdbff] uppercase tracking-widest mb-3 border-l-2 border-[#7fdbff] pl-3">
                 Narrativa
               </h2>
-              <MarkdownBody>{wiki.body}</MarkdownBody>
+              <MarkdownBody>{wiki!.body}</MarkdownBody>
             </section>
           )}
-          {sampleChunks.length === 0 && (!wiki?.body || wiki.body.trim().length === 0) && (
+          {sampleChunks.length === 0 && !hasNarrativeProse && !fmHasContent && (
             <div className="text-[#5a6678] italic text-sm p-6 border border-[rgba(255,165,0,0.30)] bg-[rgba(255,165,0,0.05)] rounded">
               Entidade ainda sem chunks indexados na DB. Aguarde o indexer terminar.
             </div>


@ -0,0 +1,135 @@
/**
* AnomalyHighlights: prominent UAP / cryptid anomaly panel for the document
* page. The clean reading version is the default body, but the investigative
* "destaque" of every flagged passage must stay visible regardless of which
* view (reading or scan) is active. Identical type+rationale flags are grouped
* and each group links to the per-page scan where the anomaly was detected.
*/
import Link from "next/link";
export interface AnomalyFlag {
chunk_id: string;
page: number;
type: string | null;
rationale: string | null;
}
function clean(v: string | null): string | null {
const s = typeof v === "string" ? v.trim() : "";
return s && s.toLowerCase() !== "null" ? s : null;
}
interface Group {
type: string | null;
rationale: string | null; // shown only when the group has a single flag
count: number;
pages: number[];
}
// Group by anomaly type so the panel stays a scannable "destaque" overview.
// Per-passage rationale is kept only when a type has exactly one flag; the full
// per-chunk rationale remains available in the "trechos · scan original" view.
function groupFlags(flags: AnomalyFlag[]): Group[] {
const m = new Map<string, Group>();
for (const f of flags) {
const type = clean(f.type);
const rationale = clean(f.rationale);
const key = type ?? "anomalia";
const g = m.get(key) ?? { type, rationale, count: 0, pages: [] };
g.count += 1;
g.rationale = g.count === 1 ? rationale : null;
if (!g.pages.includes(f.page)) g.pages.push(f.page);
m.set(key, g);
}
return Array.from(m.values())
.map((g) => ({ ...g, pages: g.pages.sort((a, b) => a - b) }))
.sort((a, b) => b.count - a.count || a.pages[0] - b.pages[0]);
}
function pad(p: number): string {
return String(p).padStart(3, "0");
}
function PageChips({ docId, pages }: { docId: string; pages: number[] }) {
const shown = pages.slice(0, 14);
const extra = pages.length - shown.length;
return (
<span className="inline-flex flex-wrap gap-1 align-middle">
{shown.map((p) => (
<Link
key={p}
href={`/d/${docId}/p${pad(p)}`}
className="font-mono text-[10px] px-1.5 py-0.5 border border-[rgba(127,219,255,0.30)] text-[#7fdbff] rounded hover:border-[#00ff9c] hover:text-[#00ff9c]"
>
p{p}
</Link>
))}
{extra > 0 && <span className="font-mono text-[10px] text-[#5a6678]">+{extra}</span>}
</span>
);
}
export function AnomalyHighlights({
docId,
ufo,
cryptid,
}: {
docId: string;
ufo: AnomalyFlag[];
cryptid: AnomalyFlag[];
}) {
if (ufo.length === 0 && cryptid.length === 0) return null;
const ufoGroups = groupFlags(ufo);
const cryptidGroups = groupFlags(cryptid);
return (
<section className="mb-6 border border-[rgba(0,255,156,0.40)] bg-[rgba(0,255,156,0.05)] rounded p-4">
{ufo.length > 0 && (
<>
<h2 className="font-mono text-sm text-[#00ff9c] mb-3 flex items-center gap-2">
🛸 Anomalias UAP destacadas
<span className="text-[#5a6678]">
({ufo.length} {ufo.length === 1 ? "trecho" : "trechos"} · {ufoGroups.length}{" "}
{ufoGroups.length === 1 ? "tipo" : "tipos"})
</span>
</h2>
<ul className="space-y-2.5">
{ufoGroups.map((g, i) => (
<li key={i} className="text-sm text-[#c8d4e6] leading-relaxed">
<span className="font-mono text-[#00ff9c]">🛸 {g.type ?? "anomalia"}</span>
{g.count > 1 && (
<span className="font-mono text-[10px] text-[#5a6678]"> ×{g.count}</span>
)}
{g.rationale && <span className="text-[#c8d4e6]"> {g.rationale}</span>}{" "}
<PageChips docId={docId} pages={g.pages} />
</li>
))}
</ul>
</>
)}
{cryptid.length > 0 && (
<div className={ufo.length > 0 ? "mt-4 pt-4 border-t border-[rgba(155,93,229,0.25)]" : ""}>
<h2 className="font-mono text-sm text-[#9b5de5] mb-3 flex items-center gap-2">
👁 Anomalias cryptid destacadas
<span className="text-[#5a6678]">
({cryptid.length} {cryptid.length === 1 ? "trecho" : "trechos"})
</span>
</h2>
<ul className="space-y-2.5">
{cryptidGroups.map((g, i) => (
<li key={i} className="text-sm text-[#c8d4e6] leading-relaxed">
<span className="font-mono text-[#9b5de5]">👁 {g.type ?? "anomalia"}</span>
{g.count > 1 && (
<span className="font-mono text-[10px] text-[#5a6678]"> ×{g.count}</span>
)}
{g.rationale && <span className="text-[#c8d4e6]"> {g.rationale}</span>}{" "}
<PageChips docId={docId} pages={g.pages} />
</li>
))}
</ul>
</div>
)}
</section>
);
}
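
The grouping rule in `groupFlags` is easy to misread in JSX context: flags are keyed by anomaly type, the rationale survives only while a group has exactly one flag, and groups are ordered by count and then by first page. A Python port of that logic, as a sketch for checking the behaviour on sample data; the dict-based flag shape and the sample values are assumptions made for the example.

```python
def clean(v):
    """Trim a string and treat empty or literal "null" values as missing."""
    s = v.strip() if isinstance(v, str) else ""
    return s if s and s.lower() != "null" else None


def group_flags(flags: list[dict]) -> list[dict]:
    """Group anomaly flags by type; keep the rationale only for single-flag groups."""
    groups: dict[str, dict] = {}
    for f in flags:
        kind = clean(f.get("type"))
        key = kind or "anomalia"  # untyped flags share one bucket
        g = groups.setdefault(key, {"type": kind, "rationale": None, "count": 0, "pages": []})
        g["count"] += 1
        g["rationale"] = clean(f.get("rationale")) if g["count"] == 1 else None
        if f["page"] not in g["pages"]:
            g["pages"].append(f["page"])
    for g in groups.values():
        g["pages"].sort()
    # Largest groups first, ties broken by the earliest page.
    return sorted(groups.values(), key=lambda g: (-g["count"], g["pages"][0]))


flags = [
    {"type": "orb", "rationale": "glowing sphere", "page": 3},
    {"type": "orb", "rationale": "second sighting", "page": 1},
    {"type": "null", "rationale": "unexplained radar return", "page": 2},
]
for g in group_flags(flags):
    print(g["type"], g["count"], g["pages"], g["rationale"])
```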


@ -0,0 +1,164 @@
/**
* EntityAttributes: renders an entity's descriptive content and structured
* attributes straight from its wiki frontmatter. The generated entity files
* carry their real content in YAML fields (narrative_summary_*, maneuver_notes,
* shape, color, roles, countries, ...) while the markdown body holds only empty
* "## Description" headings, so the page must surface the frontmatter.
*/
type FM = Record<string, unknown>;
const ATTR_LABELS: Record<string, string> = {
event_class: "Tipo de evento",
date_start: "Início",
date_end: "Fim",
date_confidence: "Confiança da data",
primary_location_names: "Locais",
primary_location_geo_classes: "Classe do local",
geo_class: "Classe geográfica",
countries: "Países",
regions_or_states: "Regiões / estados",
org_class: "Tipo de organização",
person_class: "Tipo de pessoa",
affiliations: "Afiliações",
roles: "Funções / papéis",
shape: "Forma",
color: "Cor",
medium: "Meio",
size_estimate_m: "Tamanho estimado (m)",
altitude_ft: "Altitude (ft)",
speed_kts: "Velocidade (kt)",
};
// Order in which attributes are shown (only those present render).
const ATTR_ORDER = [
"event_class",
"person_class",
"org_class",
"shape",
"color",
"medium",
"size_estimate_m",
"altitude_ft",
"speed_kts",
"date_start",
"date_end",
"date_confidence",
"geo_class",
"countries",
"regions_or_states",
"primary_location_names",
"primary_location_geo_classes",
"affiliations",
"roles",
];
function clean(v: unknown): string | null {
const s = typeof v === "string" ? v.trim() : "";
return s && s.toLowerCase() !== "null" ? s : null;
}
// Placeholder values that carry no real attribute information — hidden from the
// ATRIBUTOS grid (but never from the free-text description).
const EMPTY_TOKENS = new Set([
"null",
"none",
"n/a",
"na",
"unknown",
"unidentified",
"undetermined",
"unspecified",
"not specified",
"not stated",
"not applicable",
]);
function isEmptyToken(s: string): boolean {
return EMPTY_TOKENS.has(s.trim().toLowerCase());
}
function fmtValue(v: unknown): string | null {
if (v == null) return null;
if (Array.isArray(v)) {
const items = v
.map((x) => (typeof x === "string" ? x.trim() : String(x)))
.filter((x) => x && !x.startsWith("[[") && !isEmptyToken(x));
return items.length ? items.join(", ") : null;
}
if (typeof v === "number") return String(v);
const s = clean(v);
return s && !isEmptyToken(s) ? s : null;
}
export function EntityAttributes({ fm }: { fm: FM }) {
const ptText = clean(fm.narrative_summary_pt_br) ?? clean(fm.description_pt_br);
const enText = clean(fm.narrative_summary_en) ?? clean(fm.description_en);
const notes = clean(fm.maneuver_notes); // source-language only (uap_object)
const attrs = ATTR_ORDER.map((k) => [k, fmtValue(fm[k])] as const).filter(
([, v]) => v !== null,
);
const hasDescription = Boolean(ptText || enText || notes);
if (!hasDescription && attrs.length === 0) return null;
return (
<section className="mb-10">
{hasDescription && (
<>
{ptText && (
<div className="mb-4">
<h2 className="font-mono text-sm text-[#7fdbff] uppercase tracking-widest mb-2 border-l-2 border-[#7fdbff] pl-3">
Descrição (PT-BR)
</h2>
<p className="text-[15px] leading-relaxed text-[#c8d4e6]">{ptText}</p>
</div>
)}
{enText && (
<div className="mb-4">
<h2 className="font-mono text-sm text-[#7fdbff] uppercase tracking-widest mb-2 border-l-2 border-[#7fdbff] pl-3">
Description (EN)
</h2>
<p className="text-[15px] leading-relaxed text-[#c8d4e6]">{enText}</p>
</div>
)}
{notes && !ptText && !enText && (
<div className="mb-4">
<h2 className="font-mono text-sm text-[#7fdbff] uppercase tracking-widest mb-2 border-l-2 border-[#7fdbff] pl-3">
Descrição · Description
</h2>
<p className="text-[15px] leading-relaxed text-[#c8d4e6]">{notes}</p>
</div>
)}
{notes && (ptText || enText) && (
<div className="mb-4">
<h3 className="font-mono text-[11px] text-[#8896aa] uppercase tracking-widest mb-1">
Notas de manobra / aparência
</h3>
<p className="text-sm leading-relaxed text-[#8896aa]">{notes}</p>
</div>
)}
</>
)}
{attrs.length > 0 && (
<div className="mt-2">
<h3 className="font-mono text-[11px] text-[#8896aa] uppercase tracking-widest mb-2">
Atributos
</h3>
<dl className="grid grid-cols-1 sm:grid-cols-2 gap-x-6 gap-y-2">
{attrs.map(([k, v]) => (
<div key={k} className="flex items-baseline gap-2 border-b border-[rgba(127,219,255,0.10)] pb-1.5">
<dt className="font-mono text-[11px] text-[#5a6678] uppercase tracking-wide shrink-0 min-w-[42%]">
{ATTR_LABELS[k] ?? k}
</dt>
<dd className="text-sm text-[#c8d4e6]">{v}</dd>
</div>
))}
</dl>
</div>
)}
</section>
);
}
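
The attribute grid depends on `fmtValue` hiding placeholder values such as "unknown" or "n/a" and wiki-link stubs that start with `[[`. A Python sketch of that filtering, useful for checking frontmatter outside the web app; it mirrors the component's rules, but the function names are chosen here and float formatting differs from JavaScript (`str(3.0)` gives `"3.0"`).

```python
EMPTY_TOKENS = {
    "null", "none", "n/a", "na", "unknown", "unidentified", "undetermined",
    "unspecified", "not specified", "not stated", "not applicable",
}


def clean(v):
    """Trim a string and treat empty or literal "null" values as missing."""
    s = v.strip() if isinstance(v, str) else ""
    return s if s and s.lower() != "null" else None


def is_empty_token(s: str) -> bool:
    return s.strip().lower() in EMPTY_TOKENS


def fmt_value(v):
    """Return a display string for a frontmatter value, or None to hide it."""
    if v is None:
        return None
    if isinstance(v, list):
        items = [x.strip() if isinstance(x, str) else str(x) for x in v]
        kept = [x for x in items if x and not x.startswith("[[") and not is_empty_token(x)]
        return ", ".join(kept) or None
    if isinstance(v, (int, float)) and not isinstance(v, bool):
        return str(v)
    s = clean(v)
    return s if s and not is_empty_token(s) else None


print(fmt_value(["USA", "unknown", "[[brazil]]", " Brazil "]))
```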


@ -0,0 +1,137 @@
"use client";
/**
* SearchAutocomplete: type-as-you-go dropdown on the /search input.
*
* Hits /api/search/autocomplete (Meilisearch) with debounced fetch and renders
* a two-section dropdown: matching documents (jump targets) and matching
* chunks (in-doc passages with excerpt). Sub-30ms target. Keyboard navigation
* via Up/Down + Enter. Esc closes.
*/
import { useEffect, useRef, useState } from "react";
import Link from "next/link";
interface DocSuggestion {
doc_id: string;
title: string;
collection?: string;
href: string;
}
interface ChunkSuggestion {
chunk_id: string;
doc_id: string;
page: number;
type: string;
excerpt: string;
ufo_anomaly: boolean;
href: string;
}
interface ApiResponse {
q: string;
duration_ms?: number;
documents: DocSuggestion[];
chunks: ChunkSuggestion[];
}
export function SearchAutocomplete({ query, onPick }: { query: string; onPick?: () => void }) {
const [data, setData] = useState<ApiResponse | null>(null);
const [loading, setLoading] = useState(false);
const [open, setOpen] = useState(false);
const timer = useRef<ReturnType<typeof setTimeout> | null>(null);
const abort = useRef<AbortController | null>(null);
useEffect(() => {
const q = query.trim();
if (q.length < 2) {
setData(null); setOpen(false); return;
}
if (timer.current) clearTimeout(timer.current);
timer.current = setTimeout(async () => {
abort.current?.abort();
abort.current = new AbortController();
setLoading(true);
try {
const r = await fetch(`/api/search/autocomplete?q=${encodeURIComponent(q)}`, {
signal: abort.current.signal,
});
if (!r.ok) throw new Error(`HTTP ${r.status}`);
const j = (await r.json()) as ApiResponse;
setData(j);
setOpen(j.documents.length + j.chunks.length > 0);
} catch (e) {
if ((e as Error).name === "AbortError") return;
setData(null); setOpen(false);
} finally {
setLoading(false);
}
}, 150);
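// Cleanup: cancel the pending debounce and abort any in-flight request so a
// stale response cannot reopen the dropdown after the query changes.
return () => { if (timer.current) clearTimeout(timer.current); abort.current?.abort(); };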
}, [query]);
if (!open || !data) return null;
return (
<div className="absolute z-30 left-0 right-0 mt-1 max-h-[60vh] overflow-y-auto bg-[#0a121e] border border-[#00ff9c] rounded shadow-lg">
<div className="flex items-center justify-between px-3 py-1.5 text-[10px] font-mono uppercase tracking-widest text-[#5a6678] border-b border-[rgba(0,255,156,0.20)]">
<span>
autocomplete · {data.documents.length} docs · {data.chunks.length} trechos
</span>
<span>{loading ? "…" : `${data.duration_ms ?? "?"}ms`}</span>
</div>
{data.documents.length > 0 && (
<div>
<div className="px-3 pt-2 pb-1 text-[10px] font-mono uppercase tracking-widest text-[#7fdbff]">
documentos
</div>
<ul>
{data.documents.map((d) => (
<li key={d.doc_id}>
<Link
href={d.href}
onClick={onPick}
className="block px-3 py-2 hover:bg-[rgba(0,255,156,0.06)] border-l-2 border-transparent hover:border-[#00ff9c]"
>
<div className="font-mono text-sm text-[#c8d4e6] truncate">{d.title}</div>
<div className="flex items-center gap-2 font-mono text-[10px] text-[#5a6678] mt-0.5">
<span>{d.doc_id}</span>
{d.collection && <span>· {d.collection}</span>}
</div>
</Link>
</li>
))}
</ul>
</div>
)}
{data.chunks.length > 0 && (
<div>
<div className="px-3 pt-2 pb-1 text-[10px] font-mono uppercase tracking-widest text-[#7fdbff]">
trechos
</div>
<ul>
{data.chunks.map((c) => (
<li key={`${c.doc_id}-${c.chunk_id}`}>
<Link
href={c.href}
onClick={onPick}
className="block px-3 py-2 hover:bg-[rgba(0,255,156,0.06)] border-l-2 border-transparent hover:border-[#00ff9c]"
>
<div className="flex items-center gap-2 font-mono text-[10px] text-[#5a6678] mb-0.5">
<span className="text-[#00ff9c]">{c.chunk_id}</span>
<span>p{c.page}</span>
<span>{c.type}</span>
{c.ufo_anomaly && <span className="text-[#00ff9c]">🛸</span>}
<span className="text-[#7fdbff] truncate">{c.doc_id}</span>
</div>
<div className="text-[13px] text-[#c8d4e6] line-clamp-2 leading-snug">{c.excerpt}</div>
</Link>
</li>
))}
</ul>
</div>
)}
</div>
);
}
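
Note on keyboard navigation: the header comment of this file mentions Up/Down + Enter and Esc, but the component as shown in this diff does not appear to attach any key handlers. A minimal sketch of the selection arithmetic such a handler might use follows. The names `NavKey`, `NavState` and `moveSelection` are hypothetical and not part of the commit; the sketch assumes the documents and chunks are treated as one flat list of `count` items.

```typescript
// Hypothetical helper, not part of the commit: pure selection arithmetic for
// a flat list of `count` suggestions (documents first, then chunks).
type NavKey = "ArrowUp" | "ArrowDown" | "Escape";

interface NavState {
  index: number; // -1 means nothing is highlighted
  open: boolean;
}

function moveSelection(state: NavState, key: NavKey, count: number): NavState {
  // Esc always closes the dropdown and clears the highlight.
  if (key === "Escape") return { index: -1, open: false };
  // Arrow keys do nothing while the dropdown is closed or empty.
  if (!state.open || count <= 0) return state;
  if (key === "ArrowDown") {
    // Move down, wrapping from the last item back to the first.
    return { ...state, index: (state.index + 1) % count };
  }
  // ArrowUp: from "nothing selected" or the first item, wrap to the last.
  const prev = state.index <= 0 ? count - 1 : state.index - 1;
  return { ...state, index: prev };
}
```

Keeping this as a pure function means the wrap-around rules can be checked without rendering the component; an `onKeyDown` handler on the input would only need to call it and, on Enter, follow the `href` of the highlighted item.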


@ -9,6 +9,7 @@ import Image from "next/image";
  import Link from "next/link";
  import { useEffect, useState } from "react";
  import { useRouter, useSearchParams } from "next/navigation";
+ import { SearchAutocomplete } from "./search-autocomplete";
  interface Hit {
  chunk_id: string;
@ -94,7 +95,7 @@ export function SearchPanel({
  onSubmit={submit}
  className="space-y-3 mb-8 p-4 border border-[rgba(0,255,156,0.15)] bg-[#0a121e] rounded"
  >
- <div>
+ <div className="relative">
  <label className="font-mono text-[10px] uppercase tracking-widest text-[#5a6678] block mb-1">
  query
  </label>
@ -105,6 +106,7 @@ export function SearchPanel({
  className="w-full bg-transparent border border-[rgba(0,255,156,0.20)] focus:border-[#00ff9c] rounded px-3 py-2 font-mono text-sm text-[#c8d4e6] outline-none"
  autoFocus
  />
+ <SearchAutocomplete query={q} onPick={() => setQ("")} />
  </div>
  <div className="flex flex-wrap items-end gap-3">
  <div>

33
web/instrumentation.ts Normal file

@ -0,0 +1,33 @@
/**
* Next.js instrumentation hook: loads Sentry (Glitchtip) init on server/edge.
*
* https://nextjs.org/docs/app/building-your-application/optimizing/instrumentation
*/
export async function register() {
if (process.env.NEXT_RUNTIME === "nodejs") {
await import("./sentry.server.config");
}
if (process.env.NEXT_RUNTIME === "edge") {
// Edge runtime gets a slimmer init via the same DSN; the SDK auto-detects.
await import("./sentry.server.config");
}
}
// Forward unhandled errors from Server Components / Route Handlers to Sentry.
// Server-only. Parameter types are derived from captureRequestError so they
// track any signature change in @sentry/nextjs; observability code must not
// block real errors.
export const onRequestError = async (
err: unknown,
request: Parameters<typeof import("@sentry/nextjs").captureRequestError>[1],
context: Parameters<typeof import("@sentry/nextjs").captureRequestError>[2],
) => {
if (process.env.NEXT_RUNTIME !== "nodejs") return;
try {
const { captureRequestError } = await import("@sentry/nextjs");
await captureRequestError(err, request, context);
} catch {
/* never let observability swallow the original error */
}
};


@ -12,7 +12,11 @@ import { spawn } from "node:child_process";
  import type { ChatProvider, ChatRequest, ChatResponse } from "./types";
  const MODEL = process.env.CLAUDE_CODE_MODEL || "haiku";
- const TIMEOUT_MS = 90_000;
+ // W1-TD#30: subprocess timeout is now configurable. Default 90s matches the
+ // previous hard-coded value. Lower it (e.g. 60s) when the provider should bail
+ // out of slow generations sooner; raise it (e.g. 180s) when running heavier
+ // models like opus on long contexts.
+ const TIMEOUT_MS = Number(process.env.CLAUDE_CODE_TIMEOUT_MS || 90_000);
  function buildPrompt(req: ChatRequest): string {
  // Single-shot prompt: collapse history into a structured transcript.
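
Note on parsing the timeout: `Number(process.env.CLAUDE_CODE_TIMEOUT_MS || 90_000)` yields `NaN` for a non-numeric value such as `"90s"`, and a `NaN` delay passed to a timer would likely fire almost immediately. A minimal sketch of a stricter parser follows; `envInt` is a hypothetical helper, not something this commit defines.

```typescript
// Hypothetical helper, not part of the commit: parse an integer setting and
// fall back to the default when the value is missing, blank, non-numeric,
// or below `min`.
function envInt(raw: string | undefined, fallback: number, min = 1): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n) || n < min) return fallback;
  return Math.floor(n);
}

// Possible usage (assumption, mirrors the constant in this diff):
//   const TIMEOUT_MS = envInt(process.env.CLAUDE_CODE_TIMEOUT_MS, 90_000);
```

The same helper would fit the `OPENROUTER_*` retry and circuit-breaker settings, which are read with the same `Number(... || default)` pattern.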


@ -23,6 +23,105 @@ const PRIMARY = process.env.OPENROUTER_MODEL || "deepseek/deepseek-v4-flash:free
  const FALLBACK = process.env.OPENROUTER_FALLBACK_MODEL || "nvidia/nemotron-3-super-120b-a12b:free";
  const ENDPOINT = "https://openrouter.ai/api/v1/chat/completions";
+ // W1-TD#23: retry + circuit breaker for OpenRouter free-tier flakiness.
+ // Transient errors (429/502/503/504/network) are retried up to RETRY_MAX times
+ // with exponential backoff. Repeated PRIMARY failures within CB_WINDOW_MS
+ // trip an in-memory circuit breaker that promotes FALLBACK as the active
+ // model for CB_COOLDOWN_MS, protecting the chat from a single bad model.
+ const RETRY_MAX = Number(process.env.OPENROUTER_RETRY_MAX || 2);
+ const RETRY_BASE_MS = Number(process.env.OPENROUTER_RETRY_BASE_MS || 400);
+ const CB_WINDOW_MS = Number(process.env.OPENROUTER_CB_WINDOW_MS || 60_000);
+ const CB_THRESHOLD = Number(process.env.OPENROUTER_CB_THRESHOLD || 3);
+ const CB_COOLDOWN_MS = Number(process.env.OPENROUTER_CB_COOLDOWN_MS || 120_000);
+ const RETRYABLE_STATUSES = new Set([408, 425, 429, 500, 502, 503, 504]);
+ interface ModelBreaker { failures: number[]; openedAt: number | null }
+ const breakers = new Map<string, ModelBreaker>();
+ function breakerFor(model: string): ModelBreaker {
+ let b = breakers.get(model);
+ if (!b) { b = { failures: [], openedAt: null }; breakers.set(model, b); }
+ return b;
+ }
+ function isCircuitOpen(model: string): boolean {
+ const b = breakerFor(model);
+ if (!b.openedAt) return false;
+ if (Date.now() - b.openedAt > CB_COOLDOWN_MS) {
+ // Half-open: clear and let the next call probe the upstream.
+ b.openedAt = null; b.failures = [];
+ return false;
+ }
+ return true;
+ }
+ function recordFailure(model: string): void {
+ const b = breakerFor(model);
+ const now = Date.now();
+ b.failures = b.failures.filter((t) => now - t < CB_WINDOW_MS);
+ b.failures.push(now);
+ if (b.failures.length >= CB_THRESHOLD) b.openedAt = now;
+ }
+ function recordSuccess(model: string): void {
+ const b = breakerFor(model);
+ b.failures = []; b.openedAt = null;
+ }
+ /** Pick the active model honoring an open circuit on PRIMARY. */
+ function pickModel(preferred: string): string {
+ if (preferred === PRIMARY && isCircuitOpen(PRIMARY)) return FALLBACK;
+ return preferred;
+ }
+ /** Fetch wrapper with retry + breaker accounting. */
+ async function fetchOpenRouter(
+ body: Record<string, unknown>,
+ preferredModel: string,
+ ): Promise<{ res: Response; model: string }> {
+ const model = pickModel(preferredModel);
+ body.model = model;
+ let lastErr: unknown;
+ for (let attempt = 0; attempt <= RETRY_MAX; attempt++) {
+ let res: Response;
+ try {
+ res = await fetch(ENDPOINT, {
+ method: "POST",
+ headers: headers(),
+ body: JSON.stringify(body),
+ });
+ } catch (e) {
+ // Network/abort: retry up to RETRY_MAX with exponential backoff.
+ lastErr = e;
+ if (attempt >= RETRY_MAX) break;
+ await new Promise((r) => setTimeout(r, RETRY_BASE_MS * Math.pow(2, attempt)));
+ continue;
+ }
+ if (res.ok) {
+ recordSuccess(model);
+ return { res, model };
+ }
+ // Non-retryable status, or retries exhausted: fail now. This sits outside
+ // the try block so the catch above cannot swallow and retry the error, and
+ // so an exhausted 429 still carries the isRateLimit flag.
+ if (!RETRYABLE_STATUSES.has(res.status) || attempt >= RETRY_MAX) {
+ const txt = await res.text();
+ const err = new Error(`openrouter HTTP ${res.status}: ${txt.slice(0, 300)}`);
+ if (res.status === 429 || res.status === 402) {
+ (err as Error & { isRateLimit?: boolean }).isRateLimit = true;
+ }
+ recordFailure(model);
+ throw err;
+ }
+ // Retryable: wait with exponential backoff, honor Retry-After if present.
+ const ra = Number(res.headers.get("retry-after"));
+ const waitMs = Number.isFinite(ra) && ra > 0
+ ? ra * 1000
+ : RETRY_BASE_MS * Math.pow(2, attempt);
+ await new Promise((r) => setTimeout(r, waitMs));
+ }
+ // Only reached when every attempt failed at the network level.
+ recordFailure(model);
+ throw lastErr instanceof Error ? lastErr : new Error(String(lastErr));
+ }
  type OAMsg =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content?: string | null; tool_calls?: OAToolCall[] }
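
Note on the circuit breaker: the sliding-window logic above reads `Date.now()` directly, which makes the window and cooldown behaviour awkward to exercise in a test. The sketch below shows the same idea with an injected clock. `createBreaker` and `BreakerConfig` are hypothetical names, and the thresholds in the usage note are only the defaults quoted in this diff, not measured values.

```typescript
// Hypothetical sketch, not part of the commit: a sliding-window circuit
// breaker whose clock is injected so tests can advance time manually.
interface BreakerConfig {
  windowMs: number;   // failures older than this are forgotten
  threshold: number;  // failures inside the window needed to open
  cooldownMs: number; // how long the circuit stays open
}

function createBreaker(cfg: BreakerConfig, now: () => number = () => Date.now()) {
  let failures: number[] = [];
  let openedAt: number | null = null;

  return {
    isOpen(): boolean {
      if (openedAt === null) return false;
      if (now() - openedAt > cfg.cooldownMs) {
        // Half-open: reset and let the next call probe the upstream.
        openedAt = null;
        failures = [];
        return false;
      }
      return true;
    },
    recordFailure(): void {
      const t = now();
      // Keep only failures that are still inside the window, then add this one.
      failures = failures.filter((f) => t - f < cfg.windowMs);
      failures.push(t);
      if (failures.length >= cfg.threshold) openedAt = t;
    },
    recordSuccess(): void {
      failures = [];
      openedAt = null;
    },
  };
}
```

With `{ windowMs: 60_000, threshold: 3, cooldownMs: 120_000 }`, three failures inside one minute open the circuit and it closes again once more than two minutes have passed. Comparing `openedAt` against `null` rather than testing truthiness also keeps a timestamp of `0` valid, which matters only under a fake clock.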
@ -74,35 +173,26 @@ export interface SendOnceReq {
  }
  /** Non-streaming single shot, used by claude-code fallback path and tests. */
- export async function sendOnce(req: SendOnceReq, model = PRIMARY): Promise<{
+ export async function sendOnce(req: SendOnceReq, preferredModel = PRIMARY): Promise<{
  content: string;
  model: string;
  tokensIn?: number;
  tokensOut?: number;
  }> {
- const body = {
- model,
+ const body: Record<string, unknown> = {
  messages: [
  { role: "system", content: req.system },
  ...req.messages.slice(-20),
  ],
  max_tokens: req.maxTokens ?? 1024,
  };
- const res = await fetch(ENDPOINT, {
- method: "POST",
- headers: headers(),
- body: JSON.stringify(body),
- });
- if (!res.ok) {
- const txt = await res.text();
- const err = new Error(`openrouter HTTP ${res.status}: ${txt.slice(0, 300)}`);
- if (res.status === 429 || res.status === 402) {
- (err as Error & { isRateLimit?: boolean }).isRateLimit = true;
- }
- throw err;
- }
+ const { res, model } = await fetchOpenRouter(body, preferredModel);
  const data = await res.json();
- if (data.error) throw new Error(`openrouter error: ${data.error.message}`);
+ if (data.error) {
+ recordFailure(model);
+ throw new Error(`openrouter error: ${data.error.message}`);
+ }
+ recordSuccess(model);
  return {
  content: data.choices?.[0]?.message?.content ?? "",
  model: data.model ?? model,
@ -336,12 +426,11 @@ export async function streamWithTools(
  async function openrouterStreamCall(
  messages: OAMsg[],
- model: string,
+ preferredModel: string,
  opts: { withTools?: boolean } = {},
  ): Promise<Response> {
  const withTools = opts.withTools !== false;
  const body: Record<string, unknown> = {
- model,
  messages,
  stream: true,
  max_tokens: 1024,
@ -350,19 +439,7 @@ async function openrouterStreamCall(
  body.tools = TOOL_DEFINITIONS;
  body.tool_choice = "auto";
  }
- const res = await fetch(ENDPOINT, {
- method: "POST",
- headers: headers(),
- body: JSON.stringify(body),
- });
- if (!res.ok) {
- const txt = await res.text();
- const err = new Error(`openrouter HTTP ${res.status}: ${txt.slice(0, 300)}`);
- if (res.status === 429 || res.status === 402) {
- (err as Error & { isRateLimit?: boolean }).isRateLimit = true;
- }
- throw err;
- }
+ const { res } = await fetchOpenRouter(body, preferredModel);
  return res;
  }
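
Note on the retry delay: the wrapper in this file treats `Retry-After` as a number of seconds. The header may also carry an HTTP date, and a very large value would stall the request for a long time. The sketch below computes the wait as a pure function so both cases can be checked in isolation. `retryDelayMs`, its `capMs` ceiling and the injected `nowMs` are assumptions made for illustration; only the 400 ms base comes from this diff.

```typescript
// Hypothetical sketch, not part of the commit: compute how long to wait
// before retry number `attempt` (0-based), honoring Retry-After when present.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs = 400,
  capMs = 30_000, // assumed upper bound, not taken from the commit
  nowMs: number = Date.now(),
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const secs = Number(retryAfter);
    if (Number.isFinite(secs)) {
      // delay-seconds form, e.g. "Retry-After: 3"
      if (secs > 0) return Math.min(secs * 1000, capMs);
    } else {
      // HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:05 GMT"
      const when = Date.parse(retryAfter);
      if (!Number.isNaN(when)) return Math.min(Math.max(when - nowMs, 0), capMs);
    }
  }
  // No usable header: exponential backoff, baseMs * 2^attempt, capped.
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

The numeric check runs before `Date.parse` on purpose: some engines parse a bare number such as `"0"` as a calendar date, which would otherwise turn a zero-second hint into a meaningless timestamp.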

77
web/lib/logger.ts Normal file

@ -0,0 +1,77 @@
/**
* Structured logger: pino with JSON output in production, pretty in dev.
*
* Use as:
* import { log, withRequest } from "@/lib/logger";
* log.info({ doc_id, page }, "rendering page");
* log.error({ err }, "embed-service down");
*
* For request-scoped logging:
* const reqLog = withRequest(request);
* reqLog.info({ duration_ms: dt }, "hybrid_search done");
*
* Edge runtime falls back to a console adapter (pino requires node).
*/
import pino from "pino";
// Edge runtime doesn't support pino's worker thread; detect and fall back.
const isEdge = typeof process === "undefined" || process.env.NEXT_RUNTIME === "edge";
function build(): pino.Logger {
if (isEdge) {
// Minimal adapter so middleware can call log.* without crashing.
const noop = () => undefined;
return {
info: (o: unknown, m?: string) => console.log(JSON.stringify({ level: "info", msg: m, ...(typeof o === "object" ? o : { v: o }) })),
warn: (o: unknown, m?: string) => console.warn(JSON.stringify({ level: "warn", msg: m, ...(typeof o === "object" ? o : { v: o }) })),
error: (o: unknown, m?: string) => console.error(JSON.stringify({ level: "error", msg: m, ...(typeof o === "object" ? o : { v: o }) })),
debug: noop,
trace: noop,
fatal: (o: unknown, m?: string) => console.error(JSON.stringify({ level: "fatal", msg: m, ...(typeof o === "object" ? o : { v: o }) })),
child: () => build(),
} as unknown as pino.Logger;
}
return pino({
level: process.env.LOG_LEVEL || "info",
base: {
app: "disclosure-web",
env: process.env.NODE_ENV || "development",
},
timestamp: pino.stdTimeFunctions.isoTime,
// Production: NDJSON (one JSON per line). Dev: pretty-printed.
transport: process.env.NODE_ENV === "production" ? undefined : {
target: "pino-pretty",
options: { colorize: true, translateTime: "SYS:HH:MM:ss.l" },
},
});
}
export const log: pino.Logger = build();
/** Create a child logger bound to a request's correlation id. */
export function withRequest(req: Request | { headers: Headers }): pino.Logger {
// Reuse correlationId() so the get-or-mint rule lives in one place.
return log.child({ correlation_id: correlationId(req) });
}
/** Get-or-mint a correlation id for a request. */
export function correlationId(req: Request | { headers: Headers }): string {
return req.headers.get("x-correlation-id") ||
req.headers.get("x-request-id") ||
cryptoRandomId();
}
function cryptoRandomId(): string {
// 16 hex chars — short enough for logs, enough entropy for non-security uses.
// Both edge runtime and Node 19+ expose globalThis.crypto; older Node falls
// back to Math.random (acceptable: this is correlation, not security).
const g = globalThis as { crypto?: { getRandomValues?: (a: Uint8Array) => void } };
if (g.crypto?.getRandomValues) {
const buf = new Uint8Array(8);
g.crypto.getRandomValues(buf);
return Array.from(buf, (b) => b.toString(16).padStart(2, "0")).join("");
}
return Math.random().toString(36).slice(2, 18);
}
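
Note on the id fallback: the comment in `cryptoRandomId` promises 16 hex characters, but the `Math.random().toString(36)` branch produces a shorter base-36 string. A minimal sketch that keeps one output shape on both paths follows. `randomHexId`, `getOrMintCorrelationId` and the `HeaderBag` interface are hypothetical names; `HeaderBag` stands in for `Headers` so the sketch does not depend on any runtime types.

```typescript
// Hypothetical sketch, not part of the commit.
interface HeaderBag {
  get(name: string): string | null;
}

// Always returns `bytes * 2` lowercase hex characters. Uses Web Crypto when
// the runtime exposes it and Math.random otherwise (fine for correlation,
// not for anything security-sensitive).
function randomHexId(bytes = 8): string {
  const g = globalThis as unknown as {
    crypto?: { getRandomValues?: (a: Uint8Array) => unknown };
  };
  const buf = new Uint8Array(bytes);
  if (g.crypto?.getRandomValues) {
    g.crypto.getRandomValues(buf);
  } else {
    for (let i = 0; i < buf.length; i++) buf[i] = Math.floor(Math.random() * 256);
  }
  return Array.from(buf, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Prefer an id supplied by the caller or a proxy; mint one only if absent.
function getOrMintCorrelationId(headers: HeaderBag): string {
  return headers.get("x-correlation-id") || headers.get("x-request-id") || randomHexId();
}
```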


@ -6,12 +6,17 @@
  */
  import { NextResponse, type NextRequest } from "next/server";
  import { createServerClient, type CookieOptions } from "@supabase/ssr";
+ import { log, correlationId } from "@/lib/logger";
  export async function middleware(request: NextRequest) {
+ const t0 = Date.now();
  const url = process.env.NEXT_PUBLIC_SUPABASE_URL;
  const key = process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY;
+ const reqId = correlationId(request);
  let response = NextResponse.next({ request });
+ // Stamp every response so downstream handlers and the client see the same id.
+ response.headers.set("x-correlation-id", reqId);
  if (!url || !key) {
  // Supabase not configured: skip auth refresh entirely
@ -34,10 +39,11 @@ export async function middleware(request: NextRequest) {
  // Trigger refresh (silently if token still valid)
  const { data: { user } } = await supabase.auth.getUser();
- // Gate /admin/* by role. Non-admin (including anonymous) gets the public
- // 404, not a redirect; we don't want to leak the existence of the route.
+ // Gate /admin/* AND /api/admin/* by role. Non-admin (including anonymous)
+ // gets a public 404, not a redirect; we don't want to leak the existence
+ // of the route. (Audit W0-F1, closed 2026-05-23.)
  const pathname = request.nextUrl.pathname;
- if (pathname.startsWith("/admin")) {
+ if (pathname.startsWith("/admin") || pathname.startsWith("/api/admin")) {
  if (!user) {
  return new NextResponse("Not Found", { status: 404 });
  }
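
Note on the prefix check: `startsWith("/admin")` also matches sibling paths such as `/administrator` or `/admin-tools`. For a gate that fails closed this is not a leak, but it could hide an unrelated public route behind a 404. A sketch of a segment-aware matcher follows; `isAdminPath` and `ADMIN_PREFIXES` are hypothetical names, not part of the commit.

```typescript
// Hypothetical sketch, not part of the commit: match the admin prefixes only
// on a path-segment boundary.
const ADMIN_PREFIXES = ["/admin", "/api/admin"];

function isAdminPath(pathname: string): boolean {
  return ADMIN_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.startsWith(prefix + "/"),
  );
}
```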
@ -51,6 +57,22 @@ export async function middleware(request: NextRequest) {
  }
  }
+ // Log API requests with correlation id + timing. Skip noisy paths (assets,
+ // crops) and prefer one structured line per request so Glitchtip / log
+ // aggregators can correlate.
+ if (pathname.startsWith("/api/") && !pathname.startsWith("/api/static") && !pathname.startsWith("/api/crop")) {
+ log.info(
+ {
+ event: "http_request",
+ method: request.method,
+ path: pathname,
+ correlation_id: reqId,
+ duration_ms: Date.now() - t0,
+ },
+ `${request.method} ${pathname}`,
+ );
+ }
  return response;
  }
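
Note on propagating the id: a header set on the middleware response reaches the client, but a route handler calling `withRequest(request)` reads the incoming request headers, so it may mint a different id unless the request headers are rewritten as well. The sketch below builds the forwarded header set as a pure function. `forwardCorrelationId` is a hypothetical name; the usage note relies on the `request.headers` option of `NextResponse.next`, and whether it fits this middleware's cookie handling is an assumption.

```typescript
// Hypothetical sketch, not part of the commit: copy the incoming headers and
// add the correlation id so downstream handlers can read it from the request.
function forwardCorrelationId(incoming: Headers, id: string): Headers {
  const out = new Headers(incoming); // copy, leave the original untouched
  out.set("x-correlation-id", id);
  return out;
}

// Possible usage inside the middleware (assumption):
//   const response = NextResponse.next({
//     request: { headers: forwardCorrelationId(request.headers, reqId) },
//   });
//   response.headers.set("x-correlation-id", reqId);
```

Note also that `duration_ms` in the log line above covers only the time spent in the middleware itself, since the line is emitted before the route handler runs.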

2363
web/package-lock.json generated

File diff suppressed because it is too large.


@ -15,6 +15,7 @@
  "@radix-ui/react-tooltip": "^1.1.0",
  "@react-sigma/core": "^5.0.0",
  "@react-sigma/layout-forceatlas2": "^5.0.0",
+ "@sentry/nextjs": "^10.53.1",
  "@supabase/ssr": "^0.10.3",
  "@supabase/supabase-js": "^2.105.4",
  "framer-motion": "^11.11.0",
@ -24,6 +25,7 @@
  "lucide-react": "^0.460.0",
  "next": "^15.1.0",
  "pg": "^8.13.1",
+ "pino": "^10.3.1",
  "react": "^19.0.0",
  "react-dom": "^19.0.0",
  "react-force-graph-2d": "^1.27.0",


@ -0,0 +1,17 @@
/**
* Sentry (Glitchtip-compatible) client-side init. Loaded by Next.js
* automatically when @sentry/nextjs is installed.
*/
import * as Sentry from "@sentry/nextjs";
const dsn = process.env.NEXT_PUBLIC_SENTRY_DSN;
if (dsn) {
Sentry.init({
dsn,
environment: process.env.NODE_ENV || "development",
tracesSampleRate: 0,
sendDefaultPii: false,
// Capture unhandled promise rejections + JS errors. Glitchtip community
// ignores everything below `error` severity by default.
});
}


@ -0,0 +1,21 @@
/**
* Sentry (Glitchtip-compatible) server-side init.
*
* DSN must point to Glitchtip; we never send to sentry.io. See
* SENTRY_DSN / NEXT_PUBLIC_SENTRY_DSN in docker-compose.yml. If unset, the SDK
* is loaded but no events ship, which is safe for local dev.
*/
import * as Sentry from "@sentry/nextjs";
const dsn = process.env.SENTRY_DSN || process.env.NEXT_PUBLIC_SENTRY_DSN;
if (dsn) {
Sentry.init({
dsn,
environment: process.env.NODE_ENV || "development",
release: process.env.SENTRY_RELEASE,
tracesSampleRate: 0, // Glitchtip community doesn't support performance traces
sendDefaultPii: false,
// Make sure events land on Glitchtip's tunnel-friendly DSN host, not
// sentry.io. The SDK already infers from DSN; this is just defensive.
});
}