Commit graph

2 commits

Author SHA1 Message Date
Luiz Gustavo
4d4c02a8e1 W3.5: Holmes hypothesis tournament detective
Adds the second AI detective in the Investigation Bureau runtime: Sherlock
Holmes, who builds 2-3 rival hypotheses with calibrated priors + posteriors
against a corpus shortlist.

Pipeline:
  1. hybridSearch() grounds Holmes with 8-15 chunks via the same
     hybrid_search_chunks RPC the web uses (BM25 + dense + RRF). Default
     max_dense_dist=0.55 (runtime favors recall over precision; web's
     /api/search/hybrid stays at 0.40 for chat).
  2. claude-sonnet-4-6 emits a strict JSON array with position +
     argument_for + argument_against + prior + posterior + confidence_band
     + evidence_refs. Citations use [[doc-id/pNNN#cNNNN]] wiki-links.
  3. writeHypothesis() validates posterior ∈ [0,1], auto-corrects the
     Tetlock band from the posterior (high ≥0.90, medium 0.60-0.89,
     low 0.30-0.59, speculation <0.30), checks evidence_refs FK against
     public.evidence, INSERTs into public.hypotheses + writes
     case/hypotheses/H-NNNN.md.
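
The band auto-correction in step 3 can be sketched as below. This is a minimal illustration, not the actual writeHypothesis() code: the names `bandFromPosterior` and `correctBand` are hypothetical, and treating the listed ranges as half-open intervals (so 0.595 counts as low) is an assumption.

```typescript
type Band = "high" | "medium" | "low" | "speculation";

// Thresholds come from the commit message; the half-open reading
// of the ranges (>= lower bound) is an assumption.
function bandFromPosterior(posterior: number): Band {
  if (!Number.isFinite(posterior) || posterior < 0 || posterior > 1) {
    throw new RangeError(`posterior must be in [0, 1], got ${posterior}`);
  }
  if (posterior >= 0.9) return "high";
  if (posterior >= 0.6) return "medium";
  if (posterior >= 0.3) return "low";
  return "speculation";
}

// Hypothetical helper: derive the band from the posterior and report
// whether the band claimed by the model had to be overridden.
function correctBand(
  posterior: number,
  claimed: string,
): { band: Band; corrected: boolean } {
  const band = bandFromPosterior(posterior);
  return { band, corrected: band !== claimed };
}
```

Deriving the band from the posterior, rather than trusting the model's label, keeps the two fields from drifting apart.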

Discipline guarantees (prompts/holmes.md):
  - posteriors across rivals sum to ≈1.0
  - no claim without chunk citation
  - prefer lower band when ambiguous (anti-inflation)
  - declarative one-sentence position, no hedging
  - emit `NO_HYPOTHESES` when corpus is silent (refuses to fabricate)
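
A runtime could spot-check some of these guarantees after the model replies. The sketch below is an assumption-laden illustration: the 0.05 tolerance, the function names, and the citation regex (which accepts any digit count after `p` and `c`) are hypothetical and not taken from prompts/holmes.md.

```typescript
const NO_HYPOTHESES = "NO_HYPOTHESES";

// Matches wiki-link citations of the form [[doc-id/pNNN#cNNNN]].
// The exact digit counts are not specified here, so \d+ is an assumption.
const CITATION = /\[\[[^\[\]\/#]+\/p\d+#c\d+\]\]/;

// True when the posteriors of all rivals sum to 1 within a tolerance.
// The default tolerance is a hypothetical choice.
function posteriorsSumToOne(posteriors: number[], tolerance = 0.05): boolean {
  const total = posteriors.reduce((acc, p) => acc + p, 0);
  return Math.abs(total - 1) <= tolerance;
}

// True when the text carries at least one chunk citation.
function hasCitation(text: string): boolean {
  return CITATION.test(text);
}

// True when the model declined to hypothesize because the corpus is silent.
function isSilentCorpusReply(raw: string): boolean {
  return raw.trim() === NO_HYPOTHESES;
}
```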

Smoke test (Sandia green fireballs 1948-49):
  - H-0001 prior 0.5 → posterior 0.2 (speculation): natural meteoric
  - H-0002 prior 0.3 → posterior 0.4 (low): classified weapons / tests
  - H-0003 prior 0.2 → posterior 0.4 (low): genuinely unidentified
  Bayesian update visible: "natural meteoric" fell 60% relative to its
  prior (0.5 → 0.2); both rivals climbed. 4 unique chunk citations across
  the 3 hypotheses.
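
The arithmetic behind the smoke-test update can be reproduced with a few lines. The prior and posterior values are copied from the list above; the `Update` type and `relativeChange` helper are hypothetical names used only for illustration.

```typescript
interface Update {
  id: string;
  prior: number;
  posterior: number;
}

// Values copied from the smoke test in this commit message.
const updates: Update[] = [
  { id: "H-0001", prior: 0.5, posterior: 0.2 },
  { id: "H-0002", prior: 0.3, posterior: 0.4 },
  { id: "H-0003", prior: 0.2, posterior: 0.4 },
];

// Relative change of the posterior with respect to the prior.
function relativeChange(u: Update): number {
  return (u.posterior - u.prior) / u.prior;
}

for (const u of updates) {
  const pct = (relativeChange(u) * 100).toFixed(0);
  console.log(`${u.id}: ${u.prior} -> ${u.posterior} (${pct}% relative)`);
}
```

For H-0001 the relative change is (0.2 - 0.5) / 0.5 = -0.6, which is the 60% drop; in absolute terms it is a 0.3 drop.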

The orchestrator dispatches jobs of kind `hypothesis_tournament` via
runHolmes; a job is marked `failed` if all rivals error, `complete` otherwise.
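
The status rule can be sketched as a small pure function. The `RivalResult` shape and `jobStatus` name are hypothetical, and treating an empty result list as `failed` is an assumption; the real orchestrator may handle that case differently.

```typescript
type RivalResult =
  | { ok: true; hypothesisId: string }
  | { ok: false; error: string };

type JobStatus = "complete" | "failed";

// A job fails only when every rival errored; one success is enough
// to mark it complete. An empty list counts as failed (assumption).
function jobStatus(results: RivalResult[]): JobStatus {
  return results.some((r) => r.ok) ? "complete" : "failed";
}

// Collect the error messages so the failure reason can be surfaced.
function collectErrors(results: RivalResult[]): string[] {
  const errors: string[] = [];
  for (const r of results) {
    if (!r.ok) errors.push(r.error);
  }
  return errors;
}
```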

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 21:19:43 -03:00
Luiz Gustavo
189a771cbe W3.1-W3.4: Investigation Bureau foundation — migrations, runtime, Locard
Migrations:
- 0004_investigation_bureau.sql: 7 new tables (investigation_jobs + evidence,
  hypotheses, contradictions, witnesses, gaps, residual_uncertainties), id
  sequences, pg_notify trigger on investigation_jobs, RLS read-only public,
  investigator role with least-privilege grants (no service_role).
- 0005_investigator_write_policies.sql: fixup adding RLS INSERT/UPDATE
  policies bound to investigator + service_role + postgres (RLS with only a
  SELECT policy was silently blocking the worker's claim UPDATE).

investigator-runtime/ (new Bun + TS container):
- src/main.ts: LISTEN/NOTIFY poller, claim-with-SKIP-LOCKED, drain pool,
  healthcheck file, graceful SIGTERM shutdown.
- src/orchestrator.ts: chief-detective dispatch (evidence_chain → Locard).
  Marks job failed when all per-item outputs error; surfaces first errors.
- src/lib/{env,pg,audit,ids,claude}.ts: typed config (gate #8), pool +
  dedicated LISTEN client, NDJSON audit, sequence allocator (E-NNNN etc),
  claude -p subprocess with quota detection (api_error_status=429).
- src/tools/write_evidence.ts: schema-validate (grade A/B/C custody steps),
  resolve chunk_pk via FK, verify verbatim_excerpt actually appears in
  chunk content, INSERT + render case/evidence/E-NNNN.md + audit.
- src/detectives/locard.ts: load chunk → call Claude with locard.md system
  prompt → parse strict JSON → call writeEvidence locally.
- Dockerfile installs `claude` CLI (OAuth) at build time.
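
Two of the checks described above, ID formatting and the verbatim-excerpt test, can be sketched as pure functions. This is an illustration under assumptions: `formatId`, `normalizeWhitespace`, and `excerptAppearsInChunk` are hypothetical names, and collapsing whitespace before comparing is a guess at how "verbatim" might be interpreted, not what write_evidence.ts is known to do.

```typescript
// Render a sequence value as a zero-padded ID such as E-0001 or H-0012.
function formatId(prefix: string, n: number): string {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError(`sequence value must be a positive integer, got ${n}`);
  }
  return `${prefix}-${String(n).padStart(4, "0")}`;
}

// Collapse runs of whitespace so line wraps in the chunk do not
// defeat the comparison (assumption).
function normalizeWhitespace(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// True when the excerpt is non-empty and occurs in the chunk content.
function excerptAppearsInChunk(excerpt: string, chunk: string): boolean {
  const needle = normalizeWhitespace(excerpt);
  return needle.length > 0 && normalizeWhitespace(chunk).includes(needle);
}
```

Rejecting excerpts that do not occur in the chunk guards against the model paraphrasing or inventing a quotation.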

Compose:
- new `investigator` service builds from investigator-runtime/, connects
  with low-privilege role, mounts case/ RW and wiki/+raw/ RO, 512m mem cap.

Web:
- /api/admin/investigate/test (POST+GET) gated by middleware (W0-F1).
  POST creates a job, GET polls status. In W3.6 this endpoint becomes the
  chat tool.

End-to-end smoke: INSERT job → pg_notify → claim → Locard dispatch →
claude subprocess invoked. Auth works (CLI v2.1.150). The Claude quota is
currently exhausted (weekly limit · resets 3pm UTC); the pipeline catches
the typed isQuota error and marks the job failed with the reason surfaced.
Architecture proven; real evidence creation awaits the quota reset.
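
The quota path can be sketched as follows. Only the `api_error_status=429` signal and the `isQuota` flag come from this commit message; the `ClaudeResult` shape, the `QuotaError` class, and the helper names are hypothetical stand-ins for whatever src/lib/claude.ts actually defines.

```typescript
// Hypothetical shape of the parsed subprocess output; only
// api_error_status is named in the commit message.
interface ClaudeResult {
  api_error_status?: number;
  result?: string;
}

// Typed error so callers can branch on quota exhaustion.
class QuotaError extends Error {
  readonly isQuota = true;
  constructor(message: string) {
    super(message);
    this.name = "QuotaError";
  }
}

// Throw a QuotaError on HTTP 429; otherwise return the text result.
function unwrapResult(parsed: ClaudeResult): string {
  if (parsed.api_error_status === 429) {
    throw new QuotaError(parsed.result ?? "quota exhausted");
  }
  return parsed.result ?? "";
}

// Turn any caught error into the reason stored on the failed job.
function failureReason(err: unknown): string {
  if (err instanceof QuotaError) return `quota: ${err.message}`;
  return err instanceof Error ? err.message : String(err);
}
```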

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:49:33 -03:00