disclosure-bureau/investigator-runtime/src/detectives/holmes.ts

191 lines
7.1 KiB
TypeScript

W3.5: Holmes hypothesis tournament detective

Adds the second AI detective in the Investigation Bureau runtime: Sherlock Holmes, who builds 2-3 rival hypotheses with calibrated priors + posteriors against a corpus shortlist.

Pipeline:
1. hybridSearch() grounds Holmes with 8-15 chunks via the same hybrid_search_chunks RPC the web uses (BM25 + dense + RRF). Default max_dense_dist=0.55 (runtime favors recall over precision; web's /api/search/hybrid stays at 0.40 for chat).
2. claude-sonnet-4-6 emits a strict JSON array with position + argument_for + argument_against + prior + posterior + confidence_band + evidence_refs. Citations use [[doc-id/pNNN#cNNNN]] wiki-links.
3. writeHypothesis() validates posterior ∈ [0,1], auto-corrects the Tetlock band from the posterior (high ≥0.90, medium 0.60-0.89, low 0.30-0.59, speculation <0.30), checks evidence_refs FK against public.evidence, INSERTs into public.hypotheses + writes case/hypotheses/H-NNNN.md.

Discipline guarantees (prompts/holmes.md):
- posteriors across rivals sum to ≈1.0
- no claim without chunk citation
- prefer lower band when ambiguous (anti-inflation)
- declarative one-sentence position, no hedging
- emit `NO_HYPOTHESES` when corpus is silent (refuses to fabricate)

Smoke test (Sandia green fireballs 1948-49):
- H-0001 prior 0.5 → posterior 0.2 (speculation): natural meteoric
- H-0002 prior 0.3 → posterior 0.4 (low): classified weapons / tests
- H-0003 prior 0.2 → posterior 0.4 (low): genuinely unidentified

Bayesian update visible: "natural meteoric" prior dropped 60%; both rivals climbed. 4 unique chunk citations across the 3 hypotheses.

orchestrator dispatches `hypothesis_tournament` kind via runHolmes; job marked `failed` if all rivals error, `complete` otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 00:19:43 +00:00
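The band thresholds quoted in the commit message above can be illustrated with a small sketch. This is not the actual `writeHypothesis()` code, which is not shown on this page; the names `ConfidenceBand` and `bandFromPosterior` are hypothetical, and only the cut-offs (high ≥0.90, medium 0.60-0.89, low 0.30-0.59, speculation <0.30) come from the source.

```typescript
// Hypothetical helper: maps a posterior to the band named in the commit message.
type ConfidenceBand = "high" | "medium" | "low" | "speculation";

function bandFromPosterior(posterior: number): ConfidenceBand {
  // The commit message says the writer validates posterior ∈ [0, 1];
  // rejecting with a RangeError is an assumption of this sketch.
  if (!Number.isFinite(posterior) || posterior < 0 || posterior > 1) {
    throw new RangeError(`posterior must be in [0, 1], got ${posterior}`);
  }
  if (posterior >= 0.9) return "high";
  if (posterior >= 0.6) return "medium";
  if (posterior >= 0.3) return "low";
  return "speculation";
}
```

Applied to the smoke-test posteriors, 0.2 falls in `speculation` and 0.4 in `low`, which matches the bands listed for H-0001 through H-0003.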
/**
* holmes.ts: hypothesis tournament detective.
*
* Workflow (matches agentic-layer-spec sec 7):
* 1. The runtime grounds Holmes with a small corpus shortlist via
* hybridSearch. Holmes never gets the whole DB, just the relevant 8-15
* chunks.
* 2. Claude Sonnet 4.6 reads the question + chunks, emits a JSON array of
* 2-3 rival hypotheses with priors/posteriors/citations.
* 3. The runtime parses the array and calls writeHypothesis() for each.
* The writer enforces posterior bounds + Tetlock band + FK to evidence.
*
* Holmes does NOT get tool calls. All grounding is pre-fed; all writes are
* applied by the runtime after validation (sa-security gate #2).
*/
import { readFile } from "node:fs/promises";
import path from "node:path";
import { fileURLToPath } from "node:url";
import { audit } from "../lib/audit";
import { callClaude } from "../lib/claude";
import { env } from "../lib/env";
import { hybridSearch, type SearchHit } from "../lib/search";
import { writeHypothesis, type WriteHypothesisArgs } from "../tools/write_hypothesis";
const HERE = path.dirname(fileURLToPath(import.meta.url));
const PROMPT_PATH = path.resolve(HERE, "..", "..", "prompts", "holmes.md");
export interface HolmesTask {
job_id: string;
question: string;
W4: bilingual EN + PT-BR Investigation Bureau (CLAUDE.md §3 contract) User flagged that the bureau was emitting English-only output, violating the project's bilingual rule. Every narrative field now ships in both languages: stored in sibling DB columns + rendered as adjacent markdown sections per CLAUDE.md §3. Migration 0007 (apply as supabase_admin): - public.hypotheses +question_pt_br, +position_pt_br, +argument_for_pt_br, +argument_against_pt_br - public.contradictions +topic_pt_br, +notes_pt_br - public.witnesses +access_to_event_pt_br, +bias_notes_pt_br, +verdict_pt_br - public.gaps +description_pt_br, +suggested_next_move_pt_br - public.evidence: unchanged (verbatim_excerpt stays source-language) - JSONB siblings inside contradictions.chunks + gaps.scope handled at runtime (statement_pt_br, title_pt_br, dominant_model_pt_br, why_surprising_pt_br, what_it_implies_pt_br). Detective prompts (all 7) rewritten with explicit bilingual JSON contract: - Output protocol section names every EN field + its _pt_br sibling - "Bilingual is mandatory" warning in the task instruction - Sentinel skip-states unchanged (NO_HYPOTHESES, NO_CONTRADICTIONS, INSUFFICIENT_TESTIMONY, INSUFFICIENT_HYPOTHESIS, NO_OUTLIERS, NO_NEW_EVIDENCE, INSUFFICIENT_ARTEFACTS) - Schneier: parallel arrays — hidden_assumptions[i] matches hidden_assumptions_pt_br[i], lengths must match - Case-Writer: interleaved §1 (EN) / §1 (PT-BR) per act in the body Writer-side validation (all 7 tools): - Reject INSERT if PT-BR sibling missing when EN field is set - Persist both languages atomically in one INSERT (no half-updates) - Markdown renderers write adjacent EN+PT-BR sections in case files (## Argument for (EN) followed by ## Argumento a favor (PT-BR), etc.) 
Detective parse layer (all 7 detectives): - Coerce both keys from JSON output - "incomplete_bilingual_*" skip reason when either side missing - Defensive: PT-BR fields trimmed + length-capped same as EN Orchestrator propagates question_pt_br + topic_pt_br through job payload to runHolmes / runCaseWriter, mirroring the chat-tool entry point. Web (UI): - /api/jobs/[id] hydrates _pt_br siblings from pg - job-status-poller HypothesisCard: PT-BR primary, EN in <details> fallback when both exist - ContradictionCard: PT-BR statement primary + secondary EN quote - WitnessCard: PT-BR verdict primary + secondary EN quote, panels in PT - GapCard: PT-BR title/why/implies primary - /bureau hub: SELECTs both columns, renders PT-BR primary - /h/[id]: ArgumentPanel renders PT-BR primary with collapsible EN fallback when both exist - BureauSnapshot homepage: position_pt_br / topic_pt_br / verdict_pt_br primary - DocBureauPanel /d/[doc]: same primary-PT-BR pattern - New web/lib/i18n/pick.ts helper (unused yet by chat/agents — kept for future locale-driven switching when both languages are equally full; current rule is PT-BR-first since the user is brasileiro) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 15:02:59 +00:00
/** Optional PT-BR mirror of the question. If omitted, the EN one is used
* for both sides until the model emits PT-BR output. */
question_pt_br?: string;
/** Optional scope narrowing — restrict the search to one doc / entity. */
doc_id?: string;
lang?: "pt" | "en";
/** How many chunks to feed Holmes. Default 12. */
context_chunks?: number;
budget_cap_usd?: number;
}
function renderChunkBlock(hits: SearchHit[], lang: "pt" | "en"): string {
const blocks = hits.map((h, i) => {
const text = (lang === "en" ? h.content_en : h.content_pt) || h.content_en || h.content_pt || "";
const pageStr = String(h.page).padStart(3, "0");
return [
`--- chunk ${i + 1} ---`,
`id: [[${h.doc_id}/p${pageStr}#${h.chunk_id}]]`,
`type: ${h.type}`,
h.classification ? `classification: ${h.classification}` : null,
"",
text.slice(0, 1200),
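// Drop only the null placeholder: filter(Boolean) would also strip the
// "" entry that separates the chunk header from its text.
].filter((line): line is string => line !== null).join("\n");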
});
return blocks.join("\n\n");
}
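The `id:` line written by `renderChunkBlock` uses the `[[doc-id/pNNN#cNNNN]]` wiki-link form that Holmes is asked to cite. A minimal sketch of how such citations could be pulled back out of model output follows. `ChunkRef` and `extractCitations` are hypothetical names, and the regex assumes that doc ids contain no slashes, brackets, or whitespace and that chunk ids look like `c` followed by digits; neither assumption is confirmed by this file.

```typescript
// Hypothetical shape for a parsed [[doc-id/pNNN#cNNNN]] citation.
interface ChunkRef {
  docId: string;
  page: number;
  chunkId: string;
}

// Assumed grammar: [[<doc-id>/p<3+ digits>#c<digits>]]
const CITATION_SOURCE = "\\[\\[([^\\[\\]/\\s]+)/p(\\d{3,})#(c\\d+)\\]\\]";

function extractCitations(text: string): ChunkRef[] {
  const re = new RegExp(CITATION_SOURCE, "g"); // fresh regex, so lastIndex starts at 0
  const refs: ChunkRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    refs.push({ docId: m[1], page: Number(m[2]), chunkId: m[3] });
  }
  return refs;
}
```

A caller could compare the extracted `chunkId` values against the shortlist that was actually fed to the model, to spot citations of chunks Holmes never saw.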
function buildPrompt(task: HolmesTask, hits: SearchHit[], lang: "pt" | "en"): string {
const block = renderChunkBlock(hits, lang);
const ptQ = task.question_pt_br?.trim();
return [
`# Question to investigate`,
"",
`**EN.** ${task.question}`,
ptQ ? `**PT-BR.** ${ptQ}` : null,
"",
`## Corpus shortlist (${hits.length} chunks${task.doc_id ? `, scoped to ${task.doc_id}` : ""})`,
"",
block,
"",
"## Your task",
"",
"Build 2-3 rival hypotheses about the question above. Each must cite at",
"least one chunk via [[doc-id/pNNN#cNNNN]] in both argument_for and",
"argument_against (EN) and in argument_for_pt_br and",
"argument_against_pt_br (PT-BR). Assign priors + posteriors summing",
"roughly to 1.0. Emit the JSON array exactly as specified by the system",
"prompt — no prose, no code fence, no preamble. **Bilingual is mandatory:",
"every narrative field appears in both EN and PT-BR.**",
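// Drop only the null placeholder (absent PT-BR question): filter(Boolean)
// would also strip the "" entries meant to be blank lines in the prompt.
].filter((line): line is string => line !== null).join("\n");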
}
function extractJsonArray(text: string): unknown[] | null {
const t = text.trim();
W3.8: Investigation Bureau complete — Poirot, Taleb, Tetlock, Case-Writer Brings the bureau from 4 → 8 detectives. All eight run as Bun + claude-CLI subprocesses against the same Supabase + investigation_jobs LISTEN/NOTIFY queue, sharing search.ts hybridSearch and writer-side validators that gate writes against schema + FK. New detectives: Poirot (witness_analysis) - prompts/poirot.md — credibility / access / bias / corroboration / verdict; uses entity_mentions JOIN chunks to pull 12 chunks per person; resolves corroboration_refs chunk_ids defensively (accepts bare cNNNN even when the model emits pNNN/cNNNN). - INSERT into public.witnesses with W-NNNN naming. - Tone: purple (#9b5de5). Taleb (outlier_scan) - prompts/taleb.md — "surprise is relative to a model"; at most 3 outliers; each requires explicit dominant_model + why_surprising + what_it_implies; fan-out into public.gaps with scope.kind="outlier". - Same unscoped-fallback as Dupin (Pass 1 with doc_id, Pass 2 widens to corpus if hits < 3). - Tone: yellow (#ffd23f). Tetlock (calibrate_hypothesis) - prompts/tetlock.md — honest Bayesian update; emits new_posterior + Δ + recommended_action ∈ {keep, downgrade, upgrade, supersede}. - write_calibration UPDATEs public.hypotheses + APPENDS a "## Calibration history" section to the H-NNNN.md case file (calibration is append-only — each datapoint matters). Posterior band auto-corrected to match Tetlock thresholds. - NO_NEW_EVIDENCE sentinel handled; pure 'keep' with |Δ|<0.005 only touches updated_at + reviewed_by. - Tone: teal (#26d4cc). Case-Writer (case_report) - prompts/case-writer.md — Dr. Watson assembles all artefacts (E-NNNN, H-NNNN, R-NNNN, W-NNNN, G-NNNN) into a five-act narrative. ILIKE filter on topic; doc_id optional scope. - Larger budget cap (≥ $0.50) + longer timeout for prose generation. - Writes case/reports/<slug>.md with frontmatter (topic + counts); no DB table for v0. - New page /c/[slug] renders the report via MarkdownBody + stat chips. 
- Tone: gold (#e0c080). Hardening across the bureau: - Sentinel parsing now accepts backticked AND prose-trailing forms (Holmes NO_HYPOTHESES, Dupin NO_CONTRADICTIONS, Schneier INSUFFICIENT_HYPOTHESIS, Poirot INSUFFICIENT_TESTIMONY, Taleb NO_OUTLIERS, Tetlock NO_NEW_EVIDENCE, Case-Writer INSUFFICIENT_ARTEFACTS). Avoids the failure mode where the model refuses honestly but the runtime treated it as a parse error (observed live with Poirot+Hoover identifying the DIRECTOR false-positive disambiguation issue in entity_mentions). Chat tool extensions (web/lib/chat/tools.ts): - request_investigation now accepts 7 kinds. Each routes to its detective with appropriate validation (hypothesis_id regex, person_id kebab-case, topic non-empty, doc_id for evidence_chain). - ETA per kind: Holmes/Dupin 60s, Poirot 45s, Schneier/Tetlock 30s, Taleb 50s, Case-Writer 180s (longer prose), Locard 30×n_chunks. UI integration: - chat-bubble inline card paints each detective in its tone color. - /jobs/[id] page header swaps name/subtitle/tone per detective; question label adapts ("Topic" / "Hypothesis under attack" / "Witness under analysis" / "Topic to outlier-scan" / "Hypothesis under recalibration" / "Case to assemble"). - job-status-poller renders: case-report link card (gold), outlier cards (yellow), witness cards (purple) — alongside existing hypothesis, evidence, contradiction cards. - /api/jobs/[id] hydrates witnesses (JOIN entities for canonical_name) + gaps (with scope JSONB). - /c/[slug] page reads /data/ufo/case/reports/<slug>.md and renders with MarkdownBody, frontmatter parsed for stat chips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 01:11:39 +00:00
if (/^`?NO_HYPOTHESES`?\b/i.test(t)) return null;
const stripped = t.replace(/^```(?:json)?\s*\n?/i, "").replace(/\n?```\s*$/i, "");
const first = stripped.indexOf("[");
const last = stripped.lastIndexOf("]");
// last < first also covers a stray "]" that appears before the first "[".
if (first === -1 || last < first) {
throw new Error(`holmes returned no JSON array: ${t.slice(0, 200)}`);
}
const parsed = JSON.parse(stripped.slice(first, last + 1));
if (!Array.isArray(parsed)) throw new Error("holmes JSON is not an array");
return parsed;
}
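The prompt asks for posteriors that sum roughly to 1.0 across rivals. A runtime could check this on the array returned by `extractJsonArray`; whether the real pipeline does so is not visible here. The sketch below is a hypothetical check: the function name `posteriorsSumToOne` and the default tolerance of 0.05 are assumptions, not values from the source.

```typescript
// Hypothetical check: do the `posterior` fields of the parsed items sum to ≈1.0?
// The 0.05 default tolerance is an assumption chosen for illustration.
function posteriorsSumToOne(items: unknown[], tolerance = 0.05): boolean {
  let sum = 0;
  for (const item of items) {
    if (typeof item !== "object" || item === null) return false;
    const p = (item as Record<string, unknown>).posterior;
    // Each posterior must itself be a finite number in [0, 1].
    if (typeof p !== "number" || !Number.isFinite(p) || p < 0 || p > 1) return false;
    sum += p;
  }
  return Math.abs(sum - 1) <= tolerance;
}
```

With the smoke-test posteriors from the commit message (0.2, 0.4, 0.4) the check passes; an empty array fails because its sum is 0.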
export async function runHolmes(task: HolmesTask): Promise<
| { hypotheses: Array<{ hypothesis_id: string; case_file: string }> }
| { skipped: true; reason: string }
> {
const lang: "pt" | "en" = task.lang ?? "pt";
const k = task.context_chunks ?? 12;
// 1. Ground with hybrid_search.
const hits = await hybridSearch({
query: task.question,
lang,
doc_id: task.doc_id ?? null,
top_k: k,
recall_k: 60,
});
await audit({
event: "holmes_grounded",
job_id: task.job_id,
detective: "holmes@detective",
question: task.question,
n_chunks: hits.length,
doc_id: task.doc_id ?? null,
});
if (hits.length === 0) {
return { skipped: true, reason: "no_corpus_match" };
}
// 2. Call Claude.
const systemPrompt = await readFile(PROMPT_PATH, "utf-8");
const prompt = buildPrompt(task, hits, lang);
const llm = await callClaude({
prompt,
systemPrompt,
model: env.CLAUDE_MODEL,
allowedTools: [],
timeoutMs: env.JOB_TIMEOUT_SECONDS * 1000,
budgetCapUsd: task.budget_cap_usd ?? env.BUDGET_CAP_USD_PER_JOB,
});
await audit({
event: "detective_completed",
job_id: task.job_id,
detective: "holmes@detective",
cost_usd: llm.costUsd,
tokens_in: llm.tokensIn,
tokens_out: llm.tokensOut,
duration_ms: llm.durationMs,
});
console.error(`[holmes] response (${llm.text.length} chars): ${llm.text.slice(0, 800)}`);
// 3. Parse + write.
const arr = extractJsonArray(llm.text);
if (arr === null) return { skipped: true, reason: "NO_HYPOTHESES" };
const out: Array<{ hypothesis_id: string; case_file: string }> = [];
for (const raw of arr.slice(0, 3)) {
W4: bilingual EN + PT-BR Investigation Bureau (CLAUDE.md §3 contract) User flagged that the bureau was emitting English-only output, violating the project's bilingual rule. Every narrative field now ships in both languages: stored in sibling DB columns + rendered as adjacent markdown sections per CLAUDE.md §3. Migration 0007 (apply as supabase_admin): - public.hypotheses +question_pt_br, +position_pt_br, +argument_for_pt_br, +argument_against_pt_br - public.contradictions +topic_pt_br, +notes_pt_br - public.witnesses +access_to_event_pt_br, +bias_notes_pt_br, +verdict_pt_br - public.gaps +description_pt_br, +suggested_next_move_pt_br - public.evidence: unchanged (verbatim_excerpt stays source-language) - JSONB siblings inside contradictions.chunks + gaps.scope handled at runtime (statement_pt_br, title_pt_br, dominant_model_pt_br, why_surprising_pt_br, what_it_implies_pt_br). Detective prompts (all 7) rewritten with explicit bilingual JSON contract: - Output protocol section names every EN field + its _pt_br sibling - "Bilingual is mandatory" warning in the task instruction - Sentinel skip-states unchanged (NO_HYPOTHESES, NO_CONTRADICTIONS, INSUFFICIENT_TESTIMONY, INSUFFICIENT_HYPOTHESIS, NO_OUTLIERS, NO_NEW_EVIDENCE, INSUFFICIENT_ARTEFACTS) - Schneier: parallel arrays — hidden_assumptions[i] matches hidden_assumptions_pt_br[i], lengths must match - Case-Writer: interleaved §1 (EN) / §1 (PT-BR) per act in the body Writer-side validation (all 7 tools): - Reject INSERT if PT-BR sibling missing when EN field is set - Persist both languages atomically in one INSERT (no half-updates) - Markdown renderers write adjacent EN+PT-BR sections in case files (## Argument for (EN) followed by ## Argumento a favor (PT-BR), etc.) 
Detective parse layer (all 7 detectives): - Coerce both keys from JSON output - "incomplete_bilingual_*" skip reason when either side missing - Defensive: PT-BR fields trimmed + length-capped same as EN Orchestrator propagates question_pt_br + topic_pt_br through job payload to runHolmes / runCaseWriter, mirroring the chat-tool entry point. Web (UI): - /api/jobs/[id] hydrates _pt_br siblings from pg - job-status-poller HypothesisCard: PT-BR primary, EN in <details> fallback when both exist - ContradictionCard: PT-BR statement primary + secondary EN quote - WitnessCard: PT-BR verdict primary + secondary EN quote, panels in PT - GapCard: PT-BR title/why/implies primary - /bureau hub: SELECTs both columns, renders PT-BR primary - /h/[id]: ArgumentPanel renders PT-BR primary with collapsible EN fallback when both exist - BureauSnapshot homepage: position_pt_br / topic_pt_br / verdict_pt_br primary - DocBureauPanel /d/[doc]: same primary-PT-BR pattern - New web/lib/i18n/pick.ts helper (unused yet by chat/agents — kept for future locale-driven switching when both languages are equally full; current rule is PT-BR-first since the user is brasileiro) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 15:02:59 +00:00
const r = raw as Record<string, unknown>;
const strOrUndef = (k: string): string | undefined =>
typeof r[k] === "string" && (r[k] as string).trim().length > 0
? (r[k] as string).trim() : undefined;
const args: WriteHypothesisArgs = {
question: task.question,
question_pt_br: task.question_pt_br ?? task.question,
position: String(r.position ?? "").trim(),
position_pt_br: strOrUndef("position_pt_br"),
argument_for: strOrUndef("argument_for"),
argument_for_pt_br: strOrUndef("argument_for_pt_br"),
argument_against: strOrUndef("argument_against"),
argument_against_pt_br: strOrUndef("argument_against_pt_br"),
prior: Number(r.prior),
posterior: Number(r.posterior),
confidence_band: r.confidence_band as WriteHypothesisArgs["confidence_band"],
evidence_refs: Array.isArray(r.evidence_refs)
? (r.evidence_refs as Array<{ evidence_id?: string; supports?: boolean; weight?: number }>)
.filter((x): x is { evidence_id: string; supports?: boolean; weight?: number } =>
typeof x?.evidence_id === "string" && x.evidence_id.length > 0)
: [],
};
if (!args.position) continue;
try {
// Named `written` so it does not shadow the raw JSON record `r` above.
const written = await writeHypothesis(args, { job_id: task.job_id, detective: "holmes@detective" });
out.push(written);
} catch (e) {
await audit({
event: "write_hypothesis_failed",
job_id: task.job_id,
detective: "holmes@detective",
error: (e as Error).message,
position: args.position.slice(0, 200),
});
}
}
return { hypotheses: out };
}
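
The W3.5 commit message says writeHypothesis() auto-corrects the confidence band from the posterior (high ≥0.90, medium 0.60-0.89, low 0.30-0.59, speculation <0.30) and that posteriors across rivals should sum to roughly 1.0. The writer itself is not shown in this file, so the following is only a minimal sketch of those two rules. The names `ConfidenceBand`, `bandFromPosterior` and `posteriorsSumToOne`, and the 0.05 tolerance, are assumptions made for illustration and not the project's actual implementation.

```typescript
// Hypothetical type: the four band names listed in the commit message.
type ConfidenceBand = "high" | "medium" | "low" | "speculation";

// Hypothetical helper: map a posterior in [0, 1] to its band using the
// thresholds quoted in the commit message.
function bandFromPosterior(posterior: number): ConfidenceBand {
  if (!Number.isFinite(posterior) || posterior < 0 || posterior > 1) {
    throw new RangeError(`posterior must be in [0, 1], got ${posterior}`);
  }
  if (posterior >= 0.9) return "high";
  if (posterior >= 0.6) return "medium";
  if (posterior >= 0.3) return "low";
  return "speculation";
}

// Hypothetical helper: check that rival posteriors sum to about 1.0.
// The tolerance value is an assumption; the source only says "≈1.0".
function posteriorsSumToOne(posteriors: number[], tolerance = 0.05): boolean {
  const total = posteriors.reduce((acc, p) => acc + p, 0);
  return Math.abs(total - 1) <= tolerance;
}

// Posteriors from the smoke test described in the commit message.
const smokeTest = [0.2, 0.4, 0.4];
console.log(smokeTest.map(bandFromPosterior), posteriorsSumToOne(smokeTest));
```

With the smoke-test posteriors, 0.2 falls in the speculation band and 0.4 in the low band, which matches the bands reported for H-0001 through H-0003. Deriving the band from the posterior, rather than trusting the model's own `confidence_band` value, keeps the two fields from disagreeing.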