disclosure-bureau/CLAUDE.md

# CLAUDE.md: Binding Contract for the UFO/UAP Wiki
> Version `0.1.0` · Last updated `2026-05-13` · Canonical schema in [`CLAUDE-schema-full.md`](CLAUDE-schema-full.md)

Every agent that touches this project **reads this file at boot**. Reading this contract alone is sufficient for routine tasks; schema details live in `CLAUDE-schema-full.md`.
## 1. Philosophy in one sentence
An investigative wiki in the style of the **Karpathy LLM Wiki** plus an **Investigation Bureau** (8 detectives: Holmes/Poirot/Dupin/Locard + Schneier/Tetlock/Taleb). Pure Markdown, no RAG, with full provenance for every claim.
## 2. Layout
```
/Users/guto/ufo/
├── CLAUDE.md ← this file (contract)
├── CLAUDE-schema-full.md ← full schema for the 24 types
├── raw/ ← IMMUTABLE (115 PDFs + 14 JPG/PNG)
├── processing/ ← intermediate (PNGs, OCR, raw vision output)
├── wiki/ ← GENERATED (documents, pages, entities, tables, images)
├── case/ ← Investigation Bureau (evidence, witnesses, hypotheses, ...)
└── scripts/ ← ingest, dedup, and lint pipelines
```
**Golden rule:** nothing writes to `raw/`. References use the relative path `../raw/<file>.pdf`.
## 3. Language: bilingual EN + PT-BR (Brazilian Portuguese)
The wiki is **bilingual** from ingest onward. A single Haiku vision call generates the EN and PT-BR text together (one pass, which preserves the image's visual context).
| Field category | Language |
|---|---|
| YAML keys | **English** (international standard) |
| OCR raw text | **Source language only** (verbatim, no translation) |
| `verbatim_excerpt` (evidence), `verbatim_quotes` (person), `caption_ocr` (image) | **Source language only** |
| Enums (`page_type`, `content_classification`, `evidence_grade`, `confidence_band`, redaction codes, classification markings) | **English** (universal) |
| `canonical_name`, technical IDs | **Source language**; aliases array can hold PT-BR forms |
| Narrative descriptions (`vision_description`, `narrative_summary`, `executive_summary`, `description` in gaps, `definition_short` in concepts, `verdict_rationale` in witnesses) | **Both EN and PT-BR** via sibling fields `vision_description` + `vision_description_pt_br` etc. |
| Markdown body sections (headings + commentary) | **Both EN and PT-BR** in adjacent sections: `## Vision Description (EN)` then `## Descrição Vision (PT-BR)` |
**PT-BR rules:**
- Must be **Brazilian Portuguese** (`pt-br`), NOT European Portuguese. Use Brazilian vocabulary and spelling.
- Preserve UTF-8 accents correctly: `ç`, `ã`, `á`, `é`, `í`, `ó`, `ú`, `â`, `ê`, `ô`, `à`. Never strip accents.
- When a verbatim quote from the document appears inside a narrative paragraph, keep the **quote** in source language and translate only the surrounding narration.
- IDs always ASCII-fold (kebab-case without accents). Display fields (`canonical_name`) preserve accents when applicable.
Encoding: **always UTF-8**.
## 4. The 24 Markdown types
| Type | Path | Owner |
|---|---|---|
| `document` | `wiki/documents/<doc-id>.md` | archivist |
| `page` | `wiki/pages/<doc-id>/p<NNN>.md` | archivist + evidence-officer |
| `person` | `wiki/entities/people/<id>.md` | profiler |
| `organization` | `wiki/entities/organizations/<id>.md` | profiler |
| `location` | `wiki/entities/locations/<id>.md` | archivist |
| `event` | `wiki/entities/events/<id>.md` | timeline-analyst |
| `uap_object` | `wiki/entities/uap-objects/<id>.md` | evidence-officer |
| `vehicle` | `wiki/entities/vehicles/<id>.md` | archivist |
| `operation` | `wiki/entities/operations/<id>.md` | archivist |
| `concept` | `wiki/entities/concepts/<id>.md` | archivist |
| `table` | `wiki/tables/<table-id>.md` | archivist |
| `image` | `wiki/images/<image-id>.md` | evidence-officer |
| `evidence` | `case/evidence/<E-NNNN>.md` | evidence-officer |
| `witness_analysis` | `case/witnesses/<W-NNNN>.md` | witness-officer |
| `timeline` | `case/timelines/<scope>.md` | timeline-analyst |
| `hypothesis` | `case/hypotheses/<H-NNNN>.md` | hypothesis-lead |
| `actor_profile` | `case/profiles/<AP-NNNN>.md` | profiler |
| `gap` | `case/gaps/<G-NNNN>.md` | archivist + chief-detective |
| `relation` | `case/connect-the-dots/<R-NNNN>.md` | chief-detective |
| `case_report` | `case/case-report.md` | case-writer |
| `residual_uncertainty` | `case/residual-uncertainty.md` | chief-detective |
| `index` | `wiki/index.md` | archivist |
| `log` | `wiki/log.md` | archivist (append-only) |
| (this file) | `CLAUDE.md` | chief-detective |

Detailed frontmatter schemas are in [`CLAUDE-schema-full.md`](CLAUDE-schema-full.md).
## 5. Universal required frontmatter
Every `.md` file in `wiki/` and `case/` has:
```yaml
---
schema_version: "0.1.0"
type: <enum> # document | page | person | ... (24 types)
canonical_title: "..." # OR canonical_name (entities)
wiki_version: "0.1.0"
last_ingest: "2026-05-13T14:22:11Z" # OR last_revised
---
```
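The required keys above can be checked mechanically once a file's frontmatter has been parsed into a dictionary (for instance with PyYAML, which the lint stack already uses). The sketch below is only an illustration: the names `REQUIRED_KEY_GROUPS` and `missing_frontmatter_keys` are hypothetical, and treating an empty string as missing is an assumption.

```python
from typing import Any, Dict, List

# Each tuple is a group of alternatives: at least one key in the group
# must be present (e.g. canonical_title OR canonical_name).
REQUIRED_KEY_GROUPS = [
    ("schema_version",),
    ("type",),
    ("canonical_title", "canonical_name"),
    ("wiki_version",),
    ("last_ingest", "last_revised"),
]


def missing_frontmatter_keys(frontmatter: Dict[str, Any]) -> List[str]:
    """Return the required key groups that the frontmatter does not satisfy."""
    missing = []
    for group in REQUIRED_KEY_GROUPS:
        # Assumption: a key set to None or "" counts as missing.
        if not any(frontmatter.get(key) not in (None, "") for key in group):
            missing.append(" | ".join(group))
    return missing
```

An empty result means the universal keys are present; type-specific fields would still need the checks described in `CLAUDE-schema-full.md`.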
## 6. Canonical naming (regex)
| Type | Regex | Example |
|---|---|---|
| `doc_id` | `^[a-z0-9][a-z0-9-]*$` | `dow-uap-d54-mission-report-mediterranean-sea-na` |
| `page_id` | `^[a-z0-9-]+/p\d{3}$` | `dow-uap-d54-.../p007` |
| `person_id` | `^[a-z][a-z0-9-]*$` (ASCII-fold) | `j-edgar-hoover` |
| `event_id` | `^EV-\d{4}-(\d{2}\|XX)-(\d{2}\|XX)-[a-z0-9-]+$` | `EV-2004-11-14-tic-tac-nimitz` |
| `uap_object_id` | `^OBJ-[A-Z0-9-]+-\d{2}$` | `OBJ-EV2004-NIMITZ-01` |
| `evidence_id` | `^E-\d{4}$` | `E-0042` |
| `witness_id` | `^W-\d{4}$` | `W-0007` |
| `hypothesis_id` | `^H-\d{4}$` | `H-0003` |
| `table_id` | `^TBL-[A-Z0-9]+-\d{4}$` | `TBL-DOWD54-0003` |
| `image_id` | `^IMG-[A-Z0-9]+-p\d{3}-\d{2}$` | `IMG-DOWD54-p007-01` |
| `gap_id` | `^G-\d{4}$` | `G-0012` |
| `relation_id` | `^R-\d{4}$` | `R-0028` |
| `actor_profile_id` | `^AP-\d{4}$` | `AP-0001` |
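As a sketch of how a lint script might apply this table, the patterns can be compiled into a lookup and matched against candidate IDs. The names `ID_PATTERNS` and `is_valid_id` are hypothetical. Note that the `\|` in the `event_id` row is only Markdown table escaping; the regex itself uses a plain `|`.

```python
import re

# Patterns copied from the naming table (table escaping removed).
ID_PATTERNS = {
    "doc_id": r"^[a-z0-9][a-z0-9-]*$",
    "page_id": r"^[a-z0-9-]+/p\d{3}$",
    "person_id": r"^[a-z][a-z0-9-]*$",
    "event_id": r"^EV-\d{4}-(\d{2}|XX)-(\d{2}|XX)-[a-z0-9-]+$",
    "uap_object_id": r"^OBJ-[A-Z0-9-]+-\d{2}$",
    "evidence_id": r"^E-\d{4}$",
    "witness_id": r"^W-\d{4}$",
    "hypothesis_id": r"^H-\d{4}$",
    "table_id": r"^TBL-[A-Z0-9]+-\d{4}$",
    "image_id": r"^IMG-[A-Z0-9]+-p\d{3}-\d{2}$",
    "gap_id": r"^G-\d{4}$",
    "relation_id": r"^R-\d{4}$",
    "actor_profile_id": r"^AP-\d{4}$",
}


def is_valid_id(kind: str, value: str) -> bool:
    """Check a value against the canonical pattern for its ID kind."""
    # fullmatch also rejects a trailing newline, which a bare "$" would allow.
    return re.fullmatch(ID_PATTERNS[kind], value) is not None
```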
### Algorithm `filename → doc_id`
```
1. Strip extension (.pdf, .jpg, .png)
2. NFD + remove combining marks (ASCII fold)
3. Lowercase
4. Replace whitespace/underscore/non-[a-z0-9-] with "-"
5. Collapse repeated "-"
6. Trim leading/trailing "-"
7. If it starts with a digit, prefix "doc-"
```
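The seven steps above map almost one-to-one onto the Python standard library. The following is a minimal sketch, not the project's actual script: the function name `filename_to_doc_id` is hypothetical, and matching the extension case-insensitively is an assumption.

```python
import re
import unicodedata


def filename_to_doc_id(filename: str) -> str:
    """Derive a doc_id from a raw filename, following steps 1-7."""
    # 1. Strip the extension (assumption: case-insensitive match).
    stem = re.sub(r"\.(pdf|jpg|png)$", "", filename, flags=re.IGNORECASE)
    # 2. NFD-decompose, then drop combining marks (ASCII fold).
    decomposed = unicodedata.normalize("NFD", stem)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 3. Lowercase.
    lowered = folded.lower()
    # 4. Replace whitespace, underscores, and anything outside [a-z0-9-].
    dashed = re.sub(r"[^a-z0-9-]", "-", lowered)
    # 5. Collapse repeated hyphens.
    collapsed = re.sub(r"-{2,}", "-", dashed)
    # 6. Trim leading and trailing hyphens.
    trimmed = collapsed.strip("-")
    # 7. Prefix "doc-" when the result starts with a digit.
    if trimmed[:1].isdigit():
        trimmed = "doc-" + trimmed
    return trimmed
```

For example, a hypothetical file named `Relatório Ação.PDF` would become `relatorio-acao`, and `1947_memo.png` would become `doc-1947-memo`.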
## 7. Wiki-links — 18 namespaces
```
[[doc-id]] → wiki/documents/<doc-id>.md
[[doc-id/pNNN]] → wiki/pages/<doc-id>/p<NNN>.md
[[people/<id>]] → wiki/entities/people/<id>.md
[[org/<id>]] → wiki/entities/organizations/<id>.md
[[loc/<id>]] → wiki/entities/locations/<id>.md
[[event/<id>]] → wiki/entities/events/<id>.md
[[uap/<id>]] → wiki/entities/uap-objects/<id>.md
[[vehicle/<id>]] → wiki/entities/vehicles/<id>.md
[[op/<id>]] → wiki/entities/operations/<id>.md
[[concept/<id>]] → wiki/entities/concepts/<id>.md
[[table/<id>]] [[image/<id>]] → wiki/tables|images/<id>.md
[[evidence/<id>]] [[witness/<id>]]
[[hypothesis/<id>]] [[profile/<id>]]
[[gap/<id>]] [[relation/<id>]] → case/...
[[people/...|Grusch]] → custom display text
```
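A resolver for these links might look like the sketch below. The directories for the `case/` namespaces are not spelled out in the listing above; here they are assumed to follow the paths in the section 4 table. `NAMESPACE_DIRS` and `resolve_wiki_link` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Namespace prefix -> directory. The case/ entries are an assumption
# taken from the paths listed in the type table.
NAMESPACE_DIRS = {
    "people": "wiki/entities/people",
    "org": "wiki/entities/organizations",
    "loc": "wiki/entities/locations",
    "event": "wiki/entities/events",
    "uap": "wiki/entities/uap-objects",
    "vehicle": "wiki/entities/vehicles",
    "op": "wiki/entities/operations",
    "concept": "wiki/entities/concepts",
    "table": "wiki/tables",
    "image": "wiki/images",
    "evidence": "case/evidence",
    "witness": "case/witnesses",
    "hypothesis": "case/hypotheses",
    "profile": "case/profiles",
    "gap": "case/gaps",
    "relation": "case/connect-the-dots",
}

LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def resolve_wiki_link(link: str) -> Tuple[str, Optional[str]]:
    """Return (relative path, custom display text or None) for one wiki-link."""
    match = LINK_RE.fullmatch(link.strip())
    if match is None:
        raise ValueError(f"not a wiki-link: {link!r}")
    target, display = match.group(1).strip(), match.group(2)
    prefix, sep, rest = target.partition("/")
    if sep and prefix in NAMESPACE_DIRS:
        return f"{NAMESPACE_DIRS[prefix]}/{rest}.md", display
    page = re.fullmatch(r"([a-z0-9-]+)/(p\d{3})", target)
    if page:
        return f"wiki/pages/{page.group(1)}/{page.group(2)}.md", display
    # No namespace and no page suffix: treat the target as a doc-id.
    return f"wiki/documents/{target}.md", display
```

Known namespaces are checked before the page pattern, so a document whose id happened to equal a namespace prefix would be misread; that edge case is left open in this sketch.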
**Backlinks** (`mentioned_in[]` on entities) are **materialized by Lint, NOT written by hand**.
## 8. Confidence calibration (Tetlock)
| Band | Range | Permitted language |
|---|---|---|
| `high` | ≥0.90 | "demonstrates" (*demonstra*), "establishes" (*estabelece*) |
| `medium` | 0.60–0.89 | "strongly suggests" (*sugere fortemente*), "indicates" (*indica*) |
| `low` | 0.30–0.59 | "possibly" (*possivelmente*), "may" (*pode*) |
| `speculation` | <0.30 | "hypothesis" (*hipótese*), "speculation" (*especulação*), always labeled |

Every claim in an executive summary carries a `confidence_band`.
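Read as thresholds, the band table reduces to a small function. This is a sketch under one assumption: values that fall between the printed ranges (for example 0.895) are assigned to the lower band. The name `confidence_band` mirrors the field name, but the function itself is hypothetical.

```python
def confidence_band(probability: float) -> str:
    """Map a probability in [0, 1] to its calibration band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.90:
        return "high"
    if probability >= 0.60:
        return "medium"
    if probability >= 0.30:
        return "low"
    return "speculation"
```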
## 9. Content classification (`content_classification`)
Enum array on `document` and `page`:
- `text-only` · `contains-photos` · `contains-sketches` · `contains-diagrams` · `contains-maps` · `contains-tables` · `contains-signatures` · `contains-stamps` · `redaction-heavy` (>30% redacted) · `mixed` · `blank`
Doc-level value = union of the page values.
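The doc-level rule is a plain set union over the per-page arrays. A minimal sketch follows; the function name is hypothetical, and both the sorted output and the rejection of unknown values are assumptions rather than stated requirements.

```python
from typing import Iterable, List

ALLOWED = {
    "text-only", "contains-photos", "contains-sketches", "contains-diagrams",
    "contains-maps", "contains-tables", "contains-signatures",
    "contains-stamps", "redaction-heavy", "mixed", "blank",
}


def doc_level_classification(pages: Iterable[Iterable[str]]) -> List[str]:
    """Union the content_classification arrays of all pages of a document."""
    merged = set()
    for page_values in pages:
        merged.update(page_values)
    unknown = merged - ALLOWED
    if unknown:
        raise ValueError(f"unknown classification values: {sorted(unknown)}")
    # Sorted only to make the output deterministic (assumption).
    return sorted(merged)
```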
## 10. Provenance (Locard)
- Every `evidence` points to a `source_page` + `bbox` (optional).
- Every claim on an entity has `mentioned_in[]` with a `page_ref`.
- `chain_of_custody[]` is required on evidence; `custody_gaps[]` must be explicit.
- Grade A → ≥3 custody steps · Grade B → ≥2 · Grade C → ≥1
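The grade rule on the last bullet can be expressed as a lookup of minimum chain lengths. In this sketch `MIN_CUSTODY_STEPS` and `custody_is_sufficient` are hypothetical names, and the chain is assumed to be a list with one entry per custody step.

```python
from typing import Sequence

# Minimum number of chain_of_custody entries per evidence grade.
MIN_CUSTODY_STEPS = {"A": 3, "B": 2, "C": 1}


def custody_is_sufficient(grade: str, chain_of_custody: Sequence[object]) -> bool:
    """Check that an evidence record has enough custody steps for its grade."""
    if grade not in MIN_CUSTODY_STEPS:
        raise ValueError(f"unknown evidence grade: {grade!r}")
    return len(chain_of_custody) >= MIN_CUSTODY_STEPS[grade]
```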
## 11. Canonical operations
1. **INGEST**: PDF → PNG per page → Haiku vision → `page.md` + entity upsert
2. **LINT**: reverse scan; materializes `mentioned_in[]`, validates wiki-links, reports orphans
3. **QUERY**: reads by wiki-link traversal; never via embeddings
Log every operation in `wiki/log.md` (append-only, fixed format).
## 12. Quality gates (enforced by chief-detective)
Global threshold of **0.85** across 6 rubrics in `case-report.md`:
1. `chain_of_custody_completeness`
2. `confidence_calibration_match`
3. `hypothesis_tournament_discipline` (≥3 hypotheses)
4. `residual_uncertainty_presence`
5. `audit_trail_per_claim`
6. `red_team_pass`
Additional **blocking** lint checks:
- 100% of wiki-links resolve
- `entity.mentioned_in` and `page.entities_extracted` are mutually consistent
- No duplicate `canonical_name` without a `disambiguation_note`
- `pages[]` is contiguous over `1..page_count` for each document
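Two of these gates are simple enough to sketch directly: the rubric threshold and the page-continuity check. The function names are hypothetical. The sketch assumes that every rubric must individually score at least 0.85, that a rubric with no score fails, and that `pages[]` holds integer page numbers.

```python
from typing import Dict, List

THRESHOLD = 0.85
RUBRICS = [
    "chain_of_custody_completeness",
    "confidence_calibration_match",
    "hypothesis_tournament_discipline",
    "residual_uncertainty_presence",
    "audit_trail_per_claim",
    "red_team_pass",
]


def failing_rubrics(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Return the rubrics scoring below the threshold (missing counts as 0)."""
    return [name for name in RUBRICS if scores.get(name, 0.0) < threshold]


def pages_are_contiguous(pages: List[int], page_count: int) -> bool:
    """True when pages is exactly 1..page_count, with no gaps or duplicates."""
    return sorted(pages) == list(range(1, page_count + 1))
```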
## 13. External enrichment triggers
- **≥3 mentions OR a central claim** → `enrichment_status: deep` (WebSearch + ≥2 `external_sources`)
- **1-2 mentions** → `enrichment_status: shallow` (1 query + internal knowledge)
- **0 mentions** (inferred) → `enrichment_status: none`
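These triggers amount to a three-way decision on the mention count. A minimal sketch, assuming the count and a `central_claim` flag are already known for the entity (both parameter names are hypothetical):

```python
def enrichment_status(mentions: int, central_claim: bool = False) -> str:
    """Pick the enrichment depth for an entity from its mention count."""
    if mentions < 0:
        raise ValueError("mentions cannot be negative")
    if mentions >= 3 or central_claim:
        return "deep"
    if mentions >= 1:
        return "shallow"
    return "none"
```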
## 14. Idempotency
Re-ingesting the same PDF (same `sha256`) updates `last_ingest` and preserves `created_at`. Re-linting overwrites `mentioned_in[]` but does not duplicate entries.
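One way to picture this behavior is an upsert keyed by the file hash. The sketch below uses an in-memory dictionary as a stand-in for the wiki's document records; `ingest` and `rebuild_mentioned_in` are hypothetical names, and the real pipeline presumably writes Markdown frontmatter instead.

```python
import hashlib
from typing import Dict, Iterable, List


def ingest(store: Dict[str, dict], pdf_bytes: bytes, now: str) -> dict:
    """Create or refresh the record for a PDF, keyed by its sha256."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    record = store.get(digest)
    if record is None:
        # First ingest: created_at is set once and never touched again.
        record = {"sha256": digest, "created_at": now}
        store[digest] = record
    # Every ingest, first or repeated, refreshes last_ingest.
    record["last_ingest"] = now
    return record


def rebuild_mentioned_in(page_refs: Iterable[str]) -> List[str]:
    """Overwrite-style rebuild: the result never contains duplicates."""
    return sorted(set(page_refs))
```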
## 15. Escalation
When an agent encounters:
- **A contradiction between grade A/B evidence** → escalate to `chief-detective`
- **A surviving hypothesis with posterior >0.70** → multi-detective review
- **A critical gap** → create `[[gap/G-NNNN]]` + link it in `case-report`
## 16. Model
Default for ingest, vision, dedup, lint, enrichment, and Markdown generation: **`claude-haiku-4-5`**.
`case-writer` (final Holmes-Watson narrative) and `chief-detective` (red team review) may optionally use Sonnet for final quality.
## 17. Execution stack
- **PDF → PNG**: `pdftoppm -r 200` (Poppler)
- **PDF → text**: `pdftotext -layout`
- **Vision**: Anthropic Python SDK + Haiku, with prompt caching and the `pdf-2025-03-04` beta header if applicable
- **Linting**: Python (PyYAML + regex)
Scripts live in `/Users/guto/ufo/scripts/`.
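As a rough illustration of how a script might drive the two Poppler tools, the helpers below only build the argument lists. The `-png` flag is an assumption added to match the "PDF → PNG" step (the contract itself only names `-r 200`), and the function names are hypothetical.

```python
from typing import List


def pdftoppm_command(pdf_path: str, out_prefix: str, dpi: int = 200) -> List[str]:
    """Argument list for rendering each PDF page to a PNG at the given DPI."""
    # -png is an assumption; pdftoppm writes PPM files unless a format flag is set.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def pdftotext_command(pdf_path: str, txt_path: str) -> List[str]:
    """Argument list for extracting text while preserving the page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Output should be directed under `processing/`, never `raw/`, in line with the golden rule.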