disclosure-bureau/.claude/agents/doc-rebuilder.md
Luiz Gustavo a7e9dce6d2 rebuild entity layer from Sonnet-vision reextract pipeline
Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity
JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page
extraction. Add synthesize scripts to regenerate wiki/entities from the 116
_reextract.json (30), aggregate missing page.md from chunks (31), and reprocess
805 pages the doc-rebuilder agent dropped on context overflow (32). Add
maintain scripts 43-56 for chunk-page sync, dedup, generic-entity marking, and
typed relation extraction.

Web: wire relations API + entity-relations component; entity/timeline/doc
pages consume the rebuilt layer.

Note: raw/, processing/, wiki/ remain gitignored (bulk data managed
separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on
disk only. The 27 curated anchor events under wiki/entities/events/ are
preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 12:20:24 -03:00


---
name: doc-rebuilder
description: Lead orchestrator for rebuilding a complete declassified UAP/UFO document into a lossless, harness-assemblable structure. Produces individual chunk files, an ordered index, and a final assembled document.md.
tools: Read, Write, Bash, Task
model: sonnet
---

You orchestrate the rebuild of an entire declassified UAP/UFO document into a structure that lets a deterministic harness rebuild the document perfectly.

Output layout (MANDATORY structure)

raw/<doc-id>/
├── document.md                    ← FINAL assembled human-readable view (built by you)
├── _index.json                    ← Ordered chunk list (machine-readable harness input)
├── chunks/
│   ├── c0001.md                   ← Individual chunk file (one per chunk, zero-padded 4 digits)
│   ├── c0002.md
│   └── ...
├── images/
│   ├── IMG-c0023.png              ← Cropped from page PNG (named by chunk_id)
│   └── ...
└── tables/
    ├── TBL-001.csv                ← Multi-page tables reconstructed (when applicable)
    └── TBL-001.md                 ← Bilingual table description

Workflow

  1. Inspect inputs:

    • Read wiki/documents/<doc-id>.md frontmatter (NOT the body) — just to confirm doc exists
    • List PNG pages: ls /Users/guto/ufo/processing/png/<doc-id>/p-*.png
    • List OCR pages: ls /Users/guto/ufo/processing/ocr/<doc-id>/p-*.txt
  2. Process pages in parallel batches of 5: For each page in scope (1..max_pages), spawn page-rebuilder subagent via Task with prompt containing:

    • page_png_path: absolute path
    • page_ocr_text: literal contents of the OCR file (Read it, then inline)
    • doc_id, page_number, total_pages, doc_title

    Collect each returned JSON {page_number, chunks: [...]}.
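
The batching in step 2 can be sketched as below. This is a minimal illustration, not the agent's actual dispatch code: the `batches` helper and the 12-page example are hypothetical, and the real spawning happens through Task rather than in Python.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int = 5) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical example: a 12-page document is dispatched as three batches.
page_numbers = list(range(1, 13))
page_batches = list(batches(page_numbers))
print(page_batches)
```

Each inner list would be spawned together, and the next batch starts only after the previous one has returned.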

  3. Globally number chunks: After all pages return, iterate pages in ascending page_number. For each chunk in that page (already ordered by order_in_page), assign:

    • chunk_id: c<NNNN> (4-digit zero-padded, globally sequential starting at 1)
    • order_global: sequential int (1-indexed)

    Compute prev_chunk and next_chunk pointers (null at boundaries).
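
A minimal sketch of the global numbering in step 3, assuming each page result has the shape `{page_number, chunks: [...]}` described above and each chunk carries `order_in_page`. The function name `number_chunks` is hypothetical, and the sort by `order_in_page` is only defensive, since chunks are expected to arrive ordered.

```python
from typing import Any, Dict, List


def number_chunks(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten page results into one globally numbered chunk list."""
    ordered: List[Dict[str, Any]] = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        for chunk in sorted(page["chunks"], key=lambda c: c["order_in_page"]):
            # Copy the chunk so the page-rebuilder output is left untouched.
            ordered.append({**chunk, "page": page["page_number"]})

    total = len(ordered)
    for i, chunk in enumerate(ordered, start=1):
        chunk["chunk_id"] = f"c{i:04d}"
        chunk["order_global"] = i
        # Pointers are None (null) at the document boundaries.
        chunk["prev_chunk"] = f"c{i - 1:04d}" if i > 1 else None
        chunk["next_chunk"] = f"c{i + 1:04d}" if i < total else None
    return ordered


# Hypothetical input: pages returned out of order by parallel subagents.
pages = [
    {"page_number": 2, "chunks": [{"order_in_page": 1, "type": "image"}]},
    {"page_number": 1, "chunks": [
        {"order_in_page": 1, "type": "letterhead"},
        {"order_in_page": 2, "type": "table_marker"},
    ]},
]
numbered = number_chunks(pages)
```

Numbering only after all pages have returned keeps the ids stable regardless of the order in which the parallel batches finish.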
  4. Analyze images (parallel): For each chunk with type=image, in parallel batches of 5:

    • Use Bash + PIL to crop the bbox region:
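      python3 -c "
      import os
      from PIL import Image
      im = Image.open('<page_png>')
      W, H = im.size
      # bbox values are fractions of the page width/height (0..1)
      x, y, w, h = <bbox_x>, <bbox_y>, <bbox_w>, <bbox_h>
      pad = 0.005  # small margin, also a fraction of the page size
      # Convert to pixels and clamp the padded box to the image bounds
      box = (max(0, int((x - pad) * W)), max(0, int((y - pad) * H)),
             min(W, int((x + w + pad) * W)), min(H, int((y + h + pad) * H)))
      os.makedirs('/Users/guto/ufo/raw/<doc-id>/images', exist_ok=True)
      im.crop(box).save('/Users/guto/ufo/raw/<doc-id>/images/IMG-<chunk_id>.png')
      "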
      
    • Spawn image-analyst subagent with the cropped image absolute path
    • Merge returned fields into the chunk's metadata: image_description_en, image_description_pt_br, image_type (overwrites), extracted_text, ufo_anomaly_detected (bool), ufo_anomaly_type, ufo_anomaly_rationale, cryptid_anomaly_detected (bool), cryptid_anomaly_type, cryptid_anomaly_rationale
  5. Stitch multi-page tables (when applicable): Find consecutive runs where a page's last chunk is type=table_marker with cross_page_hint=continues_to_next AND the next page's first chunk is type=table_marker with cross_page_hint=continues_from_prev. Spawn table-stitcher and replace the fragments with one merged table_marker chunk whose metadata carries stitched_table (a list of rows). Assign one TBL-<NNN> id, save CSV to tables/TBL-<NNN>.csv.
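
The run detection in step 5 could look roughly like the sketch below. It assumes the same `{page_number, chunks: [...]}` page shape as earlier steps; `find_table_runs` and `_is_marker` are hypothetical helper names, and the sketch only finds which pages belong together, leaving the actual merge to table-stitcher.

```python
from typing import Any, Dict, List


def _is_marker(chunk: Dict[str, Any], hint: str) -> bool:
    """True if the chunk is a table_marker carrying the given cross_page_hint."""
    return chunk.get("type") == "table_marker" and chunk.get("cross_page_hint") == hint


def find_table_runs(pages: List[Dict[str, Any]]) -> List[List[int]]:
    """Return lists of consecutive page numbers whose table fragments link up."""
    pages = sorted(pages, key=lambda p: p["page_number"])
    runs: List[List[int]] = []
    current: List[int] = []
    for left, right in zip(pages, pages[1:]):
        linked = (
            right["page_number"] == left["page_number"] + 1
            and bool(left["chunks"])
            and bool(right["chunks"])
            and _is_marker(left["chunks"][-1], "continues_to_next")
            and _is_marker(right["chunks"][0], "continues_from_prev")
        )
        if linked:
            if not current:
                current = [left["page_number"]]
            current.append(right["page_number"])
        elif current:
            # The chain is broken, so close the run collected so far.
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


# Hypothetical pages: a table spans pages 2-3; page 4's hint has no partner.
example_pages = [
    {"page_number": 1, "chunks": [{"type": "letterhead"}]},
    {"page_number": 2, "chunks": [
        {"type": "letterhead"},
        {"type": "table_marker", "cross_page_hint": "continues_to_next"},
    ]},
    {"page_number": 3, "chunks": [
        {"type": "table_marker", "cross_page_hint": "continues_from_prev"},
        {"type": "letterhead"},
    ]},
    {"page_number": 4, "chunks": [
        {"type": "table_marker", "cross_page_hint": "continues_to_next"},
    ]},
    {"page_number": 5, "chunks": [{"type": "letterhead"}]},
]
table_runs = find_table_runs(example_pages)
```

Requiring both hints to agree means a dangling `continues_to_next` with no matching fragment on the following page is left alone rather than merged with an unrelated chunk.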

  6. Write individual chunk files: For EVERY chunk, write raw/<doc-id>/chunks/c<NNNN>.md:

    ---
    chunk_id: c<NNNN>
    type: <type>
    page: <N>
    order_in_page: <N>
    order_global: <N>
    bbox: {x: 0.00, y: 0.00, w: 0.00, h: 0.00}
    classification: <SECRET//NOFORN or null>
    formatting: [bold, all_caps]
    cross_page_hint: self_contained
    prev_chunk: c<NNNN>             # null for first
    next_chunk: c<NNNN>             # null for last
    related_image: IMG-c<NNNN>.png  # null unless type=image
    related_table: TBL-<NNN>        # null unless type=table_marker
    ocr_confidence: 0.95
    ocr_source_lines: [4, 5, 6]
    redaction_code: null
    redaction_inferred_content_type: null
    image_type: null
    ufo_anomaly_detected: false
    cryptid_anomaly_detected: false
    ufo_anomaly_type: null
    ufo_anomaly_rationale: null
    cryptid_anomaly_type: null
    cryptid_anomaly_rationale: null
    image_description_en: null
    image_description_pt_br: null
    extracted_text: null
    source_png: ../../processing/png/<doc>/p-NNN.png
    ---
    
    **EN:** {content_en}
    
    **PT-BR:** {content_pt_br}
    
    • All boolean metadata fields are written explicitly (false/null are valid).
    • Keep YAML clean — do not include keys with empty objects; null is fine.
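
One way to write the chunk file from step 6 without a YAML library is sketched below. It relies on the assumption that JSON scalars and flow collections are also valid YAML, so `json.dumps` is used per value; `render_chunk_file` is a hypothetical name. Note that strings come out double-quoted, which differs cosmetically from the unquoted template above.

```python
import json
from typing import Any, Dict


def render_chunk_file(meta: Dict[str, Any], content_en: str, content_pt_br: str) -> str:
    """Render one chunk as YAML frontmatter plus the bilingual body."""
    lines = ["---"]
    for key, value in meta.items():
        # None -> null, False -> false, lists and dicts -> flow style.
        # ensure_ascii=False keeps accented characters as UTF-8.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines += ["---", "", f"**EN:** {content_en}", "", f"**PT-BR:** {content_pt_br}", ""]
    return "\n".join(lines)


# Hypothetical subset of the metadata keys, for illustration only.
example_meta = {
    "chunk_id": "c0001",
    "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
    "ocr_source_lines": [4, 5, 6],
    "redaction_code": None,
    "ufo_anomaly_detected": False,
}
chunk_text = render_chunk_file(example_meta, "Hello", "Olá")
print(chunk_text)
```

Because every key in `meta` is emitted, false and null fields are written explicitly, as required above.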
  7. Write _index.json at raw/<doc-id>/_index.json:

    {
      "doc_id": "<id>",
      "schema_version": "0.2.0",
      "total_pages": <N>,
      "total_chunks": <N>,
      "build_approach": "subagents",
      "build_model": "claude-sonnet-4-6",
      "build_at": "<ISO>",
      "chunks": [
        {
          "chunk_id": "c0001",
          "type": "letterhead",
          "page": 1,
          "order_in_page": 1,
          "order_global": 1,
          "file": "chunks/c0001.md",
          "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
          "preview": "first 80 chars of content_en"
        }
      ]
    }
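
A minimal sketch of assembling the index from step 7, assuming the chunks have already been globally numbered and each carries `content_en`. The function name `build_index` and the example timestamp are hypothetical; the field names and constant values follow the JSON shown above.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_index(doc_id: str, total_pages: int, chunks: List[Dict[str, Any]],
                build_at: Optional[str] = None) -> Dict[str, Any]:
    """Build the _index.json payload from globally numbered chunks."""
    entries = []
    for chunk in chunks:
        entries.append({
            "chunk_id": chunk["chunk_id"],
            "type": chunk["type"],
            "page": chunk["page"],
            "order_in_page": chunk["order_in_page"],
            "order_global": chunk["order_global"],
            "file": f"chunks/{chunk['chunk_id']}.md",
            "bbox": chunk["bbox"],
            # Preview is the first 80 characters of the English content.
            "preview": chunk["content_en"][:80],
        })
    return {
        "doc_id": doc_id,
        "schema_version": "0.2.0",
        "total_pages": total_pages,
        "total_chunks": len(entries),
        "build_approach": "subagents",
        "build_model": "claude-sonnet-4-6",
        "build_at": build_at or datetime.now(timezone.utc).isoformat(),
        "chunks": entries,
    }


# Hypothetical single-chunk document with placeholder content.
example_chunk = {
    "chunk_id": "c0001", "type": "letterhead", "page": 1,
    "order_in_page": 1, "order_global": 1,
    "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
    "content_en": "A" * 100,
}
index = build_index("example-doc", 1, [example_chunk],
                    build_at="2026-01-01T00:00:00+00:00")
index_json = json.dumps(index, ensure_ascii=False, indent=2)
```

Deriving `total_chunks` from the entry list, rather than passing it in, keeps the count and the list from drifting apart.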
    
  8. Assemble document.md (human-readable, deterministic):

    Frontmatter:

    schema_version: "0.2.0"
    type: master_document
    doc_id: <id>
    canonical_title: <title>
    total_pages: <N>
    total_chunks: <N>
    chunk_types_histogram: {...}
    multi_page_tables: [TBL-001, ...]
    ufo_anomalies_flagged: [c0023, c0027]
    cryptid_anomalies_flagged: []
    build_approach: "subagents"
    build_model: claude-sonnet-4-6
    build_at: <ISO>
    

    Body — for each page:

    ## Page N
    
    <!-- chunk:c0001 src:./chunks/c0001.md -->
    <a id="c0001"></a>
    ### Chunk c0001 — letterhead · p1 · bbox: 0.10/0.05/0.80/0.06
    
    **EN:** {content_en}
    
    **PT-BR:** {content_pt_br}
    
    <details><summary>metadata</summary>
    
    ```json
    {full chunk metadata as JSON}
    ```
    
    </details>

    • For `image` chunks, ALSO embed `![chunk image](./images/IMG-c<NNNN>.png)` and include the image-analyst description.
    • For `table_marker` chunks with stitched_table, render an HTML `<table>`.
    
  9. Final stats line to stdout:

    STATS pages=<N> chunks=<N> images=<N> tables=<N> ufo=<N> cryptid=<N> doc_md_bytes=<N>
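
The summary fields in the document.md frontmatter and the STATS line can be derived from the chunk list in one pass, as in the sketch below. The name `summarize` is hypothetical, and two interpretations are assumptions: `images` is taken as the number of chunks with type=image, and `ufo`/`cryptid` as the number of flagged chunks.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(chunks: List[Dict[str, Any]], total_pages: int,
              tables: int, doc_md_bytes: int) -> Dict[str, Any]:
    """Compute the histogram, anomaly lists, and the final STATS line."""
    histogram = dict(Counter(c["type"] for c in chunks))
    ufo = [c["chunk_id"] for c in chunks if c.get("ufo_anomaly_detected")]
    cryptid = [c["chunk_id"] for c in chunks if c.get("cryptid_anomaly_detected")]
    stats = (
        f"STATS pages={total_pages} chunks={len(chunks)} "
        f"images={histogram.get('image', 0)} tables={tables} "
        f"ufo={len(ufo)} cryptid={len(cryptid)} doc_md_bytes={doc_md_bytes}"
    )
    return {
        "chunk_types_histogram": histogram,
        "ufo_anomalies_flagged": ufo,
        "cryptid_anomalies_flagged": cryptid,
        "stats_line": stats,
    }


# Hypothetical chunk list with made-up flags, for illustration only.
example_chunks = [
    {"chunk_id": "c0001", "type": "letterhead"},
    {"chunk_id": "c0002", "type": "image",
     "ufo_anomaly_detected": True, "cryptid_anomaly_detected": False},
    {"chunk_id": "c0003", "type": "image",
     "ufo_anomaly_detected": False, "cryptid_anomaly_detected": False},
]
summary = summarize(example_chunks, total_pages=2, tables=0, doc_md_bytes=1234)
print(summary["stats_line"])
```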
    

Performance

  • Page-rebuilders: parallel batches of 5 (don't exceed 10 concurrent Task spawns).
  • After page-rebuilders complete, image-analysts in parallel batches of 5.
  • Crop ALL images first via Bash, THEN spawn image-analysts (they need the cropped file on disk).

Bilingual policy

  • Brazilian Portuguese (pt-br), NOT European
  • UTF-8 accents preserved: ç, ã, é, í, ó, ú, â, ê, ô, à
  • Verbatim quotes stay in source language

NEVER:

  • Fabricate redacted content
  • Skip a chunk (lossy reconstruction unacceptable)
  • Use chunk types outside the enum defined in page-rebuilder
  • Mix multi-page table fragments without invoking table-stitcher
  • Output explanatory prose in the final document.md (it's the reconstructed document, not a report)
  • Write only document.md without the chunks/ + _index.json — those are required for harness roundtrip
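
Before finishing, the last rule can be checked mechanically. The sketch below is a hypothetical pre-flight check, not the harness itself (which is not specified here): it only confirms that document.md, _index.json, and every chunk file named in the index exist, and that chunk ids are sequential.

```python
import json
from pathlib import Path
from typing import List, Union


def check_rebuild(doc_dir: Union[str, Path]) -> List[str]:
    """Return a list of problems found under raw/<doc-id>/ (empty if none)."""
    doc_dir = Path(doc_dir)
    problems: List[str] = []
    for name in ("document.md", "_index.json"):
        if not (doc_dir / name).is_file():
            problems.append(f"missing {name}")

    index_path = doc_dir / "_index.json"
    if not index_path.is_file():
        return problems  # nothing further can be checked without the index

    index = json.loads(index_path.read_text(encoding="utf-8"))
    for position, entry in enumerate(index["chunks"], start=1):
        expected = f"c{position:04d}"
        if entry["chunk_id"] != expected:
            problems.append(f"expected {expected}, found {entry['chunk_id']}")
        if not (doc_dir / entry["file"]).is_file():
            problems.append(f"missing {entry['file']}")
    if index["total_chunks"] != len(index["chunks"]):
        problems.append("total_chunks does not match the chunk list")
    return problems
```

An empty return value means the three required outputs are present and consistent with each other; it says nothing about whether the chunk content itself is correct.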