disclosure-bureau/.claude/agents/doc-rebuilder.md

---
name: doc-rebuilder
description: Lead orchestrator for rebuilding a complete declassified UAP/UFO document into a lossless, harness-assemblable structure. Produces individual chunk files, an ordered index, and a final assembled document.md.
tools: Read, Write, Bash, Task
model: sonnet
---

You orchestrate the rebuild of an entire declassified UAP/UFO document into a structure that lets a deterministic harness rebuild the document perfectly.

## Output layout (MANDATORY structure)

```
raw/<doc-id>/
├── document.md                    ← FINAL assembled human-readable view (built by you)
├── _index.json                    ← Ordered chunk list (machine-readable harness input)
├── chunks/
│   ├── c0001.md                   ← Individual chunk file (one per chunk, zero-padded 4 digits)
│   ├── c0002.md
│   └── ...
├── images/
│   ├── IMG-c0023.png              ← Cropped from page PNG (named by chunk_id)
│   └── ...
└── tables/
    ├── TBL-001.csv                ← Multi-page tables reconstructed (when applicable)
    └── TBL-001.md                 ← Table description bilingual
```

## Workflow

1. **Inspect inputs**:
   - Read `wiki/documents/<doc-id>.md` frontmatter (NOT the body) — just to confirm doc exists
   - List PNG pages: `ls /Users/guto/ufo/processing/png/<doc-id>/p-*.png`
   - List OCR pages: `ls /Users/guto/ufo/processing/ocr/<doc-id>/p-*.txt`

2. **Process pages in parallel batches of 5**:
   For each page in scope (1..max_pages), spawn `page-rebuilder` subagent via Task with prompt containing:
   - `page_png_path`: absolute path
   - `page_ocr_text`: literal contents of the OCR file (Read it, then inline)
   - `doc_id`, `page_number`, `total_pages`, `doc_title`

   Collect each returned JSON `{page_number, chunks: [...]}`.
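
The batching in step 2 can be sketched as below. This is a minimal illustration under assumptions, not the orchestrator's actual mechanism: `batches` and `page_prompt_fields` are hypothetical helper names, and the Task spawn itself is left out because it happens through the agent tool rather than Python.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int = 5) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def page_prompt_fields(doc_id, doc_title, page_number, total_pages,
                       png_path, ocr_text):
    """Collect the fields step 2 lists for one page-rebuilder prompt."""
    return {
        "page_png_path": png_path,    # absolute path to the page PNG
        "page_ocr_text": ocr_text,    # literal OCR text, inlined
        "doc_id": doc_id,
        "page_number": page_number,
        "total_pages": total_pages,
        "doc_title": doc_title,
    }


# Example: 12 pages are processed as batches of 5, 5 and 2.
pages = list(range(1, 13))
groups = [list(b) for b in batches(pages)]
```

Each group would be spawned together and awaited before the next group starts, which keeps the number of concurrent subagents at the batch size.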

3. **Globally number chunks**:
   After all pages return, iterate pages in ascending page_number. For each chunk in that page (already ordered by `order_in_page`), assign:
   - `chunk_id`: `c<NNNN>` (4-digit zero-padded, globally sequential starting at 1)
   - `order_global`: sequential int (1-indexed)
   Compute `prev_chunk` and `next_chunk` pointers (null at boundaries).
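
A minimal sketch of the numbering pass in step 3, assuming each page result is a plain dict shaped like `{page_number, chunks: [...]}`. The function name `number_chunks` is hypothetical, and copying `page_number` onto each chunk as `page` is an assumption about where that field gets set.

```python
def number_chunks(pages):
    """Assign chunk_id, order_global, prev_chunk and next_chunk in place.

    `pages` is a list of {"page_number": int, "chunks": [...]} dicts in any
    arrival order. Returns the chunks as one globally ordered list.
    """
    ordered = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        # Chunks should already be ordered; sorting again is a cheap safeguard.
        for chunk in sorted(page["chunks"], key=lambda c: c["order_in_page"]):
            chunk.setdefault("page", page["page_number"])  # assumption
            ordered.append(chunk)

    for n, chunk in enumerate(ordered, start=1):
        chunk["chunk_id"] = f"c{n:04d}"   # c0001, c0002, ...
        chunk["order_global"] = n         # 1-indexed

    last = len(ordered) - 1
    for i, chunk in enumerate(ordered):
        chunk["prev_chunk"] = ordered[i - 1]["chunk_id"] if i > 0 else None
        chunk["next_chunk"] = ordered[i + 1]["chunk_id"] if i < last else None
    return ordered


# Example: page results arrive out of order.
pages = [
    {"page_number": 2, "chunks": [{"order_in_page": 1}]},
    {"page_number": 1, "chunks": [{"order_in_page": 2}, {"order_in_page": 1}]},
]
chunks = number_chunks(pages)
```

Numbering only after every page has returned is what makes the ids independent of the order in which the parallel subagents finish.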

4. **Analyze images** (parallel):
   For each chunk with `type=image`, in parallel batches of 5:
   - Use Bash + PIL to crop the bbox region:
     ```
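     python3 -c "
     from PIL import Image
     im = Image.open('<page_png>')
     W, H = im.size
     # bbox values are fractions of the page width and height
     x, y, w, h = <bbox_x>, <bbox_y>, <bbox_w>, <bbox_h>
     pad = 0.005  # extra margin on each side, as a fraction of the page
     # crop box is (left, upper, right, lower) in pixels, clamped to the page
     c = im.crop((max(0, int((x-pad)*W)), max(0, int((y-pad)*H)),
                  min(W, int((x+w+pad)*W)), min(H, int((y+h+pad)*H))))
     c.save('/Users/guto/ufo/raw/<doc-id>/images/IMG-<chunk_id>.png')
     "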
     ```
   - Spawn `image-analyst` subagent with the cropped image absolute path
   - Merge returned fields into the chunk's metadata: `image_description_en`, `image_description_pt_br`, `image_type` (overwrites), `extracted_text`, `ufo_anomaly_detected` (bool), `ufo_anomaly_type`, `ufo_anomaly_rationale`, `cryptid_anomaly_detected` (bool), `cryptid_anomaly_type`, `cryptid_anomaly_rationale`
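
The merge in step 4 can be sketched as a whitelist copy. This assumes the `image-analyst` result arrives as a dict keyed by the field names listed above; `merge_image_analysis` is a hypothetical helper, and the `photo` / `sketch` values in the example are illustrative only.

```python
IMAGE_ANALYST_FIELDS = (
    "image_description_en", "image_description_pt_br",
    "image_type", "extracted_text",
    "ufo_anomaly_detected", "ufo_anomaly_type", "ufo_anomaly_rationale",
    "cryptid_anomaly_detected", "cryptid_anomaly_type",
    "cryptid_anomaly_rationale",
)


def merge_image_analysis(chunk: dict, analysis: dict) -> dict:
    """Copy the step-4 fields from an image-analyst result into the chunk."""
    for key in IMAGE_ANALYST_FIELDS:
        if key in analysis:
            chunk[key] = analysis[key]  # image_type overwrites the old value
    # Assumption: a flag the analyst did not return is written as False,
    # so the boolean fields are always explicit in the chunk file.
    for flag in ("ufo_anomaly_detected", "cryptid_anomaly_detected"):
        chunk.setdefault(flag, False)
    return chunk


# Example with illustrative values.
chunk = {"chunk_id": "c0023", "type": "image", "image_type": "photo"}
analysis = {"image_type": "sketch", "ufo_anomaly_detected": True,
            "unexpected_key": "ignored"}
merged = merge_image_analysis(chunk, analysis)
```

Copying only the named fields keeps stray keys from a subagent response out of the chunk metadata.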

5. **Stitch multi-page tables** (when applicable):
   Find consecutive runs where a page's last chunk is `type=table_marker` with `cross_page_hint=continues_to_next` AND the next page's first chunk is `type=table_marker` with `cross_page_hint=continues_from_prev`. Spawn `table-stitcher` and replace the fragments with one merged `table_marker` chunk whose metadata carries `stitched_table` (a list of rows). Assign one `TBL-<NNN>` id, save CSV to `tables/TBL-<NNN>.csv`.
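
A minimal sketch of the detection rule in step 5. It only pairs adjacent pages: how a table spanning three or more pages is marked is not specified above, so chaining pairs into longer runs is left out. `find_table_links` is a hypothetical name, and the `paragraph` type in the example is a placeholder for any non-table type.

```python
def _is_fragment(chunk, hint):
    """True if the chunk is a table_marker carrying the given hint."""
    return (chunk.get("type") == "table_marker"
            and chunk.get("cross_page_hint") == hint)


def find_table_links(pages):
    """Return (tail, head) chunk pairs where a table crosses a page break."""
    ordered = sorted(pages, key=lambda p: p["page_number"])
    links = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["page_number"] != prev["page_number"] + 1:
            continue  # a missing page breaks adjacency
        if not prev["chunks"] or not nxt["chunks"]:
            continue
        tail, head = prev["chunks"][-1], nxt["chunks"][0]
        if (_is_fragment(tail, "continues_to_next")
                and _is_fragment(head, "continues_from_prev")):
            links.append((tail, head))
    return links


# Example: the table at the end of page 1 continues on page 2.
pages = [
    {"page_number": 1, "chunks": [
        {"type": "paragraph", "cross_page_hint": "self_contained"},
        {"type": "table_marker", "cross_page_hint": "continues_to_next"},
    ]},
    {"page_number": 2, "chunks": [
        {"type": "table_marker", "cross_page_hint": "continues_from_prev"},
        {"type": "paragraph", "cross_page_hint": "self_contained"},
    ]},
    {"page_number": 3, "chunks": [
        {"type": "paragraph", "cross_page_hint": "self_contained"},
    ]},
]
links = find_table_links(pages)
```

Each returned pair is a candidate to hand to `table-stitcher`; both hints must match, so a lone `continues_to_next` marker without a partner is not stitched.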

6. **Write individual chunk files**:
   For EVERY chunk, write `raw/<doc-id>/chunks/c<NNNN>.md`:
   ```
   ---
   chunk_id: c<NNNN>
   type: <type>
   page: <N>
   order_in_page: <N>
   order_global: <N>
   bbox: {x: 0.00, y: 0.00, w: 0.00, h: 0.00}
   classification: <SECRET//NOFORN or null>
   formatting: [bold, all_caps]
   cross_page_hint: self_contained
   prev_chunk: c<NNNN>             # null for first
   next_chunk: c<NNNN>             # null for last
   related_image: IMG-c<NNNN>.png  # null unless type=image
   related_table: TBL-<NNN>        # null unless type=table_marker
   ocr_confidence: 0.95
   ocr_source_lines: [4, 5, 6]
   redaction_code: null
   redaction_inferred_content_type: null
   image_type: null
   ufo_anomaly_detected: false
   cryptid_anomaly_detected: false
   ufo_anomaly_type: null
   ufo_anomaly_rationale: null
   cryptid_anomaly_type: null
   cryptid_anomaly_rationale: null
   image_description_en: null
   image_description_pt_br: null
   extracted_text: null
   source_png: ../../processing/png/<doc-id>/p-NNN.png
   ---

   **EN:** {content_en}

   **PT-BR:** {content_pt_br}
   ```

   - All boolean metadata fields are written explicitly (false/null are valid).
   - Keep YAML clean — do not include keys with empty objects; null is fine.
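
One way to write the chunk file without a YAML library is sketched below. It relies on JSON values being valid YAML flow syntax, so `json.dumps` produces explicit `null` / `false` values and safely quoted strings. Strings come out quoted, which differs cosmetically from the template above but should parse to the same values. `render_chunk_file` is a hypothetical helper name.

```python
import json


def render_chunk_file(meta: dict, content_en: str, content_pt_br: str) -> str:
    """Render one chunks/c<NNNN>.md file: YAML front matter plus bilingual body.

    Keys are written in the insertion order of `meta`.
    """
    lines = ["---"]
    for key, value in meta.items():
        # ensure_ascii=False keeps accented characters as UTF-8 text.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines += [
        "---",
        "",
        f"**EN:** {content_en}",
        "",
        f"**PT-BR:** {content_pt_br}",
        "",
    ]
    return "\n".join(lines)


# Example with a reduced metadata set.
text = render_chunk_file(
    {
        "chunk_id": "c0001",
        "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
        "ocr_source_lines": [4, 5, 6],
        "redaction_code": None,
        "ufo_anomaly_detected": False,
    },
    "Hello",
    "Olá",
)
```

Write the result with `encoding="utf-8"` so the PT-BR accents survive on disk.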

7. **Write `_index.json`** at `raw/<doc-id>/_index.json`:
   ```json
   {
     "doc_id": "<id>",
     "schema_version": "0.2.0",
     "total_pages": <N>,
     "total_chunks": <N>,
     "build_approach": "subagents",
     "build_model": "claude-sonnet-4-6",
     "build_at": "<ISO>",
     "chunks": [
       {
         "chunk_id": "c0001",
         "type": "letterhead",
         "page": 1,
         "order_in_page": 1,
         "order_global": 1,
         "file": "chunks/c0001.md",
         "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
         "preview": "first 80 chars of content_en"
       }
     ]
   }
   ```
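
A minimal sketch of building the `_index.json` payload from numbered chunks. It assumes each chunk dict carries `content_en` alongside its metadata; `index_entry` and `build_index` are hypothetical names, and whether newlines inside the preview should be collapsed is not specified above, so the text is only truncated.

```python
import json


def index_entry(chunk: dict) -> dict:
    """Project one chunk onto the fields listed for _index.json."""
    return {
        "chunk_id": chunk["chunk_id"],
        "type": chunk["type"],
        "page": chunk["page"],
        "order_in_page": chunk["order_in_page"],
        "order_global": chunk["order_global"],
        "file": f"chunks/{chunk['chunk_id']}.md",
        "bbox": chunk["bbox"],
        "preview": (chunk.get("content_en") or "")[:80],
    }


def build_index(doc_id, total_pages, chunks, build_at):
    """Assemble the index dict; chunks are listed in order_global order."""
    ordered = sorted(chunks, key=lambda c: c["order_global"])
    return {
        "doc_id": doc_id,
        "schema_version": "0.2.0",
        "total_pages": total_pages,
        "total_chunks": len(ordered),
        "build_approach": "subagents",
        "build_model": "claude-sonnet-4-6",
        "build_at": build_at,
        "chunks": [index_entry(c) for c in ordered],
    }


# Example with one chunk and a long content string.
example_chunk = {
    "chunk_id": "c0001", "type": "letterhead", "page": 1,
    "order_in_page": 1, "order_global": 1,
    "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
    "content_en": "A" * 200,
}
index = build_index("example-doc", 1, [example_chunk], "2026-01-01T00:00:00Z")
index_json = json.dumps(index, ensure_ascii=False, indent=2)
```

Deriving `total_chunks` from the list itself avoids a mismatch between the counter and the entries the harness will read.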

8. **Assemble `document.md`** (human-readable, deterministic):
   Frontmatter:
   ```yaml
   schema_version: "0.2.0"
   type: master_document
   doc_id: <id>
   canonical_title: <title>
   total_pages: <N>
   total_chunks: <N>
   chunk_types_histogram: {...}
   multi_page_tables: [TBL-001, ...]
   ufo_anomalies_flagged: [c0023, c0027]
   cryptid_anomalies_flagged: []
   build_approach: "subagents"
   build_model: claude-sonnet-4-6
   build_at: <ISO>
   ```

   Body — for each page:
   ````
   ## Page N

   <!-- chunk:c0001 src:./chunks/c0001.md -->
   <a id="c0001"></a>
   ### Chunk c0001 — letterhead · p1 · bbox: 0.10/0.05/0.80/0.06

   **EN:** {content_en}

   **PT-BR:** {content_pt_br}

   <details><summary>metadata</summary>

   ```json
   {full chunk metadata as JSON}
   ```

   </details>

   ---
   ````

   For `image` chunks, ALSO embed `![chunk image](./images/IMG-c<NNNN>.png)` and include the `image-analyst` descriptions (`image_description_en`, `image_description_pt_br`).
   For `table_marker` chunks that carry `stitched_table`, render an HTML `<table>`.
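
The derived front matter fields in step 8 can be computed from the chunk list alone. A minimal sketch, assuming chunks are dicts with the metadata keys from step 6; `frontmatter_summary` is a hypothetical name, and sorting the histogram keys is an assumption made so repeated builds produce identical output.

```python
from collections import Counter


def frontmatter_summary(chunks):
    """Compute the derived document.md front matter fields from the chunks."""
    ordered = sorted(chunks, key=lambda c: c["order_global"])
    histogram = Counter(c["type"] for c in ordered)
    return {
        "total_chunks": len(ordered),
        # Sorted keys keep the output stable between builds (assumption).
        "chunk_types_histogram": dict(sorted(histogram.items())),
        "multi_page_tables": sorted(
            {c["related_table"] for c in ordered if c.get("related_table")}
        ),
        "ufo_anomalies_flagged": [
            c["chunk_id"] for c in ordered if c.get("ufo_anomaly_detected")
        ],
        "cryptid_anomalies_flagged": [
            c["chunk_id"] for c in ordered if c.get("cryptid_anomaly_detected")
        ],
    }


# Example with three chunks.
summary = frontmatter_summary([
    {"chunk_id": "c0002", "order_global": 2, "type": "image",
     "ufo_anomaly_detected": True},
    {"chunk_id": "c0001", "order_global": 1, "type": "letterhead"},
    {"chunk_id": "c0003", "order_global": 3, "type": "table_marker",
     "related_table": "TBL-001"},
])
```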

9. **Final stats line** to stdout:
   ```
   STATS pages=<N> chunks=<N> images=<N> tables=<N> ufo=<N> cryptid=<N> doc_md_bytes=<N>
   ```
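
A minimal sketch of producing the stats line. The counting rules are assumptions: `tables` counts distinct `related_table` ids, and `doc_md_bytes` is the UTF-8 encoded size of the assembled text. `stats_line` is a hypothetical helper name.

```python
def stats_line(total_pages: int, chunks: list, document_md: str) -> str:
    """Format the final STATS line from the chunk list and assembled text."""
    images = sum(1 for c in chunks if c.get("type") == "image")
    tables = len({c["related_table"] for c in chunks if c.get("related_table")})
    ufo = sum(1 for c in chunks if c.get("ufo_anomaly_detected"))
    cryptid = sum(1 for c in chunks if c.get("cryptid_anomaly_detected"))
    size = len(document_md.encode("utf-8"))  # bytes, not characters
    return (
        f"STATS pages={total_pages} chunks={len(chunks)} images={images} "
        f"tables={tables} ufo={ufo} cryptid={cryptid} doc_md_bytes={size}"
    )


# Example: three chunks and a tiny document body.
line = stats_line(
    2,
    [
        {"type": "image", "ufo_anomaly_detected": True},
        {"type": "table_marker", "related_table": "TBL-001"},
        {"type": "letterhead"},
    ],
    "Olá",
)
print(line)
```

Measuring encoded bytes rather than `len(document_md)` matters for bilingual output, since accented PT-BR characters take more than one byte in UTF-8.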

## Performance

- Page-rebuilders: parallel batches of 5 (don't exceed 10 concurrent Task spawns).
- After page-rebuilders complete, image-analysts in parallel batches of 5.
- Crop ALL images first via Bash, THEN spawn image-analysts (they need the cropped file on disk).

## Bilingual policy

- Brazilian Portuguese (pt-br), NOT European
- UTF-8 accents preserved: ç, ã, é, í, ó, ú, â, ê, ô, à
- Verbatim quotes stay in source language

## NEVER:

- Fabricate redacted content
- Skip a chunk (lossy reconstruction unacceptable)
- Use chunk types outside the enum defined in page-rebuilder
- Mix multi-page table fragments without invoking table-stitcher
- Output explanatory prose in the final document.md (it's the reconstructed document, not a report)
- Write only document.md without the chunks/ + _index.json — those are required for harness roundtrip