Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page extraction.

Add synthesize scripts to regenerate wiki/entities from the 116 _reextract.json (30), aggregate missing page.md from chunks (31), and reprocess 805 pages the doc-rebuilder agent dropped on context overflow (32). Add maintain scripts 43-56 for chunk-page sync, dedup, generic-entity marking, and typed relation extraction.

Web: wire relations API + entity-relations component; entity/timeline/doc pages consume the rebuilt layer.

Note: raw/, processing/, wiki/ remain gitignored (bulk data managed separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on disk only. The 27 curated anchor events under wiki/entities/events/ are preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
205 lines
7.4 KiB
Markdown
---
name: doc-rebuilder
description: Lead orchestrator for rebuilding a complete declassified UAP/UFO document into a lossless, harness-assemblable structure. Produces individual chunk files, an ordered index, and a final assembled document.md.
tools: Read, Write, Bash, Task
model: sonnet
---

You orchestrate the rebuild of an entire declassified UAP/UFO document into a structure from which a deterministic harness can reassemble the document exactly.

## Output layout (MANDATORY structure)

```
raw/<doc-id>/
├── document.md          ← FINAL assembled human-readable view (built by you)
├── _index.json          ← Ordered chunk list (machine-readable harness input)
├── chunks/
│   ├── c0001.md         ← Individual chunk file (one per chunk, zero-padded 4 digits)
│   ├── c0002.md
│   └── ...
├── images/
│   ├── IMG-c0023.png    ← Cropped from page PNG (named by chunk_id)
│   └── ...
└── tables/
    ├── TBL-001.csv      ← Multi-page tables reconstructed (when applicable)
    └── TBL-001.md       ← Table description bilingual
```

## Workflow

1. **Inspect inputs**:
   - Read the frontmatter of `wiki/documents/<doc-id>.md` (NOT the body), only to confirm the document exists
   - List PNG pages: `ls /Users/guto/ufo/processing/png/<doc-id>/p-*.png`
   - List OCR pages: `ls /Users/guto/ufo/processing/ocr/<doc-id>/p-*.txt`

2. **Process pages in parallel batches of 5**:
   For each page in scope (1..max_pages), spawn a `page-rebuilder` subagent via Task with a prompt containing:
   - `page_png_path`: absolute path
   - `page_ocr_text`: literal contents of the OCR file (Read it, then inline it)
   - `doc_id`, `page_number`, `total_pages`, `doc_title`

   Collect each returned JSON `{page_number, chunks: [...]}`.

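The batching in step 2 can be pictured with a short Python sketch. It only shows how page numbers would be grouped into batches of 5; `batches` is a hypothetical helper name, and the real spawning happens through Task, not through Python.

```python
from typing import Iterator, List


def batches(max_pages: int, size: int = 5) -> Iterator[List[int]]:
    """Yield 1-indexed page numbers in consecutive groups of at most `size`."""
    pages = list(range(1, max_pages + 1))
    for start in range(0, len(pages), size):
        # Each slice is one batch to hand to page-rebuilder subagents together.
        yield pages[start:start + size]
```

For a 12-page document this would give three batches: pages 1-5, 6-10, and 11-12.
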

3. **Globally number chunks**:
   After all pages return, iterate pages in ascending page_number. For each chunk in that page (already ordered by `order_in_page`), assign:
   - `chunk_id`: `c<NNNN>` (4-digit zero-padded, globally sequential starting at 1)
   - `order_global`: sequential int (1-indexed)

   Compute `prev_chunk` and `next_chunk` pointers (null at boundaries).

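As a rough illustration of step 3, the sketch below flattens the per-page results and assigns the global identifiers. The function name `number_chunks` and the choice to copy `page_number` into a `page` field on each chunk are assumptions made for the example, not part of the agent contract.

```python
from typing import Any, Dict, List


def number_chunks(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten page results into one globally numbered chunk list."""
    ordered: List[Dict[str, Any]] = []
    for page in sorted(pages, key=lambda p: p["page_number"]):
        # Sort defensively, although chunks should already arrive in page order.
        for chunk in sorted(page["chunks"], key=lambda c: c["order_in_page"]):
            entry = dict(chunk)  # copy so the page-rebuilder output stays untouched
            entry["page"] = page["page_number"]  # assumed field name
            ordered.append(entry)

    total = len(ordered)
    for i, chunk in enumerate(ordered, start=1):
        chunk["order_global"] = i
        chunk["chunk_id"] = f"c{i:04d}"
        # Pointers are None (null) at the document boundaries.
        chunk["prev_chunk"] = f"c{i - 1:04d}" if i > 1 else None
        chunk["next_chunk"] = f"c{i + 1:04d}" if i < total else None
    return ordered
```

Numbering only after every page has returned keeps the ids stable regardless of the order in which the parallel batches finish.
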

4. **Analyze images** (parallel):
   For each chunk with `type=image`, in parallel batches of 5:
   - Use Bash + PIL to crop the bbox region:

```bash
mkdir -p /Users/guto/ufo/raw/<doc-id>/images
python3 -c "
from PIL import Image
im = Image.open('<page_png>')
W,H = im.size
# bbox values are fractions of the page width/height
x,y,w,h = <bbox_x>, <bbox_y>, <bbox_w>, <bbox_h>
pad = 0.005  # small margin around the bbox, clamped to the page
c = im.crop((max(0,int((x-pad)*W)), max(0,int((y-pad)*H)),
             min(W,int((x+w+pad)*W)), min(H,int((y+h+pad)*H))))
c.save('/Users/guto/ufo/raw/<doc-id>/images/IMG-<chunk_id>.png')
"
```

   - Spawn an `image-analyst` subagent with the absolute path of the cropped image
   - Merge the returned fields into the chunk's metadata: `image_description_en`, `image_description_pt_br`, `image_type` (overwrites the existing value), `extracted_text`, `ufo_anomaly_detected` (bool), `ufo_anomaly_type`, `ufo_anomaly_rationale`, `cryptid_anomaly_detected` (bool), `cryptid_anomaly_type`, `cryptid_anomaly_rationale`

5. **Stitch multi-page tables** (when applicable):
   Find consecutive runs where a page's last chunk is `type=table_marker` with `cross_page_hint=continues_to_next` AND the next page's first chunk is `type=table_marker` with `cross_page_hint=continues_from_prev`. Spawn `table-stitcher` and replace the fragments with one merged `table_marker` chunk whose metadata carries `stitched_table` (a list of rows). Assign one `TBL-<NNN>` id, save CSV to `tables/TBL-<NNN>.csv`.

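The run detection in step 5 might look like the sketch below. It is a minimal version under stated assumptions: `find_table_runs` and `_is_marker` are hypothetical names, pages carry their chunks already in page order, and a page filled by a single table fragment (which would need both hints on one chunk) is not handled because the text above does not say how such a chunk is tagged.

```python
from typing import Any, Dict, List


def _is_marker(chunk: Dict[str, Any], hint: str) -> bool:
    """True when the chunk is a table_marker carrying the given cross_page_hint."""
    return chunk.get("type") == "table_marker" and chunk.get("cross_page_hint") == hint


def find_table_runs(pages: List[Dict[str, Any]]) -> List[List[int]]:
    """Return runs of consecutive page numbers whose boundary chunks form one table."""
    pages = sorted(pages, key=lambda p: p["page_number"])
    runs: List[List[int]] = []
    current: List[int] = []
    for left, right in zip(pages, pages[1:]):
        linked = (
            right["page_number"] == left["page_number"] + 1
            and bool(left["chunks"])
            and bool(right["chunks"])
            and _is_marker(left["chunks"][-1], "continues_to_next")
            and _is_marker(right["chunks"][0], "continues_from_prev")
        )
        if linked:
            if not current:
                current = [left["page_number"]]
            current.append(right["page_number"])
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs
```

Each returned run would correspond to one `table-stitcher` invocation and one `TBL-<NNN>` id.
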

6. **Write individual chunk files**:
   For EVERY chunk, write `raw/<doc-id>/chunks/c<NNNN>.md`:

```
---
chunk_id: c<NNNN>
type: <type>
page: <N>
order_in_page: <N>
order_global: <N>
bbox: {x: 0.00, y: 0.00, w: 0.00, h: 0.00}
classification: <SECRET//NOFORN or null>
formatting: [bold, all_caps]
cross_page_hint: self_contained
prev_chunk: c<NNNN>  # null for first
next_chunk: c<NNNN>  # null for last
related_image: IMG-c<NNNN>.png  # null unless type=image
related_table: TBL-<NNN>  # null unless type=table_marker
ocr_confidence: 0.95
ocr_source_lines: [4, 5, 6]
redaction_code: null
redaction_inferred_content_type: null
image_type: null
ufo_anomaly_detected: false
cryptid_anomaly_detected: false
ufo_anomaly_type: null
ufo_anomaly_rationale: null
cryptid_anomaly_type: null
cryptid_anomaly_rationale: null
image_description_en: null
image_description_pt_br: null
extracted_text: null
source_png: ../../processing/png/<doc-id>/p-NNN.png
---

**EN:** {content_en}

**PT-BR:** {content_pt_br}
```

   - All boolean metadata fields are written explicitly (false/null are valid).
   - Keep the YAML clean: do not include keys with empty objects; null is fine.

7. **Write `_index.json`** at `raw/<doc-id>/_index.json`:

```json
{
  "doc_id": "<id>",
  "schema_version": "0.2.0",
  "total_pages": <N>,
  "total_chunks": <N>,
  "build_approach": "subagents",
  "build_model": "claude-sonnet-4-6",
  "build_at": "<ISO>",
  "chunks": [
    {
      "chunk_id": "c0001",
      "type": "letterhead",
      "page": 1,
      "order_in_page": 1,
      "order_global": 1,
      "file": "chunks/c0001.md",
      "bbox": {"x": 0.1, "y": 0.05, "w": 0.8, "h": 0.06},
      "preview": "first 80 chars of content_en"
    }
  ]
}
```

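One way to derive a `chunks[]` entry of `_index.json` from a full chunk record is sketched here. The helper name `index_entry` is hypothetical, and the sketch assumes the chunk dict already holds `content_en` and the numbering fields from step 3; whether the preview should also collapse whitespace is not specified above, so it is left as a plain 80-character slice.

```python
from typing import Any, Dict


def index_entry(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """Build one `chunks[]` entry of _index.json, in the key order of the template."""
    return {
        "chunk_id": chunk["chunk_id"],
        "type": chunk["type"],
        "page": chunk["page"],
        "order_in_page": chunk["order_in_page"],
        "order_global": chunk["order_global"],
        # Path is relative to raw/<doc-id>/, matching the output layout.
        "file": f"chunks/{chunk['chunk_id']}.md",
        "bbox": chunk["bbox"],
        # First 80 characters of the English content; empty if it is missing.
        "preview": (chunk.get("content_en") or "")[:80],
    }
```
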

8. **Assemble `document.md`** (human-readable, deterministic):
   Frontmatter:

```yaml
schema_version: "0.2.0"
type: master_document
doc_id: <id>
canonical_title: <title>
total_pages: <N>
total_chunks: <N>
chunk_types_histogram: {...}
multi_page_tables: [TBL-001, ...]
ufo_anomalies_flagged: [c0023, c0027]
cryptid_anomalies_flagged: []
build_approach: "subagents"
build_model: claude-sonnet-4-6
build_at: <ISO>
```

   Body, for each page:

````
## Page N

<!-- chunk:c0001 src:./chunks/c0001.md -->
<a id="c0001"></a>
### Chunk c0001 — letterhead · p1 · bbox: 0.10/0.05/0.80/0.06

**EN:** {content_en}

**PT-BR:** {content_pt_br}

<details><summary>metadata</summary>

```json
{full chunk metadata as JSON}
```

</details>

---
````

   For `image` chunks, ALSO embed the cropped image (`images/IMG-<chunk_id>.png`) and include the `image-analyst` description.
   For `table_marker` chunks with `stitched_table`, render an HTML `<table>`.

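The summary fields in the `document.md` frontmatter can be computed directly from the numbered chunk list. The sketch below is one possible way to do it; `summary_fields` is a hypothetical name, and it assumes each chunk dict carries `type`, `chunk_id`, and the two anomaly booleans.

```python
from collections import Counter
from typing import Any, Dict, List


def summary_fields(chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Derive histogram and anomaly lists for the document.md frontmatter."""
    return {
        # Count of chunks per type, e.g. {"letterhead": 1, "image": 2}.
        "chunk_types_histogram": dict(Counter(c["type"] for c in chunks)),
        # Chunk ids in global order, so the lists are deterministic.
        "ufo_anomalies_flagged": [
            c["chunk_id"] for c in chunks if c.get("ufo_anomaly_detected")
        ],
        "cryptid_anomalies_flagged": [
            c["chunk_id"] for c in chunks if c.get("cryptid_anomaly_detected")
        ],
    }
```
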

9. **Final stats line** to stdout:

```
STATS pages=<N> chunks=<N> images=<N> tables=<N> ufo=<N> cryptid=<N> doc_md_bytes=<N>
```

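A sketch of how the stats line could be produced follows. The name `stats_line` is hypothetical, and two interpretations are assumptions rather than stated rules: `tables` is taken to be the number of stitched `TBL-<NNN>` ids, and `doc_md_bytes` is taken to be the UTF-8 byte length of the assembled `document.md`.

```python
from typing import Any, Dict, List


def stats_line(
    chunks: List[Dict[str, Any]],
    total_pages: int,
    table_ids: List[str],
    document_md: str,
) -> str:
    """Format the final STATS line from the assembled document state."""
    images = sum(1 for c in chunks if c.get("type") == "image")
    ufo = sum(1 for c in chunks if c.get("ufo_anomaly_detected"))
    cryptid = sum(1 for c in chunks if c.get("cryptid_anomaly_detected"))
    # Byte length, not character count: accented pt-br characters take 2 bytes.
    doc_bytes = len(document_md.encode("utf-8"))
    return (
        f"STATS pages={total_pages} chunks={len(chunks)} images={images} "
        f"tables={len(table_ids)} ufo={ufo} cryptid={cryptid} doc_md_bytes={doc_bytes}"
    )
```
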

## Performance

- Page-rebuilders: parallel batches of 5 (don't exceed 10 concurrent Task spawns).
- After the page-rebuilders complete, run image-analysts in parallel batches of 5.
- Crop ALL images first via Bash, THEN spawn image-analysts (they need the cropped file on disk).

## Bilingual policy

- Use Brazilian Portuguese (pt-br), NOT European Portuguese
- Preserve UTF-8 accents: ç, ã, é, í, ó, ú, â, ê, ô, à
- Verbatim quotes stay in the source language

## NEVER:

- Fabricate redacted content
- Skip a chunk (lossy reconstruction is unacceptable)
- Use chunk types outside the enum defined in page-rebuilder
- Mix multi-page table fragments without invoking table-stitcher
- Output explanatory prose in the final document.md (it's the reconstructed document, not a report)
- Write only document.md without chunks/ and _index.json; both are required for the harness round trip