Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page extraction.

Add synthesize scripts to regenerate wiki/entities from the 116 _reextract.json (30), aggregate missing page.md from chunks (31), and reprocess 805 pages the doc-rebuilder agent dropped on context overflow (32).

Add maintain scripts 43-56 for chunk-page sync, dedup, generic-entity marking, and typed relation extraction.

Web: wire relations API + entity-relations component; entity/timeline/doc pages consume the rebuilt layer.

Note: raw/, processing/, wiki/ remain gitignored (bulk data managed separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on disk only. The 27 curated anchor events under wiki/entities/events/ are preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
schema_version: "0.1.0"
type: gap
gap_id: "G-0001"
canonical_title: "Pages 1–6 of DOW-UAP-D54 are bit-for-bit identical (same SHA-256)"
gap_class: unexplained-redaction

description: |
  Pages 1, 2, 3, 4, 5, and 6 of the PDF
  `DOW-UAP-D54-Mission-Report-Mediterranean-Sea-NA.pdf` were converted to PNG
  at 200 DPI and produced six IDENTICAL files, all with SHA-256
  `29030fd640030926c9e98e94f73a3fbc88cb9ac6739778b012eba120084ed1b7`.

  Visually, all six pages show the same image: solid black background (full-page
  redaction) with only the string "1.4(a)" in red at the top-left corner. Text
  extraction (`pdftotext -layout`) of each page likewise yields only the text
  "1.4(a)".

  The most plausible hypothesis is that during the release process, six originally
  distinct pages (likely six different blocks of classified content) were ALL
  replaced by a single redaction template image, instead of each page receiving
  its own redaction overlay that preserves the layout of the redacted regions.

description_pt_br: |
  As páginas 1, 2, 3, 4, 5 e 6 do PDF
  `DOW-UAP-D54-Mission-Report-Mediterranean-Sea-NA.pdf` foram convertidas em PNG
  a 200 DPI e produziram seis arquivos IDÊNTICOS, todos com SHA-256
  `29030fd640030926c9e98e94f73a3fbc88cb9ac6739778b012eba120084ed1b7`.

  Visualmente, as seis páginas mostram a mesma imagem: fundo preto sólido (redação
  de página inteira) com apenas a string "1.4(a)" em vermelho no canto superior
  esquerdo. A extração de texto (`pdftotext -layout`) de cada página também produz
  apenas o texto "1.4(a)".

  A hipótese mais plausível é que durante o processo de liberação, seis páginas
  originalmente distintas (provavelmente seis blocos diferentes de conteúdo
  classificado) foram TODAS substituídas por uma única imagem-template de redação,
  em vez de cada página receber sua própria sobreposição de redação que preserve
  o layout das regiões redatadas.

detected_in:
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p001]]"
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p002]]"
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p003]]"
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p004]]"
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p005]]"
  - "[[dow-uap-d54-mission-report-mediterranean-sea-na/p006]]"
detected_by: archivist
detected_at: "2026-05-13T08:50:00Z"

severity: medium
investigative_impact: |
  Substitution by an identical template erases any residual visual structure
  (margins, headers, paragraph spacing, partial signature blocks) that might
  permit inference about the redacted content. This compromises forensic
  analysis of redaction patterns. For other documents in the corpus with
  partial redactions, it is possible to roughly infer the size/position of
  removed text; here that is impossible.
investigative_impact_pt_br: |
  A substituição por template idêntico apaga qualquer estrutura visual residual
  (margens, cabeçalhos, espaçamento de parágrafos, blocos parciais de assinatura)
  que poderia permitir inferência sobre o conteúdo redatado. Compromete a análise
  forense de padrões de redação. Para outros documentos do corpus com redações
  parciais é possível inferir aproximadamente o tamanho/posição do texto
  removido; aqui isso é impossível.

possible_explanations:
  - { explanation: "Redaction template applied in bulk for all SECRET-classified pages", confidence_band: medium }
  - { explanation: "Bug in release/redaction software that duplicated a single page image", confidence_band: low }
  - { explanation: "Deliberate decision to standardize the appearance of fully redacted pages", confidence_band: medium }

recommended_actions:
  - "Cross-check other documents in the DOW-UAP-D series to see if the pattern repeats"
  - "Compare PDF metadata (xref, font subsetting, image XObject IDs) between the 6 pages"
  - "Check whether other corpus PDFs with all-redacted pages share the same SHA-256 digest"

related_gaps: ["[[gap/G-0002]]"]
wiki_version: "0.1.0"
---

# Gap G-0001 — Identical pages in DOW-UAP-D54

Anomaly detected via identical SHA-256 digests across the 6 PNGs rendered from 6 distinct pages of the original PDF. See `description` / `description_pt_br` in the frontmatter.
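
The duplicate check behind this gap can be reproduced with a short script. The sketch below is illustrative only and is not the archivist's actual tooling: it assumes the pages have already been rendered to one PNG per page in a single directory (for example with `pdftoppm -r 200 -png`), and the function names `sha256_of` and `find_identical_pages` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_identical_pages(png_dir: Path) -> dict[str, list[str]]:
    """Group the PNGs in png_dir by digest; keep only digests shared by 2+ files.

    Assumption: one PNG per page, named so that lexical order is page order.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for png in sorted(png_dir.glob("*.png")):
        groups[sha256_of(png)].append(png.name)
    return {h: names for h, names in groups.items() if len(names) > 1}
```

Running `find_identical_pages` over the rendered pages of each corpus PDF would also cover the third recommended action: any digest that appears in more than one document's result marks a redaction image shared across documents.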