disclosure-bureau/case/gaps/G-0001.md
Luiz Gustavo a7e9dce6d2 rebuild entity layer from Sonnet-vision reextract pipeline
Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity
JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page
extraction. Add synthesize scripts to regenerate wiki/entities from the 116
_reextract.json (30), aggregate missing page.md from chunks (31), and reprocess
805 pages the doc-rebuilder agent dropped on context overflow (32). Add
maintain scripts 43-56 for chunk-page sync, dedup, generic-entity marking, and
typed relation extraction.

Web: wire relations API + entity-relations component; entity/timeline/doc
pages consume the rebuilt layer.

Note: raw/, processing/, wiki/ remain gitignored (bulk data managed
separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on
disk only. The 27 curated anchor events under wiki/entities/events/ are
preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 12:20:24 -03:00

4.1 KiB

---
schema_version: "0.1.0"
type: gap
gap_id: G-0001
canonical_title: "Pages 1-6 of DOW-UAP-D54 are bit-for-bit identical (same SHA-256)"
gap_class: unexplained-redaction
description: >
  Pages 1, 2, 3, 4, 5, and 6 of the PDF
  `DOW-UAP-D54-Mission-Report-Mediterranean-Sea-NA.pdf` were converted to PNG
  at 200 DPI and produced six IDENTICAL files, all with SHA-256
  `29030fd640030926c9e98e94f73a3fbc88cb9ac6739778b012eba120084ed1b7`.
  Visually, all six pages show the same image: a solid black background
  (full-page redaction) with only the string "1.4(a)" in red at the top-left
  corner. Text extraction (`pdftotext -layout`) of each page also yields only
  the text "1.4(a)". The most plausible hypothesis is that, during the release
  process, six originally distinct pages (likely six different blocks of
  classified content) were ALL replaced by a single redaction template image,
  instead of each page receiving its own redaction overlay that preserved the
  structure underneath.
description_pt_br: >
  As páginas 1, 2, 3, 4, 5 e 6 do PDF
  `DOW-UAP-D54-Mission-Report-Mediterranean-Sea-NA.pdf` foram convertidas em
  PNG @ 200 DPI e produziram seis arquivos IDÊNTICOS, todos com SHA-256
  `29030fd640030926c9e98e94f73a3fbc88cb9ac6739778b012eba120084ed1b7`.
  Visualmente, as seis páginas mostram a mesma imagem: fundo preto sólido
  (redação de página inteira) com apenas a string "1.4(a)" em vermelho no
  canto superior esquerdo. A extração de texto (`pdftotext -layout`) de cada
  página também produz apenas o texto "1.4(a)". A hipótese mais plausível é
  que, durante o processo de liberação, seis páginas originalmente distintas
  (provavelmente seis blocos diferentes de conteúdo classificado) foram TODAS
  substituídas por uma única imagem-template de redação, em vez de cada página
  ter sua própria sobreposição preservando a estrutura sub-redatada.
detected_in:
  - dow-uap-d54-mission-report-mediterranean-sea-na/p001
  - dow-uap-d54-mission-report-mediterranean-sea-na/p002
  - dow-uap-d54-mission-report-mediterranean-sea-na/p003
  - dow-uap-d54-mission-report-mediterranean-sea-na/p004
  - dow-uap-d54-mission-report-mediterranean-sea-na/p005
  - dow-uap-d54-mission-report-mediterranean-sea-na/p006
detected_by: archivist
detected_at: 2026-05-13T08:50:00Z
severity: medium
investigative_impact: >
  Substitution by an identical template erases any residual visual structure
  (margins, headers, paragraph spacing, partial signature blocks) that might
  permit inference about the redacted content. This compromises forensic
  analysis of redaction patterns. In other corpus documents with partial
  redactions, the size and position of the removed text can be roughly
  inferred; here that is impossible.
investigative_impact_pt_br: >
  A substituição por template idêntico apaga qualquer estrutura visual
  residual (margens, cabeçalhos, espaçamento de parágrafos, blocos parciais de
  assinatura) que poderia permitir inferência sobre o conteúdo redatado.
  Compromete a análise forense de padrões de redação. Para outros documentos
  do corpus com redações parciais é possível inferir aproximadamente o
  tamanho/posição do texto removido; aqui isso é impossível.
possible_explanations:
  - explanation: Redaction template applied in bulk to all SECRET-classified pages
    confidence_band: medium
  - explanation: Bug in release/redaction software that duplicated a single page image
    confidence_band: low
  - explanation: Deliberate decision to standardize the appearance of fully redacted pages
    confidence_band: medium
recommended_actions:
  - Cross-check other documents in the DOW-UAP-D series to see if the pattern repeats
  - Compare PDF metadata (xref, font subsetting, image XObject ids) between the 6 pages
  - Check whether fully redacted pages in other corpus PDFs produce the same SHA-256 digest
related_gaps:
  - gap/G-0002
wiki_version: "0.1.0"
---

Gap G-0001 — Identical pages in DOW-UAP-D54

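Anomaly detected when the 6 PNGs rendered from 6 distinct pages of the original PDF all produced the same SHA-256 digest. This is a digest match between byte-identical files, not a hash collision between different inputs. See description / description_pt_br in frontmatter.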
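
The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the archivist's actual tooling: it assumes the page PNGs have already been rendered (the source does not say which renderer was used) and sit in a single directory, and the names `sha256_of_file` and `find_identical_pages` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_pages(png_dir: Path) -> dict[str, list[str]]:
    """Group page PNGs by digest and keep only digests shared by 2+ files.

    Assumption: every rendered page is a *.png file directly inside png_dir.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for png in sorted(png_dir.glob("*.png")):
        groups[sha256_of_file(png)].append(png.name)
    return {d: names for d, names in groups.items() if len(names) > 1}
```

Calling `find_identical_pages(Path("some/page/dir"))` on a directory of rendered pages would return one entry per group of byte-identical files. A gap like this one would show up as a single digest mapped to six page names. Running the same function over other documents' page directories is one way to approach the cross-check listed under recommended_actions.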