Commit graph

5 commits

Author SHA1 Message Date
Luiz Gustavo
2ac42b99a7 W5.5 (Phase 3C): Sun-Tzu strategist feeder + entity hero illustrations
Some checks failed
CI / Web — typecheck + lint + build (push) Failing after 33s
CI / Scripts — Python smoke (push) Failing after 5s
CI / Web — npm audit (push) Failing after 24s
CI / Retrieval — golden set (Recall@5 + MRR) (push) Failing after 3s
Sun-Tzu (silent backend) builds the strongest pro-anomaly brief the
corpus supports for any topic. Output is bilingual JSON: a thesis, 2-4
pillars (each with a claim and citation-backed support), and an honest
residual "unexplained" clause. It is NEVER surfaced to readers.

  Migration 0009 (apply as supabase_admin):
    public.pro_anomaly_briefs  brief_pk BIGSERIAL PK
                               brief_id B-NNNN unique
                               topic + topic_pt_br
                               thesis + thesis_pt_br
                               pillars JSONB
                               unexplained + unexplained_pt_br
                               doc_id, job_id, created_by, created_at
    + brief_id_seq sequence
    + GIN trigram indexes on topic + topic_pt_br
    + RLS policies (investigator INSERT, public SELECT)
    + GRANTs on seq + table to investigator

  prompts/sun-tzu.md
    "Adversarial strategist who plays the pro-disclosure side with the
    same rigour a red-team plays skeptic" — single thesis, 2-4 pillars,
    honest residual. Every claim cites a chunk. No fabrication from
    training-time knowledge. Output INTERNAL — case-writer pulls it.
    Bilingual mandatory. NO_STRONG_CASE sentinel when corpus is thin.

  detectives/sun_tzu.ts
    Grounds with hybridSearch top 18 chunks, calls Sonnet, parses
    JSON strict, calls writeProAnomalyBrief.

  tools/write_pro_anomaly_brief.ts
    Validates 2-4 pillars with bilingual claim+support, requires at
    least one [[wiki-link]] citation per pillar, INSERTs.
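
The pillar checks described for tools/write_pro_anomaly_brief.ts can be sketched roughly as follows. This is a Python illustration, not the shipped TypeScript: the field names (claim, claim_pt_br, support, support_pt_br) and the rule that a citation in either language version counts are assumptions, since the commit only states that pillars are bilingual and need at least one [[wiki-link]] each.

```python
import re

# Matches a [[wiki-link]] citation such as [[doc-id/p007]].
WIKI_LINK = re.compile(r"\[\[[^\[\]]+\]\]")

# Hypothetical field names; the commit only says each pillar carries a
# bilingual claim and support.
REQUIRED_FIELDS = ("claim", "claim_pt_br", "support", "support_pt_br")


def validate_pillars(pillars):
    """Return a list of error strings; an empty list means the brief may be inserted."""
    if not isinstance(pillars, list) or not 2 <= len(pillars) <= 4:
        return ["expected 2-4 pillars"]
    errors = []
    for i, pillar in enumerate(pillars):
        if not isinstance(pillar, dict):
            errors.append(f"pillar {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            value = pillar.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"pillar {i}: missing {field}")
        # Assumption: a citation in either language version of the support counts.
        support_text = f"{pillar.get('support', '')} {pillar.get('support_pt_br', '')}"
        if not WIKI_LINK.search(support_text):
            errors.append(f"pillar {i}: no [[wiki-link]] citation")
    return errors
```

Returning every error at once, rather than failing on the first, would let the caller log the full list before rejecting the model output.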

  orchestrator: new kind "anomaly_brief" dispatches Sun-Tzu.

Case-writer integration (detectives/case_writer.ts):
  - Pulls most recent matching brief via ILIKE on topic or doc_id.
  - Renders brief as a separate prompt section labelled
    "Strategic brief (internal — do NOT cite or attribute)".
  - Instructs the narrator to weave the thesis as a quiet through-
    line, use pillar facts in scenes, let the unexplained clause
    inform the closing paragraph. Forbidden to name "the analyst",
    say "a brief argues", or use the words "thesis"/"pillar"
    explicitly. Translate it into prose.

Entity hero illustrations:
  - 3 painterly editorial illustrations generated via Nano Banana
    Pro at 2K, stored under /data/disclosure/processing/case-art/:
    * EV-1947-06-24-kenneth-arnold-sighting.png — cockpit POV of
      Arnold in a CallAir A-2 over Mount Rainier, 9 chevron disc
      objects in formation, 1947 Life-magazine register.
    * EV-1947-07-08-roswell-incident.png — debris field in NM
      desert, USAAF officer in 1947 uniform examining foil
      fragments, period staff car.
    * EV-1947-06-21-maury-island-incident.png — wooden patrol
      boat on Puget Sound, 6 doughnut craft hovering, one
      shedding glowing slag, Harold Dahl + son + dog watching.
  - app/e/[cls]/[id]/page.tsx: full-bleed editorial hero replaces
    the old gradient header card when an illustration exists for
    that entity_id. Title sits over the painting with gradient
    overlay. "Ilustração editorial" chip in the top-right.

Quota note: Claude OAuth still rate-limited as of this commit, so
Sun-Tzu hasn't been smoke-tested in production. Code is shipped and
ready; first brief will land when the weekly quota refreshes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 16:41:20 -03:00
Luiz Gustavo
55cac8a395 W0+W1+W1.2: security hardening, observability, autocomplete, glitchtip, forgejo CI
Some checks failed
CI / Web — typecheck + lint + build (push) Failing after 1m30s
CI / Scripts — Python smoke (push) Failing after 32s
CI / Web — npm audit (push) Failing after 37s
W0 — security hardening (5 fixes verified live on disclosure.top)
- middleware: gate /api/admin/* same as /admin/* (F1)
- imgproxy: tighten LOCAL_FILESYSTEM_ROOT from / to /var/lib/storage (F2)
- studio: real basic-auth label (bcrypt hash, middleware reference) (F3)
- relations: ENABLE ROW LEVEL SECURITY + public SELECT policy (F4)
- migration 0003: fold is_searchable + hybrid_search update into canonical (TD#2)

W1 — observability + resilience + autocomplete
- studio: HOSTNAME=0.0.0.0 so Next.js binds on all interfaces, loopback included, for the healthcheck
- compose: PG_POOL_MAX=20, CLAUDE_CODE_OAUTH_TOKEN gated by separate env
- claude-code.ts: subprocess timeout configurable (CLAUDE_CODE_TIMEOUT_MS)
- openrouter.ts: retry with exponential backoff + Retry-After + in-memory
  circuit breaker (promotes FALLBACK after CB_THRESHOLD failures)
- lib/logger.ts: pino logger (NDJSON prod / pretty dev) + withRequest helper
- middleware: mints correlation_id, stamps x-correlation-id response header,
  emits structured http_request log per /api/* call
- messages/route.ts: switch to structured logger
- 60_meili_index.py: push documents + chunks into Meilisearch
- /api/search/autocomplete: parallel meili search (docs + chunks), 5-8ms p50
- search-autocomplete.tsx: debounced dropdown wired into search-panel
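
The openrouter.ts item above combines three ideas: exponential backoff, honouring Retry-After, and a circuit breaker that promotes the fallback. A minimal Python sketch of that combination follows. All names here (RetryableError, CircuitBreaker, call_with_retry, complete) are hypothetical, the real code is TypeScript, and the sketch assumes the breaker counts consecutive failures and never closes again on its own.

```python
import time


class RetryableError(Exception):
    """Raised by a call that may succeed if retried (for example HTTP 429 or 503)."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, taken from a Retry-After header if present


class CircuitBreaker:
    """Counts consecutive failures; reports open once the threshold is reached."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1


def call_with_retry(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RetryableError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RetryableError as err:
            if attempt == retries:
                raise
            # Prefer the server's Retry-After hint; otherwise double the delay each attempt.
            delay = err.retry_after if err.retry_after is not None else base_delay * 2 ** attempt
            sleep(delay)


def complete(primary, fallback, breaker, **retry_kwargs):
    """Use the primary model until the breaker opens, then route to the fallback."""
    if breaker.is_open:
        return fallback()
    try:
        result = call_with_retry(primary, **retry_kwargs)
    except RetryableError:
        breaker.record(False)
        raise
    breaker.record(True)
    return result
```

Passing sleep as a parameter keeps the backoff testable without real waiting. A production breaker would normally also need a cool-down after which the primary is tried again.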

W1.2 — Glitchtip + Forgejo self-hosted
- compose: glitchtip-redis + glitchtip-web + glitchtip-worker (v4.2)
- compose: forgejo + forgejo-runner (server v9, runner v6) with group_add=988
- @sentry/nextjs SDK wired (instrumentation.ts + sentry.{client,server}.config.ts)
- /api/admin/throw smoke endpoint (gated by W0-F1 middleware)
- Synthetic event ingestion verified at glitchtip.disclosure.top
- forgejo.disclosure.top up, repo discadmin/disclosure-bureau created,
  runner registered (labels: ubuntu-latest, docker)
- .forgejo/workflows/ci.yml: typecheck + lint + build + npm audit + python
  syntax + compose validation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:18:42 -03:00
Luiz Gustavo
a7e9dce6d2 rebuild entity layer from Sonnet-vision reextract pipeline
Add reextract pipeline (scripts/reextract/) that rebuilds doc-level entity
JSON from Sonnet-vision chunks via Opus, replacing the noisy per-page
extraction. Add synthesize scripts 30-32: 30 regenerates wiki/entities from
the 116 _reextract.json files, 31 aggregates missing page.md from chunks, and
32 reprocesses the 805 pages the doc-rebuilder agent dropped on context
overflow. Add maintain scripts 43-56 for chunk-page sync, dedup,
generic-entity marking, and typed relation extraction.

Web: wire relations API + entity-relations component; entity/timeline/doc
pages consume the rebuilt layer.

Note: raw/, processing/, wiki/ remain gitignored (bulk data managed
separately); the 116 reextract JSONs and 7,798 rebuilt entity files live on
disk only. The 27 curated anchor events under wiki/entities/events/ are
preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 12:20:24 -03:00
guto
291748df63 sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" (0 mentions) for thousands of entities that
were anchored in pages or in cross-entity links the DB never indexed.

Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.

scripts/maintain/42_sync_entity_stats.py walks every entity and writes:

  mentioned_in:        [...]        # consolidated page refs
  total_mentions:      max(db, pages)
  documents_count:     max(db_docs, distinct page docs)
  signal_sources:
    db_chunks:         int
    page_refs:         int
    cross_refs:        int
  signal_strength:     strong | weak | orphan | unverified
  referenced_by:       [[class/id]]  # cross-entity backlinks

Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
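
As a rough illustration of the merge described above, the per-entity fields could be computed like this. The function name, its parameters, and the "doc-id/pNNN" page-ref format are hypothetical, and counting cross_refs as the union of incoming and outgoing links is an assumption. The sketch leaves signal_strength out because the commit does not state the thresholds.

```python
def merge_entity_stats(db_chunks, db_docs, page_refs, incoming_links, outgoing_links):
    """Collapse the DB and page signals for one entity into frontmatter fields.

    page_refs is a list of "doc-id/pNNN" strings (an assumed ref format).
    """
    refs = sorted(set(page_refs))
    page_docs = {ref.split("/", 1)[0] for ref in refs}
    # Outgoing wikilinks count toward the entity's own cross_refs.
    cross_refs = len(set(incoming_links) | set(outgoing_links))
    return {
        "mentioned_in": refs,
        "total_mentions": max(db_chunks, len(refs)),
        "documents_count": max(db_docs, len(page_docs)),
        "signal_sources": {
            "db_chunks": db_chunks,
            "page_refs": len(refs),
            "cross_refs": cross_refs,
        },
        "referenced_by": sorted(set(incoming_links)),
    }
```

Taking the maximum of the two counts, rather than their sum, avoids double-counting a mention that both the DB and the page blocks recorded.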

OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.

Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
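
The conservative junk test could look something like the sketch below. The function name is hypothetical, and treating digits joined by spaces, dots, commas, or hyphens as "pure-numeric" is an assumption; the commit only names single/double-char names and pure-numeric noise.

```python
def is_junk_name(name):
    """True for names the conservative mode would archive."""
    stripped = name.strip()
    # Single- or double-character names (and empty ones) are junk.
    if len(stripped) <= 2:
        return True
    # Pure-numeric noise: only digits remain once the assumed separators are removed.
    digits_only = stripped
    for sep in (" ", ".", ",", "-"):
        digits_only = digits_only.replace(sep, "")
    return digits_only.isdigit()
```

Anything that fails this test stays on disk, which matches the stated intent of leaving borderline entities for later review.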

web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.

DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.

First-pass result: 9020 strong / 14520 weak / 10814 orphan;
OBJ-EV1945-PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 19:49:31 -03:00
guto
19d0678e55 baseline: Disclosure Bureau pipeline + Next.js UI + Supabase stack
2026-05-17 22:44:36 -03:00