2026-05-18 01:44:36 +00:00
|
|
|
/**
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
* Entity page data — SINGLE SOURCE OF TRUTH is the YAML frontmatter on disk.
|
|
|
|
|
*
|
|
|
|
|
* Why YAML and not the DB? Because the corpus has TWO independent extraction
|
|
|
|
|
* layers (Haiku page-level, Sonnet chunk-level) and each catches a different
|
|
|
|
|
* subset of entities. The DB's entity_mentions table is one of those signals —
|
|
|
|
|
* useful for chat retrieval but incomplete for the entity catalog itself.
|
|
|
|
|
*
|
|
|
|
|
* Reading from disk lets us merge every signal into one stat (`total_mentions`)
|
|
|
|
|
* via the maintain/42_sync_entity_stats.py pipeline and serve consistent
|
|
|
|
|
* numbers everywhere in the UI.
|
|
|
|
|
*
|
|
|
|
|
* The DB is still queried for ONE thing: the actual chunk text for previews,
|
|
|
|
|
* because we don't want to re-parse 21k chunk files on every page render.
|
2026-05-18 01:44:36 +00:00
|
|
|
*/
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
import fs from "node:fs/promises";
|
|
|
|
|
import path from "node:path";
|
|
|
|
|
import matter from "gray-matter";
|
2026-05-18 01:44:36 +00:00
|
|
|
import { pgQuery } from "./db";
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
import { WIKI } from "@/lib/wiki";
|
|
|
|
|
|
|
|
|
|
const FOLDER_BY_CLASS: Record<string, string> = {
|
|
|
|
|
person: "people",
|
|
|
|
|
organization: "organizations",
|
|
|
|
|
location: "locations",
|
|
|
|
|
event: "events",
|
|
|
|
|
uap_object: "uap-objects",
|
|
|
|
|
vehicle: "vehicles",
|
|
|
|
|
operation: "operations",
|
|
|
|
|
concept: "concepts",
|
|
|
|
|
};
|
2026-05-18 01:44:36 +00:00
|
|
|
|
|
|
|
|
export interface EntityCore {
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
entity_pk: number | null; // db-side primary key; null if entity is wiki-only
|
2026-05-18 01:44:36 +00:00
|
|
|
entity_class: string;
|
|
|
|
|
entity_id: string;
|
|
|
|
|
canonical_name: string;
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
aliases: string[];
|
2026-05-18 01:44:36 +00:00
|
|
|
total_mentions: number;
|
|
|
|
|
documents_count: number;
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
signal_strength: "strong" | "weak" | "orphan" | "unverified";
|
|
|
|
|
signal_sources: {
|
|
|
|
|
db_chunks: number;
|
|
|
|
|
page_refs: number;
|
|
|
|
|
cross_refs: number;
|
|
|
|
|
};
|
|
|
|
|
mentioned_in: string[]; // [[doc-id/p007]]
|
|
|
|
|
referenced_by: string[]; // [[class/id]] cross-links
|
2026-05-18 01:44:36 +00:00
|
|
|
enrichment_status: string | null;
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
narrative_summary: string | null;
|
|
|
|
|
narrative_summary_pt_br: string | null;
|
|
|
|
|
summary_status: string | null;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
interface RawFm {
|
|
|
|
|
[k: string]: unknown;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function num(v: unknown, fallback = 0): number {
|
|
|
|
|
if (typeof v === "number" && Number.isFinite(v)) return v;
|
|
|
|
|
if (typeof v === "string") {
|
|
|
|
|
const n = Number(v);
|
|
|
|
|
return Number.isFinite(n) ? n : fallback;
|
|
|
|
|
}
|
|
|
|
|
return fallback;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function arr(v: unknown): string[] {
|
|
|
|
|
if (!v) return [];
|
|
|
|
|
if (Array.isArray(v)) return v.filter((x) => typeof x === "string");
|
|
|
|
|
return [];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function strOrNull(v: unknown): string | null {
|
|
|
|
|
return typeof v === "string" && v.trim() ? v : null;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
async function readEntityYaml(entityClass: string, entityId: string): Promise<RawFm | null> {
|
|
|
|
|
const folder = FOLDER_BY_CLASS[entityClass];
|
|
|
|
|
if (!folder) return null;
|
|
|
|
|
const p = path.join(WIKI, "entities", folder, `${entityId}.md`);
|
|
|
|
|
try {
|
|
|
|
|
const raw = await fs.readFile(p, "utf-8");
|
|
|
|
|
return matter(raw).data as RawFm;
|
|
|
|
|
} catch {
|
|
|
|
|
return null;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Load a single entity card from its YAML. Returns null if archived or
|
|
|
|
|
* missing — keeps the route handler simple.
|
|
|
|
|
*/
|
|
|
|
|
export async function getEntityCore(
|
|
|
|
|
entityClass: string,
|
|
|
|
|
entityId: string,
|
|
|
|
|
): Promise<EntityCore | null> {
|
|
|
|
|
const fm = await readEntityYaml(entityClass, entityId);
|
|
|
|
|
if (!fm) return null;
|
|
|
|
|
|
|
|
|
|
// Best-effort lookup of the DB entity_pk so getEntityChunks can still
|
|
|
|
|
// query by primary key. Don't fail if the entity isn't in the DB at all.
|
|
|
|
|
let entity_pk: number | null = null;
|
|
|
|
|
try {
|
|
|
|
|
const rows = await pgQuery<{ entity_pk: number }>(
|
|
|
|
|
`SELECT entity_pk FROM public.entities
|
|
|
|
|
WHERE entity_class = $1 AND entity_id = $2 LIMIT 1`,
|
|
|
|
|
[entityClass, entityId],
|
|
|
|
|
);
|
|
|
|
|
entity_pk = rows[0]?.entity_pk ?? null;
|
|
|
|
|
} catch {
|
|
|
|
|
entity_pk = null;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
const sigSources = (fm.signal_sources as Record<string, unknown> | undefined) ?? {};
|
|
|
|
|
const strength = (typeof fm.signal_strength === "string"
|
|
|
|
|
? fm.signal_strength
|
|
|
|
|
: "unverified") as EntityCore["signal_strength"];
|
|
|
|
|
|
|
|
|
|
return {
|
|
|
|
|
entity_pk,
|
|
|
|
|
entity_class: entityClass,
|
|
|
|
|
entity_id: entityId,
|
|
|
|
|
canonical_name: typeof fm.canonical_name === "string" ? fm.canonical_name : entityId,
|
|
|
|
|
aliases: arr(fm.aliases),
|
|
|
|
|
total_mentions: num(fm.total_mentions, 0),
|
|
|
|
|
documents_count: num(fm.documents_count, 0),
|
|
|
|
|
signal_strength: ["strong", "weak", "orphan", "unverified"].includes(strength)
|
|
|
|
|
? strength
|
|
|
|
|
: "unverified",
|
|
|
|
|
signal_sources: {
|
|
|
|
|
db_chunks: num(sigSources.db_chunks, 0),
|
|
|
|
|
page_refs: num(sigSources.page_refs, 0),
|
|
|
|
|
cross_refs: num(sigSources.cross_refs, 0),
|
|
|
|
|
},
|
|
|
|
|
mentioned_in: arr(fm.mentioned_in),
|
|
|
|
|
referenced_by: arr(fm.referenced_by),
|
|
|
|
|
enrichment_status: strOrNull(fm.enrichment_status),
|
|
|
|
|
narrative_summary: strOrNull(fm.narrative_summary),
|
|
|
|
|
narrative_summary_pt_br: strOrNull(fm.narrative_summary_pt_br),
|
|
|
|
|
summary_status: strOrNull(fm.summary_status),
|
|
|
|
|
};
|
2026-05-18 01:44:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
export interface EntityMentionGroup {
|
|
|
|
|
doc_id: string;
|
|
|
|
|
canonical_title: string | null;
|
|
|
|
|
collection: string | null;
|
|
|
|
|
page_count: number | null;
|
|
|
|
|
classification: string | null;
|
|
|
|
|
mention_count: number;
|
|
|
|
|
pages: number[];
|
|
|
|
|
}
|
|
|
|
|
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
/**
|
|
|
|
|
* Group reverse-references by document. Derived from the YAML's mentioned_in[]
|
|
|
|
|
* (which the maintain script writes consolidating page YAMLs). Optionally
|
|
|
|
|
* enriches with document metadata read from wiki/documents/<doc-id>.md.
|
|
|
|
|
*/
|
|
|
|
|
export async function getEntityMentionsByDoc(
|
2026-05-18 01:44:36 +00:00
|
|
|
entityClass: string,
|
|
|
|
|
entityId: string,
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
limit = 100,
|
2026-05-18 01:44:36 +00:00
|
|
|
): Promise<EntityMentionGroup[]> {
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
const fm = await readEntityYaml(entityClass, entityId);
|
|
|
|
|
if (!fm) return [];
|
|
|
|
|
const refs = arr(fm.mentioned_in);
|
|
|
|
|
// Each ref looks like "[[doc-id/p007]]". Strip wikilink delimiters.
|
|
|
|
|
const byDoc = new Map<string, Set<number>>();
|
|
|
|
|
for (const ref of refs) {
|
|
|
|
|
const m = ref.match(/\[\[([^\]|]+?)\]\]/);
|
|
|
|
|
const target = (m ? m[1] : ref).trim();
|
|
|
|
|
const [docId, pageStr] = target.split("/", 2);
|
|
|
|
|
if (!docId) continue;
|
|
|
|
|
const pageNum = pageStr ? parseInt(pageStr.replace(/^p/, ""), 10) : NaN;
|
|
|
|
|
if (!byDoc.has(docId)) byDoc.set(docId, new Set());
|
|
|
|
|
if (Number.isFinite(pageNum)) byDoc.get(docId)!.add(pageNum);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Hydrate each doc's metadata from wiki/documents/<doc-id>.md
|
|
|
|
|
const groups: EntityMentionGroup[] = [];
|
|
|
|
|
for (const [docId, pages] of byDoc) {
|
|
|
|
|
let canonical_title: string | null = null;
|
|
|
|
|
let collection: string | null = null;
|
|
|
|
|
let page_count: number | null = null;
|
|
|
|
|
let classification: string | null = null;
|
|
|
|
|
try {
|
|
|
|
|
const docRaw = await fs.readFile(
|
|
|
|
|
path.join(WIKI, "documents", `${docId}.md`),
|
|
|
|
|
"utf-8",
|
|
|
|
|
);
|
|
|
|
|
const dfm = matter(docRaw).data as Record<string, unknown>;
|
|
|
|
|
canonical_title = strOrNull(dfm.canonical_title);
|
|
|
|
|
collection = strOrNull(dfm.collection);
|
|
|
|
|
page_count = num(dfm.page_count, 0) || null;
|
|
|
|
|
classification = strOrNull(dfm.highest_classification);
|
|
|
|
|
} catch {
|
|
|
|
|
/* doc missing — use raw id */
|
|
|
|
|
}
|
|
|
|
|
groups.push({
|
|
|
|
|
doc_id: docId,
|
|
|
|
|
canonical_title,
|
|
|
|
|
collection,
|
|
|
|
|
page_count,
|
|
|
|
|
classification,
|
|
|
|
|
mention_count: pages.size,
|
|
|
|
|
pages: Array.from(pages).sort((a, b) => a - b),
|
|
|
|
|
});
|
|
|
|
|
}
|
|
|
|
|
groups.sort((a, b) => b.mention_count - a.mention_count);
|
|
|
|
|
return groups.slice(0, limit);
|
2026-05-18 01:44:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
export interface EntityChunkPreview {
|
|
|
|
|
chunk_pk: number;
|
|
|
|
|
doc_id: string;
|
|
|
|
|
chunk_id: string;
|
|
|
|
|
page: number;
|
|
|
|
|
type: string;
|
|
|
|
|
bbox: { x: number; y: number; w: number; h: number } | null;
|
|
|
|
|
classification: string | null;
|
|
|
|
|
content_pt: string | null;
|
|
|
|
|
content_en: string | null;
|
|
|
|
|
ufo_anomaly: boolean | null;
|
|
|
|
|
ufo_anomaly_type: string | null;
|
|
|
|
|
}
|
|
|
|
|
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
/**
|
|
|
|
|
* Top chunks that textually mention this entity. Reads from DB because
|
|
|
|
|
* chunk content is big (we don't re-parse files at request time). Returns []
|
|
|
|
|
* if the entity isn't indexed in the DB.
|
|
|
|
|
*/
|
2026-05-18 01:44:36 +00:00
|
|
|
export async function getEntityChunks(
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
entityPk: number | null,
|
|
|
|
|
limit = 30,
|
2026-05-18 01:44:36 +00:00
|
|
|
): Promise<EntityChunkPreview[]> {
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
if (entityPk == null) return [];
|
2026-05-18 01:44:36 +00:00
|
|
|
return pgQuery<EntityChunkPreview>(
|
|
|
|
|
`SELECT
|
|
|
|
|
c.chunk_pk, c.doc_id, c.chunk_id, c.page, c.type, c.bbox, c.classification,
|
|
|
|
|
c.content_pt, c.content_en, c.ufo_anomaly, c.ufo_anomaly_type
|
|
|
|
|
FROM public.entity_mentions em
|
|
|
|
|
JOIN public.chunks c ON c.chunk_pk = em.chunk_pk
|
|
|
|
|
WHERE em.entity_pk = $1
|
|
|
|
|
ORDER BY c.ufo_anomaly DESC NULLS LAST, c.doc_id, c.order_global
|
|
|
|
|
LIMIT $2`,
|
|
|
|
|
[entityPk, limit],
|
|
|
|
|
);
|
|
|
|
|
}
|
|
|
|
|
|
sanitize entities: single YAML source of truth, signal_strength badge
The corpus had two parallel reverse-reference signals: the wiki/pages
entities_extracted blocks (Haiku page-level) and public.entity_mentions
(Sonnet chunk-level, ILIKE-matched). The entity page only consulted the
DB, so it showed "0 menções" for thousands of entities that were anchored
in pages or in cross-entity links the DB never indexed.
Resolved by collapsing all signals into the YAML frontmatter, which is
now the single runtime source for entity metadata.
scripts/maintain/42_sync_entity_stats.py walks every entity and writes:
mentioned_in: [...] # consolidated page refs
total_mentions: max(db, pages)
documents_count: max(db_docs, distinct page docs)
signal_sources:
db_chunks: int
page_refs: int
cross_refs: int
signal_strength: strong | weak | orphan | unverified
referenced_by: [[class/id]] # cross-entity backlinks
Outgoing wikilinks (e.g. OBJ.observed_in_event → EV) count toward the
entity's own cross_refs so anchored-but-not-mentioned entities don't
register as orphan.
OBJ canonical names like "7m long, 1.3m high, two rocket motors,
smooth flow, rotary drive null UAP (OBJ-EV1945-PEYERLSHOTDOWN-01)"
are rewritten to "Peyerl shot down UAP" derived from observed_in_event,
preserving the full description as an alias. --fix-obj-names did this
for every OBJ-* with >80 char canonical_name.
Default behaviour is conservative: --archive-only-junk archives only
single/double-char names and pure-numeric noise. Everything else stays
on disk with signal_strength marked, so the user can review later.
web/lib/retrieval/entity-pages.ts swapped from db-first to yaml-first.
The /e/[cls]/[id] page now reads counts straight from YAML and renders
a "força do sinal" badge with the per-source breakdown. Orphan entities
get a banner explaining they have no cross-references.
DB is still queried for ONE thing: the chunk text for preview cards on
the entity page, so we don't re-parse 21k markdown files on every render.
First-pass result: 9020 strong / 14520 weak / 10814 orphan; OBJ-EV1945-
PEYERLSHOTDOWN-01 now reads "Peyerl shot down UAP · fraca · 1 backlink"
in the live UI.
2026-05-18 22:49:31 +00:00
|
|
|
// Backwards-compat for callers that imported findEntity from the old path.
|
|
|
|
|
export { findEntity } from "./graph";
|