[Atlas] KG ↔ dataset cross-link driver (driver #30)
Task
- ID: f4f09ad5-2315-46fe-a526-fa6ab92dfe23
- Type: recurring
- Frequency: every-6h
- Layer: Atlas
Goal
Make versioned tabular datasets first-class citizens of the knowledge graph by
automatically linking every dataset row whose primary identifier matches a KG
node, and reverse-linking every KG node that has tabular data. This turns the
graph from a purely text-derived structure into a hybrid graph + dataframe
index that wiki pages and analyses can navigate both ways.
What it does
- For each registered dataset, inspects its primary identifier column (e.g.
ad_genetic_risk_loci.gene_symbol).
- For each dataset row, looks up a matching knowledge_graph node of the
corresponding type (gene, drug, pathway, ...):
  - Inserts a row in node_wiki_links (or knowledge_edges) connecting the KG
    node to the dataset row, with edge type has_data_in.
  - Reverse direction: for every KG node of that type that has a matching
    dataset row, inserts a data_in edge.
- Idempotent: skips pairs that already have an edge of the correct type.
- Releases as a no-op when no new matches exist.
- Emits agent_contributions (type=kg_dataset_link) per edge created.
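The steps above can be sketched as a single idempotent pass over one dataset. This is a minimal illustration, not the driver's actual code: the column layout of knowledge_graph, knowledge_edges and agent_contributions, the "dataset:row_pk" reference format, and the reading of data_in as the row-to-node direction are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: the spec names these tables but not their columns.
SCHEMA = """
CREATE TABLE knowledge_graph (id TEXT PRIMARY KEY, type TEXT, label TEXT);
CREATE TABLE ad_genetic_risk_loci (id INTEGER PRIMARY KEY, gene_symbol TEXT);
CREATE TABLE knowledge_edges (source_id TEXT, target_id TEXT, edge_type TEXT);
CREATE TABLE agent_contributions (type TEXT, source_id TEXT, target_id TEXT);
"""


def link_dataset(conn, dataset, id_column, node_type):
    """Link one dataset's rows to matching KG nodes and return run counters."""
    stats = {"rows_scanned": 0, "edges_created": 0, "duplicates_skipped": 0}
    # Table and column names are assumed to come from a trusted dataset registry.
    rows = conn.execute(f'SELECT id, "{id_column}" FROM "{dataset}"').fetchall()
    for row_pk, key in rows:
        stats["rows_scanned"] += 1
        # SQLite's default "=" on TEXT is case-sensitive, matching the spec's default.
        node = conn.execute(
            "SELECT id FROM knowledge_graph WHERE type = ? AND label = ?",
            (node_type, key),
        ).fetchone()
        if node is None:
            continue  # no KG node for this identifier, nothing to link
        node_ref, row_ref = node[0], f"{dataset}:{row_pk}"
        for src, dst, edge_type in (
            (node_ref, row_ref, "has_data_in"),  # KG node -> dataset row
            (row_ref, node_ref, "data_in"),      # assumed reverse direction
        ):
            exists = conn.execute(
                "SELECT 1 FROM knowledge_edges"
                " WHERE source_id = ? AND target_id = ? AND edge_type = ?",
                (src, dst, edge_type),
            ).fetchone()
            if exists:
                stats["duplicates_skipped"] += 1
                continue
            conn.execute(
                "INSERT INTO knowledge_edges VALUES (?, ?, ?)",
                (src, dst, edge_type),
            )
            # One contribution record per edge actually created.
            conn.execute(
                "INSERT INTO agent_contributions VALUES ('kg_dataset_link', ?, ?)",
                (src, dst),
            )
            stats["edges_created"] += 1
    conn.commit()
    return stats


def demo():
    """Run the pass twice on toy data to show that the second run is a no-op."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO knowledge_graph VALUES ('gene:APOE', 'gene', 'APOE')")
    conn.executemany(
        "INSERT INTO ad_genetic_risk_loci (gene_symbol) VALUES (?)",
        [("APOE",), ("NOT_IN_KG",)],
    )
    first = link_dataset(conn, "ad_genetic_risk_loci", "gene_symbol", "gene")
    second = link_dataset(conn, "ad_genetic_risk_loci", "gene_symbol", "gene")
    return first, second
```

Calling demo() runs the pass twice on the same data: the first run creates both edges for the matching row, and the second creates none and only counts skipped duplicates, which is the condition under which the driver would release as a no-op.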
Success criteria
- Every dataset row whose primary identifier matches a KG node has exactly
one has_data_in edge (verified by SQL audit).
- Every KG node with a matching dataset row has exactly one data_in edge.
- Wiki pages about a gene that has a matching dataset row start rendering a
"Data" section driven by the new edges.
- Run log: dataset rows scanned, KG nodes scanned, edges created, duplicates
skipped, retries.
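One possible shape for the SQL audit mentioned in the criteria is sketched below. It lists every matching (dataset row, KG node) pair whose has_data_in edge count is not exactly one, so an empty result means the first criterion holds. The table columns and the "dataset:row_pk" target format are assumptions for illustration; the same query with the source and target swapped and edge type data_in would cover the second criterion.

```python
import sqlite3

# Hypothetical schema: the spec names these tables but not their columns.
SCHEMA = """
CREATE TABLE knowledge_graph (id TEXT PRIMARY KEY, type TEXT, label TEXT);
CREATE TABLE ad_genetic_risk_loci (id INTEGER PRIMARY KEY, gene_symbol TEXT);
CREATE TABLE knowledge_edges (source_id TEXT, target_id TEXT, edge_type TEXT);
"""

# Pairs that match on the identifier but have zero or several has_data_in edges.
AUDIT_SQL = """
SELECT d.id AS row_pk, n.id AS node_pk, COUNT(e.edge_type) AS edge_count
FROM ad_genetic_risk_loci AS d
JOIN knowledge_graph AS n
  ON n.type = 'gene' AND n.label = d.gene_symbol
LEFT JOIN knowledge_edges AS e
  ON e.source_id = n.id
 AND e.target_id = 'ad_genetic_risk_loci:' || d.id
 AND e.edge_type = 'has_data_in'
GROUP BY d.id, n.id
HAVING COUNT(e.edge_type) <> 1
ORDER BY d.id
"""


def audit_has_data_in(conn):
    """Return (row_pk, node_pk, edge_count) for every pair violating 'exactly one'."""
    return conn.execute(AUDIT_SQL).fetchall()


def demo():
    """Audit toy data with one duplicated edge and one missing edge, then repair it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO knowledge_graph VALUES (?, 'gene', ?)",
        [("gene:APOE", "APOE"), ("gene:TREM2", "TREM2")],
    )
    conn.executemany(
        "INSERT INTO ad_genetic_risk_loci (gene_symbol) VALUES (?)",
        [("APOE",), ("TREM2",), ("UNMATCHED",)],
    )
    # APOE gets a duplicated edge; TREM2 gets none.
    conn.executemany(
        "INSERT INTO knowledge_edges VALUES (?, ?, 'has_data_in')",
        [("gene:APOE", "ad_genetic_risk_loci:1")] * 2,
    )
    before = audit_has_data_in(conn)
    # Repair: drop one duplicate and add the missing edge.
    conn.execute(
        "DELETE FROM knowledge_edges"
        " WHERE rowid = (SELECT MIN(rowid) FROM knowledge_edges)"
    )
    conn.execute(
        "INSERT INTO knowledge_edges"
        " VALUES ('gene:TREM2', 'ad_genetic_risk_loci:2', 'has_data_in')"
    )
    after = audit_has_data_in(conn)
    return before, after
```

Rows without a matching KG node (UNMATCHED in the toy data) never appear in the result, since the criteria only constrain rows whose identifier matches a node.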
Quality requirements
- No stubs: each edge must carry real metadata (edge type, dataset row PK,
KG node PK), with no empty or "unknown" edges. See the meta-quest
quest_quality_standards_spec.md.
- When processing ≥10 datasets per cycle (likely once multiple datasets
land), use 3–5 parallel agents (one per dataset) to fan out the matching.
- Log total items processed + retries so we can detect busywork (driver
repeatedly scanning the same dataset with no new edges → move to event-
driven trigger on dataset commit).
- Matching keys are case-sensitive by default; use the column's declared
normalisation (HGNC symbols for genes, RxCUI for drugs, etc.). INFERRED:
follow the dataset registry's identifier_type metadata.
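Two of the requirements above, per-identifier normalisation and the no-stub rule, might look roughly like this. The identifier_type values ("hgnc_symbol", "rxcui"), the edge field names, and the assumption that RxCUIs are digit strings are illustrative and not taken from the dataset registry.

```python
def _hgnc_symbol(value):
    # Trim whitespace only; case is preserved because matching is
    # case-sensitive by default and some gene symbols are mixed case.
    return value.strip()


def _rxcui(value):
    # Assumption for this sketch: an RxCUI is a string of digits.
    cleaned = value.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a valid RxCUI: {value!r}")
    return cleaned


# Hypothetical identifier_type values mapped to their normalisers.
NORMALISERS = {"hgnc_symbol": _hgnc_symbol, "rxcui": _rxcui}


def normalise_key(value, identifier_type):
    """Apply the declared normalisation, or return the key unchanged."""
    normaliser = NORMALISERS.get(identifier_type)
    return normaliser(value) if normaliser else value


def is_stub_edge(edge):
    """True if any required metadata field is missing, empty or 'unknown'."""
    for field in ("edge_type", "dataset_row_pk", "kg_node_pk"):
        value = edge.get(field)
        if value is None or str(value).strip().lower() in ("", "unknown"):
            return True
    return False
```

Keys with an undeclared identifier_type pass through untouched, so the case-sensitive default stays in force unless the registry says otherwise.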