[Atlas] KG ↔ dataset cross-link driver (driver #30)

Task

  • ID: f4f09ad5-2315-46fe-a526-fa6ab92dfe23
  • Type: recurring
  • Frequency: every-6h
  • Layer: Atlas

Goal

Make versioned tabular datasets first-class citizens of the knowledge graph by
automatically linking every dataset row whose primary identifier matches a KG
node, and reverse-linking every KG node that has tabular data. This turns the
graph from a purely text-derived structure into a hybrid graph + dataframe
index that wiki pages and analyses can navigate both ways.

What it does

  • For each registered dataset, inspects its primary identifier column
(e.g. ad_genetic_risk_loci.gene_symbol).
  • For each dataset row, looks up a matching knowledge_graph node of the
corresponding type (gene, drug, pathway, ...). When a match is found:
- Inserts a row in node_wiki_links (or knowledge_edges) connecting the
KG node to the dataset row, with edge type has_data_in.
  • Reverse direction: for every KG node of that type that has a matching
dataset row, inserts a data_in edge.
  • Idempotent: skips pairs that already have an edge of the correct type.
  • Releases as a no-op when no new matches exist.
  • Emits agent_contributions (type=kg_dataset_link) per edge created.

Success criteria

  • Every dataset row whose primary identifier matches a KG node has exactly
one has_data_in edge (verified by SQL audit).
  • Every KG node with a matching dataset row has exactly one data_in edge.
  • Wiki pages about a gene that has a matching dataset row start rendering a
"Data" section driven by the new edges.
  • Run log: dataset rows scanned, KG nodes scanned, edges created, duplicates
skipped, retries.
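
One way the SQL audit mentioned above might look is sketched below. It assumes a hypothetical layout in which kg_nodes has (id, type, name) and knowledge_edges has (node_id, dataset, row_pk, edge_type); neither layout is given in this spec, and audit_edges is an illustrative name. The query lists every matching (KG node, dataset row) pair whose count of has_data_in or data_in edges is anything other than one, so an empty result means both "exactly one edge" criteria hold.

```python
import sqlite3

# {dataset} and {id_column} are identifiers filled in from a trusted registry;
# :node_type and :dataset are bound as ordinary SQL parameters.
AUDIT_SQL = """
SELECT n.id AS node_id, d.rowid AS row_pk, t.edge_type,
       COUNT(e.node_id) AS n_edges
FROM "{dataset}" AS d
JOIN kg_nodes AS n
  ON n.type = :node_type AND n.name = d."{id_column}"
CROSS JOIN (SELECT 'has_data_in' AS edge_type
            UNION ALL SELECT 'data_in') AS t
LEFT JOIN knowledge_edges AS e
  ON e.node_id = n.id AND e.dataset = :dataset
 AND e.row_pk = d.rowid AND e.edge_type = t.edge_type
GROUP BY n.id, d.rowid, t.edge_type
HAVING COUNT(e.node_id) <> 1
"""

def audit_edges(conn, dataset, id_column, node_type):
    """Return (node_id, row_pk, edge_type, n_edges) for every violating pair."""
    sql = AUDIT_SQL.format(dataset=dataset, id_column=id_column)
    return conn.execute(
        sql, {"node_type": node_type, "dataset": dataset}
    ).fetchall()
```

A count of 0 in the result indicates a missing edge and a count above 1 indicates a duplicate, so the same query covers both failure modes.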

Quality requirements

  • No stubs: each edge must carry real metadata (edge type, dataset row PK,
KG node PK), with no empty or "unknown" edges. See the meta-quest
quest_quality_standards_spec.md.
  • When processing ≥10 datasets per cycle (likely once multiple datasets
land), fan out the matching across 3–5 parallel agents, each handling one
dataset at a time.
  • Log total items processed + retries so we can detect busywork: if the
driver repeatedly scans the same dataset with no new edges, move to an
event-driven trigger on dataset commit.
  • Matching keys are case-sensitive by default; use the column's declared
normalisation (HGNC symbols for genes, RxCUI for drugs, etc.). INFERRED:
follow the dataset registry's identifier_type metadata.
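
The fan-out and busywork requirements above could be sketched as below. This is an illustration under stated assumptions: threads stand in for "agents", link_one is a hypothetical per-dataset callable returning a dict of counters, and the four-cycle window in is_busywork is an arbitrary choice for the example, not a value from this spec. The threshold of 10 datasets and the cap of 5 workers are taken from the bullets above.

```python
from concurrent.futures import ThreadPoolExecutor

FAN_OUT_THRESHOLD = 10  # from the spec: fan out at >= 10 datasets per cycle
MAX_WORKERS = 5         # from the spec: 3-5 parallel agents

def run_cycle(datasets, link_one):
    """Run link_one(dataset) for every dataset and sum the counters.

    link_one is a hypothetical callable returning a dict with the keys
    "items", "edges_created" and "retries".
    """
    if len(datasets) >= FAN_OUT_THRESHOLD:
        # Each worker takes one dataset at a time from the shared queue.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            results = list(pool.map(link_one, datasets))
    else:
        results = [link_one(d) for d in datasets]
    totals = {"items": 0, "edges_created": 0, "retries": 0}
    for result in results:
        for key in totals:
            totals[key] += result[key]
    return totals

def is_busywork(recent_totals, window=4):
    """True when the last `window` cycles scanned items but created no edges.

    The window size is an assumption chosen for illustration.
    """
    tail = recent_totals[-window:]
    return len(tail) == window and all(
        t["items"] > 0 and t["edges_created"] == 0 for t in tail
    )
```

Feeding each cycle's totals into is_busywork gives a simple signal for when polling every 6 hours has stopped producing edges and an event-driven trigger on dataset commit is worth considering.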

File: f4f09ad5-2315-46fe-a526-fa6ab92dfe23_kg_dataset_crosslink_driver_spec.md
Modified: 2026-05-01 20:13
Size: 2.6 KB