[Atlas] KG ↔ dataset cross-link driver (driver #30)

Task

ID: f4f09ad5-2315-46fe-a526-fa6ab92dfe23
Type: recurring
Frequency: every-6h
Layer: Atlas

Goal

Make versioned tabular datasets first-class citizens of the knowledge graph by
automatically linking every dataset row whose primary identifier matches a KG
node, and reverse-linking every KG node that has tabular data. Turns the
graph from a purely text-derived structure into a hybrid graph + dataframe
index that wiki pages and analyses can navigate both ways.

What it does

For each registered dataset, inspects its primary identifier column
(e.g. ad_genetic_risk_loci.gene_symbol).

For each dataset row, looks up a matching knowledge_graph node of the
corresponding type (gene, drug, pathway, ...):
- Inserts a row in node_wiki_links (or knowledge_edges) connecting the
  KG node to the dataset row, with edge type has_data_in.

Reverse direction: for every KG node of that type that has a matching
dataset row, inserts a data_in edge.

Idempotent: skips pairs that already have an edge of the correct type.
Releases as a no-op when no new matches exist.
Emits agent_contributions (type=kg_dataset_link) per edge created.
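
The steps above can be sketched as a single idempotent pass. This is a minimal illustration, not the driver's actual implementation: the column layout of knowledge_graph, knowledge_edges and agent_contributions is hypothetical (only the table names and edge types come from this spec), and SQLite stands in for whatever store Atlas uses. Idempotency is assumed to come from a UNIQUE constraint plus INSERT OR IGNORE.

```python
import sqlite3

# Hypothetical schema: only the table names and edge types come from the spec.
SCHEMA = """
CREATE TABLE knowledge_graph (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE knowledge_edges (
    node_id INTEGER, dataset TEXT, row_pk INTEGER, edge_type TEXT,
    UNIQUE (node_id, dataset, row_pk, edge_type)  -- makes re-runs idempotent
);
CREATE TABLE agent_contributions (type TEXT, edge_ref TEXT);
CREATE TABLE ad_genetic_risk_loci (gene_symbol TEXT);
"""


def link_dataset(conn, dataset, id_column, node_type):
    """Create has_data_in / data_in edges for one dataset; return run counters."""
    stats = {"rows_scanned": 0, "edges_created": 0, "duplicates_skipped": 0}
    # Table and column names are assumed to come from a trusted dataset registry.
    rows = conn.execute(f'SELECT rowid, "{id_column}" FROM "{dataset}"').fetchall()
    for row_pk, identifier in rows:
        stats["rows_scanned"] += 1
        node = conn.execute(
            "SELECT id FROM knowledge_graph WHERE type = ? AND name = ?",
            (node_type, identifier),  # exact, case-sensitive match
        ).fetchone()
        if node is None:
            continue  # no KG node for this identifier
        for edge_type in ("has_data_in", "data_in"):
            cur = conn.execute(
                "INSERT OR IGNORE INTO knowledge_edges "
                "(node_id, dataset, row_pk, edge_type) VALUES (?, ?, ?, ?)",
                (node[0], dataset, row_pk, edge_type),
            )
            if cur.rowcount == 1:  # a new edge was actually written
                stats["edges_created"] += 1
                conn.execute(
                    "INSERT INTO agent_contributions (type, edge_ref) VALUES (?, ?)",
                    ("kg_dataset_link", f"{edge_type}:{node[0]}:{dataset}:{row_pk}"),
                )
            else:  # the edge already existed, so the insert was ignored
                stats["duplicates_skipped"] += 1
    conn.commit()
    return stats


def make_demo_db():
    """Build a tiny in-memory database with one matching and one unmatched row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO knowledge_graph (type, name) VALUES (?, ?)",
        [("gene", "APOE"), ("gene", "TREM2")],
    )
    conn.executemany(
        "INSERT INTO ad_genetic_risk_loci VALUES (?)", [("APOE",), ("BIN1",)]
    )
    return conn
```

Calling link_dataset twice on the demo database creates both edges for the one matching row on the first call and only counts skipped duplicates on the second, which is the no-op behaviour described above.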

Success criteria

Every dataset row whose primary identifier matches a KG node has exactly
one has_data_in edge (verified by SQL audit).

Every KG node with a matching dataset row has exactly one data_in edge.
Wiki pages about a gene that has a matching dataset row start rendering a
"Data" section driven by the new edges.

Run log: dataset rows scanned, KG nodes scanned, edges created, duplicates
skipped, retries.
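
One possible shape for the SQL audit mentioned in the criteria is sketched below. It lists every matching (dataset row, KG node) pair whose edge count for a given edge type is not exactly one, so an empty result means the criterion holds. The column names (node_id, dataset, row_pk, edge_type, name) are assumptions for illustration, and the query is hard-wired to the ad_genetic_risk_loci example.

```python
import sqlite3

# Hypothetical columns; only the table names and edge types come from the spec.
AUDIT_SQL = """
SELECT d.rowid AS row_pk, n.id AS node_id, COUNT(e.node_id) AS edge_count
FROM ad_genetic_risk_loci AS d
JOIN knowledge_graph AS n
  ON n.type = 'gene' AND n.name = d.gene_symbol
LEFT JOIN knowledge_edges AS e
  ON e.node_id = n.id
 AND e.row_pk = d.rowid
 AND e.dataset = 'ad_genetic_risk_loci'
 AND e.edge_type = ?
GROUP BY d.rowid, n.id
HAVING COUNT(e.node_id) <> 1
ORDER BY d.rowid
"""


def audit(conn, edge_type):
    """Return (row_pk, node_id, edge_count) for every pair violating 'exactly one'."""
    return conn.execute(AUDIT_SQL, (edge_type,)).fetchall()


def make_audit_db():
    """Demo data: one correct pair, one duplicated edge, one missing edge."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE knowledge_graph (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
        CREATE TABLE knowledge_edges (
            node_id INTEGER, dataset TEXT, row_pk INTEGER, edge_type TEXT);
        CREATE TABLE ad_genetic_risk_loci (gene_symbol TEXT);
        INSERT INTO knowledge_graph (id, type, name) VALUES
            (1, 'gene', 'APOE'), (2, 'gene', 'TREM2'), (3, 'gene', 'BIN1');
        INSERT INTO ad_genetic_risk_loci (gene_symbol) VALUES
            ('APOE'), ('TREM2'), ('BIN1'), ('CLU');
        INSERT INTO knowledge_edges VALUES
            (1, 'ad_genetic_risk_loci', 1, 'has_data_in'),
            (2, 'ad_genetic_risk_loci', 2, 'has_data_in'),
            (2, 'ad_genetic_risk_loci', 2, 'has_data_in');
    """)
    return conn
```

On the demo data, audit(conn, "has_data_in") reports the duplicated TREM2 pair and the missing BIN1 pair; the CLU row has no KG node and is therefore out of scope. The same query with "data_in" would cover the reverse-direction criterion.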

Quality requirements

No stubs: each edge must carry real metadata (edge type, dataset row PK,
KG node PK), with no empty or "unknown" edges. See meta-quest
quest_quality_standards_spec.md.

When processing ≥10 datasets per cycle (likely once multiple datasets
land), use 3–5 parallel agents (one per dataset) to fan out the matching.
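
The fan-out rule could look roughly like the following sketch. Threads stand in for "agents" here purely for illustration; how Atlas actually spawns agents is not described in this spec. The names run_cycle, clamp_workers and link_one are hypothetical, and link_one is whatever callable links a single dataset.

```python
from concurrent.futures import ThreadPoolExecutor

FAN_OUT_THRESHOLD = 10  # from the spec: fan out at >= 10 datasets per cycle
MIN_WORKERS, MAX_WORKERS = 3, 5  # from the spec: 3-5 parallel agents


def clamp_workers(requested):
    """Keep the requested worker count inside the 3-5 range."""
    return max(MIN_WORKERS, min(MAX_WORKERS, requested))


def run_cycle(datasets, link_one, workers=4):
    """Link every dataset, sequentially below the threshold, in parallel above it."""
    if len(datasets) < FAN_OUT_THRESHOLD:
        return {name: link_one(name) for name in datasets}
    with ThreadPoolExecutor(max_workers=clamp_workers(workers)) as pool:
        # Each task handles exactly one dataset; map() preserves input order.
        results = pool.map(link_one, datasets)
        return dict(zip(datasets, results))
```

Because each task owns one dataset, workers never compete for the same rows; any shared edge table would still need the idempotent insert to stay safe under concurrency.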

Log total items processed and retries so we can detect busywork: if the
driver repeatedly scans the same dataset with no new edges, move to an
event-driven trigger on dataset commit.
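
A busywork check over the run log might be sketched as follows. The RunLog fields mirror the run-log items listed under the success criteria; the class name, the is_busywork helper and the window of 4 cycles (one day at the every-6h frequency) are assumptions, not values set by this spec.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """One cycle's counters for one dataset (field names are hypothetical)."""
    dataset: str
    rows_scanned: int
    nodes_scanned: int
    edges_created: int
    duplicates_skipped: int
    retries: int


def is_busywork(history, window=4):
    """True if the last `window` cycles all scanned rows but created no edges."""
    recent = history[-window:]
    return len(recent) == window and all(
        run.rows_scanned > 0 and run.edges_created == 0 for run in recent
    )
```

A dataset flagged this way would be a candidate for the event-driven trigger rather than continued polling.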

Matching keys are case-sensitive by default; use the column's declared
normalisation (HGNC symbols for genes, RxCUI for drugs, etc.). INFERRED:
follow the dataset registry's identifier_type metadata.
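
As an illustration of dispatching on identifier_type, the sketch below keeps exact, case-sensitive matching as the default and applies a normaliser only when one is declared. The identifier_type values ("hgnc_symbol", "rxcui") and the normalisation rules themselves (whitespace trimming, digits-only RxCUI) are assumptions; the real rules would come from the dataset registry.

```python
def _exact(value):
    """Default: leave the key untouched, so matching stays case-sensitive."""
    return value


def _hgnc_symbol(value):
    """Assumed rule: trim whitespace only; case is preserved."""
    return value.strip()


def _rxcui(value):
    """Assumed rule: RxCUIs are numeric, so reject anything that is not digits."""
    trimmed = value.strip()
    return trimmed if trimmed.isdigit() else None


# Hypothetical identifier_type values mapped to their normalisers.
NORMALISERS = {"hgnc_symbol": _hgnc_symbol, "rxcui": _rxcui}


def match_key(value, identifier_type):
    """Return the key to match against the KG, or None if the value is unusable."""
    return NORMALISERS.get(identifier_type, _exact)(value)
```

For example, match_key(" APOE ", "hgnc_symbol") gives "APOE", while "apoe" stays "apoe" and so would not match an "APOE" node.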

File: f4f09ad5-2315-46fe-a526-fa6ab92dfe23_kg_dataset_crosslink_driver_spec.md

Modified: 2026-05-01 20:13

Size: 2.6 KB