[Atlas] Expand KG with PubMed-backed edges for top entities

Task ID: ea2c560f-7c63-4f93-a2c4-794d6404c034 | Layer: Atlas | Priority: P90

Goal

Add at least 2,000 new edges to the knowledge graph (KG) by extracting entity co-occurrences from PubMed abstracts, focusing on the top 50 entities by connectivity.

Implementation

  • Created enrich_kg_bulk.py — NLP-based co-occurrence extraction script
  • 53 primary entities (genes, proteins, diseases, cell types, processes) with 36 aliases
  • 122 secondary entities (additional genes, brain regions, diseases, processes) with aliases
  • Relation inference from sentence context using keyword matching (13 relation types)
  • Evidence tracked via PMID arrays per edge
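
The approach listed above can be illustrated with a minimal sketch. This is not the code in enrich_kg_bulk.py, which is not reproduced in this spec. The alias table, the keyword stems, and the function names (find_entities, infer_relation, extract_edges) are all hypothetical, and only four of the 13 relation types are shown. The sketch assumes sentence-level co-occurrence, falls back to associated_with when no keyword matches, and does not try to infer edge direction.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table: lowercase surface form -> canonical entity name.
ALIASES = {
    "app": "APP",
    "amyloid precursor protein": "APP",
    "bace1": "BACE1",
    "tau": "MAPT",
    "mapt": "MAPT",
    "microglia": "microglia",
}

# Hypothetical keyword stems -> relation type (assumed; first match wins).
RELATION_KEYWORDS = [
    ("inhibit", "inhibits"),
    ("activat", "activates"),
    ("regulat", "regulates"),
    ("mediat", "mediates"),
]
DEFAULT_RELATION = "associated_with"


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return the sorted canonical names of all aliases found as whole words."""
    lowered = sentence.lower()
    found = set()
    for alias, canonical in ALIASES.items():
        if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


def infer_relation(sentence):
    """Pick a relation type from the first keyword stem present in the sentence."""
    lowered = sentence.lower()
    for stem, relation in RELATION_KEYWORDS:
        if stem in lowered:
            return relation
    return DEFAULT_RELATION


def extract_edges(abstracts):
    """abstracts: iterable of (pmid, text) pairs.

    Returns {(source, relation, target): sorted list of PMIDs}, so repeated
    extractions of the same edge collapse into one entry with merged evidence.
    """
    evidence = defaultdict(set)
    for pmid, text in abstracts:
        for sentence in split_sentences(text):
            entities = find_entities(sentence)
            relation = infer_relation(sentence)
            # Every unordered pair of entities in the sentence becomes an edge.
            for source, target in combinations(entities, 2):
                evidence[(source, relation, target)].add(pmid)
    return {edge: sorted(pmids) for edge, pmids in evidence.items()}
```

Keying edges by (source, relation, target) is one simple way to get deduplication and per-edge PMID arrays at the same time. A real pipeline would likely need a better sentence splitter and some handling of negation, which plain keyword matching cannot detect.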

Results

  • Papers processed: 1,510
  • Raw edge extractions: 8,827
  • Unique edges (deduplicated): 3,706
  • New edges inserted: 2,066
  • Existing edges updated with new PMID evidence: ~1,640
  • Relation diversity: 13 types (associated_with, activates, inhibits, regulates, mediates, etc.)
  • KG total edges: 287,537 → 289,603 (+2,066)
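
The split between inserted and updated edges can be sketched as a merge step. This is an assumption-laden illustration: the real KG storage layer is not described in this spec, so a plain dict keyed by hypothetical (source, relation, target) tuples stands in for it, and merge_edges is an invented name.

```python
def merge_edges(kg, extracted):
    """Merge deduplicated edges into an in-memory stand-in for the KG.

    kg: dict mapping edge -> set of PMIDs (mutated in place).
    extracted: dict mapping edge -> iterable of PMIDs.
    Returns (inserted, matched): counts of brand-new edges and of edges
    that already existed and had the extracted PMIDs merged into them.
    """
    inserted = 0
    matched = 0
    for edge, pmids in extracted.items():
        if edge in kg:
            kg[edge].update(pmids)  # existing edge: add evidence only
            matched += 1
        else:
            kg[edge] = set(pmids)   # new edge: insert with its evidence
            inserted += 1
    return inserted, matched
```

Under this model the reported figures are consistent: 3,706 unique edges minus 2,066 inserted leaves 1,640 that matched existing edges, and 287,537 + 2,066 = 289,603.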

Work Log

  • 2026-04-02 T08:30 — Started, reviewed existing enrichment scripts
  • 2026-04-02 T08:45 — Created bulk enrichment script with NLP co-occurrence approach
  • 2026-04-02 T08:55 — Ran enrichment, added 2,066 new PubMed-backed edges

File: ea2c560f_atlas_kg_enrich_spec.md
Modified: 2026-05-01 20:13
Size: 1.3 KB