[Atlas] Expand KG with PubMed-backed edges for top entities
Task ID: ea2c560f-7c63-4f93-a2c4-794d6404c034
Layer: Atlas
Priority: P90
Goal
Add 2,000+ new edges to the knowledge graph by extracting entity co-occurrences from PubMed paper abstracts, focusing on the top 50 entities by connectivity.
Implementation
- Created enrich_kg_bulk.py, an NLP-based co-occurrence extraction script
- 53 primary entities (genes, proteins, diseases, cell types, processes) with 36 aliases
- 122 secondary entities (additional genes, brain regions, diseases, processes) with aliases
- Relation inference from sentence context using keyword matching (13 relation types)
- Evidence tracked via PMID arrays per edge
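The approach listed above (alias matching, sentence-level co-occurrence, keyword-based relation inference, PMID evidence per edge) can be sketched roughly as follows. This is a minimal illustration, not the contents of enrich_kg_bulk.py: the alias table, the keyword stems, and the function names `find_entities`, `infer_relation`, and `extract_edges` are all assumptions made for the example, and only four of the 13 relation types are shown.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical alias table (canonical name -> surface forms); illustrative only.
ALIASES = {
    "APOE": ["APOE", "apolipoprotein E"],
    "Alzheimer disease": ["Alzheimer disease", "Alzheimer's disease"],
    "microglia": ["microglia", "microglial"],
}

# Hypothetical keyword stems per relation type, checked in order.
RELATION_KEYWORDS = [
    ("inhibits", ("inhibit", "suppress", "block")),
    ("activates", ("activat", "induc", "stimulat")),
    ("regulates", ("regulat", "modulat")),
    ("mediates", ("mediat",)),
]


def find_entities(sentence):
    """Return the sorted canonical names of all entities mentioned in a sentence."""
    found = set()
    for canonical, forms in ALIASES.items():
        for form in forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.add(canonical)
                break
    return sorted(found)


def infer_relation(sentence):
    """Pick a relation type by keyword match; fall back to associated_with."""
    lowered = sentence.lower()
    for relation, stems in RELATION_KEYWORDS:
        if any(stem in lowered for stem in stems):
            return relation
    return "associated_with"


def extract_edges(papers):
    """Map (source, relation, target) -> sorted list of supporting PMIDs.

    `papers` is an iterable of (pmid, abstract) pairs. Pairs are ordered
    alphabetically, so edge direction is NOT inferred in this sketch.
    """
    evidence = defaultdict(set)
    for pmid, abstract in papers:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            entities = find_entities(sentence)
            if len(entities) < 2:
                continue
            relation = infer_relation(sentence)
            for source, target in combinations(entities, 2):
                evidence[(source, relation, target)].add(pmid)
    return {edge: sorted(pmids) for edge, pmids in evidence.items()}
```

Keying edges by (source, relation, target) is what makes deduplication cheap here: repeated extractions of the same triple collapse into one entry whose PMID set grows, which would be one way to go from raw extractions to unique edges.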
Results
- Papers processed: 1,510
- Raw edge extractions: 8,827
- Unique edges (deduplicated): 3,706
- New edges inserted: 2,066
- Existing edges updated with new PMID evidence: ~1,640
- Relation diversity: 13 types (associated_with, activates, inhibits, regulates, mediates, etc.)
- KG total: 287,537 → 289,603
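The split between inserted and updated edges can be illustrated with a small upsert sketch. It assumes an in-memory dict standing in for the KG store (the actual storage backend is not described here), and `upsert_edges` is a hypothetical helper name.

```python
def upsert_edges(kg, extracted):
    """Merge extracted edges into the KG and return (inserted, updated) counts.

    `kg` maps (source, relation, target) -> set of PMIDs (hypothetical layout).
    `extracted` maps the same keys -> iterable of PMIDs.
    """
    inserted = 0
    updated = 0
    for edge, pmids in extracted.items():
        if edge in kg:
            # Existing edge: only the PMID evidence grows.
            kg[edge] |= set(pmids)
            updated += 1
        else:
            # New edge: add it with its initial evidence.
            kg[edge] = set(pmids)
            inserted += 1
    return inserted, updated
```

The reported figures are consistent with this kind of split: 3,706 unique edges minus 2,066 inserted leaves 1,640 updated, and 287,537 + 2,066 = 289,603.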
Work Log
- 2026-04-02 T08:30 — Started, reviewed existing enrichment scripts
- 2026-04-02 T08:45 — Created bulk enrichment script with NLP co-occurrence approach
- 2026-04-02 T08:55 — Ran enrichment, added 2,066 new PubMed-backed edges