[Atlas] Expand KG — Add edges from PubMed abstracts for top entities
Task ID: e8ba526a-b61a-4bed-a2db-a48cf68d72d5
Status: In Progress
Priority: P90
Goal
Use NLP extraction to add 2000+ new KG edges from PubMed abstracts, focusing on top entities by connection count.
Approach
Identify top gene/protein entities that have many co-occurrence edges but comparatively few NLP-extracted edges
Fetch new PubMed papers for these entities (not already in DB)
Extract edges using sentence-level NLP patterns (relation classification)
Insert new edges with evidence (PMID + title) and edge_type tracking
Target: 2000+ new edges
Key Entities to Target
Top gene/protein entities: MTOR, AKT, TNF, PI3K, BDNF, APOE, APP, TAU, NLRP3, AMPK, SQSTM1, PINK1, TREM2, STAT3, NRF2, PARKIN, IL-6, LRRK2, SOD1, BACE1, SNCA, MAPT, GBA, FUS, TDP-43, PSEN1, PSEN2, CLU, BIN1, CD33
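For the entities listed above, the sentence-level extraction and evidence-tagged insertion described in the Approach might look roughly like the sketch below. This is not the code in expand_kg_pubmed_v10.py. The trigger-word patterns, the `kg_edges` table layout, the function names, and the placeholder PMID are all assumptions made for illustration; only the relation labels and the `pubmed_nlp_v10` edge type come from this task.

```python
import re
import sqlite3

# Hypothetical trigger words for a few of the relation labels in the work log.
RELATION_PATTERNS = {
    "activates": r"activat(?:es|ed)",
    "inhibits": r"inhibit(?:s|ed)",
    "phosphorylates": r"phosphorylat(?:es|ed)",
    "regulates": r"regulat(?:es|ed)",
}

# A subset of the target entities, for the demo only.
ENTITIES = ["MTOR", "AKT", "TNF", "BDNF", "APOE", "NLRP3", "AMPK", "TREM2", "STAT3"]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_edges(abstract, pmid, title, entities=ENTITIES):
    """Return (source, target, relation, pmid, title) tuples found in one abstract."""
    # Longest names first so that a longer symbol wins over a shorter prefix.
    entity_alt = "|".join(re.escape(e) for e in sorted(entities, key=len, reverse=True))
    edges = []
    for sentence in split_sentences(abstract):
        for relation, verb in RELATION_PATTERNS.items():
            # ENTITY, up to 3 filler words, trigger verb, up to 3 filler words, ENTITY
            pattern = (
                rf"\b({entity_alt})\b\s+(?:\w+\s+){{0,3}}?{verb}"
                rf"\s+(?:\w+\s+){{0,3}}?\b({entity_alt})\b"
            )
            for m in re.finditer(pattern, sentence):
                src, tgt = m.group(1), m.group(2)
                if src != tgt:
                    edges.append((src, tgt, relation, pmid, title))
    return edges


def insert_edges(conn, edges, edge_type="pubmed_nlp_v10"):
    """Insert edges with PMID + title evidence; return how many rows were new."""
    # Assumed schema, not the project's actual table definition.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS kg_edges (
               source TEXT, target TEXT, relation TEXT, edge_type TEXT,
               pmid TEXT, title TEXT,
               UNIQUE (source, target, relation, pmid))"""
    )
    before = conn.execute("SELECT COUNT(*) FROM kg_edges").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO kg_edges "
        "(source, target, relation, edge_type, pmid, title) VALUES (?, ?, ?, ?, ?, ?)",
        [(s, t, r, edge_type, p, ti) for s, t, r, p, ti in edges],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM kg_edges").fetchone()[0]
    return after - before


if __name__ == "__main__":
    demo_abstract = "AMPK directly inhibits MTOR in neurons. AKT phosphorylates STAT3."
    # "0000000" is a placeholder, not a real PMID.
    demo_edges = extract_edges(demo_abstract, "0000000", "Example title")
    demo_conn = sqlite3.connect(":memory:")
    print(insert_edges(demo_conn, demo_edges), "edges inserted")
```

The uniqueness constraint on (source, target, relation, pmid) makes re-runs idempotent, so the same sentence cannot inflate the edge count if a paper is processed twice.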
Work Log
- 2026-04-02: Started. Current KG: 668,364 edges, 6,095 papers with abstracts. Building extraction script targeting gene entities with high co-occurrence but lower NLP coverage.
- 2026-04-02: Completed. Built expand_kg_pubmed_v10.py. Fetched 2,121 new papers from PubMed across 257 queries (79 target genes x 3 templates + 20 cross-entity queries). Extracted and inserted 3,225 new edges (edge_type=pubmed_nlp_v10) and stored ~3,900 new papers. Total KG now ~679,768 edges, ~10,909 papers. Relation distribution: expressed_in (597), regulates (492), associated_with (442), activates (420), interacts_with (300), inhibits (227), causes (194), phosphorylates (173).
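The query count in the entry above follows from 79 x 3 + 20 = 257. A minimal sketch of how such a query grid could be built, and how already-stored papers could be skipped, is shown below. The template strings, function names, and placeholder PMIDs are hypothetical; the actual templates used by the script are not recorded in this log.

```python
# Hypothetical query templates; the real ones are not recorded in the work log.
TEMPLATES = [
    "{gene}[Title/Abstract] AND mechanism",
    "{gene}[Title/Abstract] AND signaling",
    "{gene}[Title/Abstract] AND pathway",
]


def build_queries(genes, templates, cross_pairs):
    """One query per (gene, template) plus one per cross-entity pair."""
    queries = [t.format(gene=g) for g in genes for t in templates]
    queries += [
        f"{a}[Title/Abstract] AND {b}[Title/Abstract]" for a, b in cross_pairs
    ]
    return queries


def count_queries(n_genes, n_templates, n_cross):
    """Size of the query grid: genes x templates + cross-entity queries."""
    return n_genes * n_templates + n_cross


def filter_new_pmids(fetched_pmids, known_pmids):
    """Keep PMIDs not already in the DB, preserving order and dropping repeats."""
    known = set(known_pmids)
    new = []
    for pmid in fetched_pmids:
        if pmid not in known:
            known.add(pmid)
            new.append(pmid)
    return new


if __name__ == "__main__":
    # Matches the figures in the log: 79 genes, 3 templates, 20 cross queries.
    print(count_queries(79, 3, 20))
    sample = build_queries(["MTOR", "AKT"], TEMPLATES, [("MTOR", "AKT")])
    for q in sample:
        print(q)
```

Filtering against known PMIDs before extraction keeps the "not already in DB" condition from the Approach, and it also removes duplicates that arise when several queries return the same paper.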