[Atlas] Expand KG — Add edges from PubMed abstracts for top entities


Task ID: e8ba526a-b61a-4bed-a2db-a48cf68d72d5
Status: In Progress
Priority: P90

Goal

Use NLP extraction to add 2000+ new KG edges from PubMed abstracts, focusing on top entities by connection count.

Approach

  • Identify top gene/protein entities that have high co-occurrence edges but lower NLP-extracted edges
  • Fetch new PubMed papers for these entities (not already in DB)
  • Extract edges using sentence-level NLP patterns (relation classification)
  • Insert new edges with evidence (PMID + title) and edge_type tracking
  • Target: 2000+ new edges
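
The extraction and insertion steps above can be sketched as follows. This is a minimal illustration, not the project's script: the trigger patterns, the `Edge` record, and the function names are assumptions, and only the relation names and the `pubmed_nlp_v10` edge type come from this spec. The sketch matches active-voice verbs only, so a passive sentence such as "X is activated by Y" is skipped rather than extracted with the wrong direction.

```python
import re
from dataclasses import dataclass

# Hypothetical trigger patterns (active voice only). The relation names mirror
# this spec; the real patterns used by the extraction script are not shown here.
RELATION_TRIGGERS = {
    "phosphorylates": r"\bphosphorylates\b",
    "activates": r"\bactivates\b",
    "inhibits": r"\binhibits\b",
    "regulates": r"\bregulates\b",
    "interacts_with": r"\binteracts\s+with\b",
}


@dataclass(frozen=True)
class Edge:
    """One KG edge plus the evidence it was extracted from (assumed layout)."""
    source: str
    target: str
    relation: str
    pmid: str
    title: str
    edge_type: str


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(fragment, entities):
    """Return the known entities that occur as whole tokens in the fragment."""
    return [e for e in entities if re.search(rf"\b{re.escape(e)}\b", fragment)]


def extract_edges(abstract, pmid, title, entities, edge_type="pubmed_nlp_v10"):
    """Emit an edge for each (entity, trigger, entity) pattern within a sentence."""
    edges = set()
    for sentence in split_sentences(abstract):
        for relation, trigger in RELATION_TRIGGERS.items():
            for match in re.finditer(trigger, sentence, flags=re.IGNORECASE):
                # Entities left of the verb are sources, right of it are targets.
                sources = find_entities(sentence[:match.start()], entities)
                targets = find_entities(sentence[match.end():], entities)
                for s in sources:
                    for t in targets:
                        if s != t:
                            edges.add(Edge(s, t, relation, pmid, title, edge_type))
    return sorted(edges, key=lambda e: (e.source, e.relation, e.target))


if __name__ == "__main__":
    demo = "AKT phosphorylates MTOR in neurons. TNF activates STAT3."
    for edge in extract_edges(demo, "00000000", "Demo title", ["AKT", "MTOR", "TNF", "STAT3"]):
        print(edge.source, edge.relation, edge.target, edge.pmid)
```

Collecting edges in a set deduplicates repeated statements within one abstract, and carrying the PMID and title on each record keeps the evidence next to the edge at insert time.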

Key Entities to Target

Top gene/protein entities: MTOR, AKT, TNF, PI3K, BDNF, APOE, APP, TAU, NLRP3, AMPK, SQSTM1, PINK1, TREM2, STAT3, NRF2, PARKIN, IL-6, LRRK2, SOD1, BACE1, SNCA, MAPT, GBA, FUS, TDP-43, PSEN1, PSEN2, CLU, BIN1, CD33
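
One way to arrive at a target list like this is to rank entities by the gap between their co-occurrence edge count and their NLP-extracted edge count. The sketch below assumes edges are available as `(source, target, edge_type)` tuples. The `cooccurrence` type name, the `pubmed_nlp` prefix, and the threshold are hypothetical, since the KG schema is not given in this spec.

```python
from collections import Counter

COOCCURRENCE_TYPE = "cooccurrence"  # assumed edge_type for co-occurrence edges
NLP_PREFIX = "pubmed_nlp"           # assumed prefix for NLP-extracted edge types


def rank_targets(edges, min_cooccurrence=2, top_n=10):
    """Rank entities with many co-occurrence edges but few NLP edges first."""
    cooc, nlp = Counter(), Counter()
    for source, target, edge_type in edges:
        if edge_type == COOCCURRENCE_TYPE:
            bucket = cooc
        elif edge_type.startswith(NLP_PREFIX):
            bucket = nlp
        else:
            continue  # other edge types do not affect the ranking
        bucket[source] += 1
        bucket[target] += 1
    candidates = [e for e, n in cooc.items() if n >= min_cooccurrence]
    # Largest coverage gap first; ties are broken alphabetically.
    candidates.sort(key=lambda e: (-(cooc[e] - nlp[e]), e))
    return candidates[:top_n]


if __name__ == "__main__":
    sample = [
        ("MTOR", "AKT", "cooccurrence"),
        ("MTOR", "TNF", "cooccurrence"),
        ("MTOR", "BDNF", "cooccurrence"),
        ("AKT", "TNF", "cooccurrence"),
        ("AKT", "MTOR", "pubmed_nlp_v10"),
        ("AKT", "TNF", "pubmed_nlp_v10"),
    ]
    print(rank_targets(sample))
```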

Work Log

  • 2026-04-02: Started. Current KG: 668,364 edges, 6,095 papers with abstracts. Building extraction script targeting gene entities with high co-occurrence but lower NLP coverage.
  • 2026-04-02: Completed. Built expand_kg_pubmed_v10.py. Fetched 2,121 new papers from PubMed across 257 queries (79 target genes × 3 templates + 20 cross-entity queries). Extracted and inserted 3,225 new edges (edge_type=pubmed_nlp_v10) and stored ~3,900 new papers. Total KG now ~679,768 edges, ~10,909 papers. Relation distribution: expressed_in (597), regulates (492), associated_with (442), activates (420), interacts_with (300), inhibits (227), causes (194), phosphorylates (173).
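
The query count in the log (79 × 3 + 20 = 257) can be reproduced with a small query builder. The three templates, the way cross-entity pairs are chosen, and the helper names below are illustrative assumptions; the actual templates in expand_kg_pubmed_v10.py are not listed in this spec. `filter_new_pmids` sketches the "not already in DB" check from the Approach.

```python
from itertools import combinations, islice

# Hypothetical PubMed query templates; [Title/Abstract] is a standard PubMed field tag.
TEMPLATES = (
    "{gene}[Title/Abstract] AND neurodegeneration[Title/Abstract]",
    "{gene}[Title/Abstract] AND signaling[Title/Abstract]",
    "{gene}[Title/Abstract] AND (mutation OR variant)",
)


def build_queries(genes, templates=TEMPLATES, n_cross=20):
    """Return one query per gene per template, plus n_cross gene-pair queries."""
    per_gene = [t.format(gene=g) for g in genes for t in templates]
    # Assumption: cross-entity queries simply pair up the first few genes.
    pairs = islice(combinations(genes, 2), n_cross)
    cross = [f"{a}[Title/Abstract] AND {b}[Title/Abstract]" for a, b in pairs]
    return per_gene + cross


def filter_new_pmids(fetched, known):
    """Keep PMIDs not already stored, preserving order and dropping repeats."""
    seen = set(known)
    fresh = []
    for pmid in fetched:
        if pmid not in seen:
            seen.add(pmid)
            fresh.append(pmid)
    return fresh


if __name__ == "__main__":
    placeholder_genes = [f"GENE{i}" for i in range(79)]  # stand-ins for the 79 targets
    print(len(build_queries(placeholder_genes)))
    print(filter_new_pmids(["1", "2", "2", "3"], known={"1"}))
```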

    File: e8ba526a_kg_edges_pubmed_spec.md
    Modified: 2026-05-01 20:13
    Size: 1.5 KB