[Atlas] Extract KG edges from 800+ unmined paper abstracts

ID: e8b9010e-f1d | Priority: 82 | Type: one_shot | Status: completed

Goal

About 850 papers have abstracts that have not yet been processed for knowledge graph (KG) edge extraction. Run NLP pattern matching on these abstracts to discover new gene-disease-pathway relationships. Target: 1,000+ new edges.
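As a rough illustration of what pattern-based edge extraction over abstracts can look like, here is a minimal sketch. It is not the contents of scripts/extract_more_edges.py, which this spec does not show. The entity lists, the verb patterns, and the function name `extract_edges` are all hypothetical; only the relation names (`causes`, `inhibits`, `promotes`, `co_discussed`) are taken from the work log.

```python
import re
from itertools import product

# Hypothetical vocabularies; the real script's entity sources are not shown in this spec.
GENES = {"APOE", "TREM2", "MAPT"}
DISEASES = {"Alzheimer's disease", "Parkinson's disease"}

# Hypothetical verb patterns, keyed by relation names that appear in the work log.
RELATION_VERBS = {
    "causes": r"causes?|caused",
    "inhibits": r"inhibits?|inhibited",
    "promotes": r"promotes?|promoted",
}


def extract_edges(abstract):
    """Return a set of (gene, relation, disease) triples found in one abstract."""
    edges = set()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        genes = [g for g in GENES if re.search(rf"\b{re.escape(g)}\b", sentence)]
        diseases = [d for d in DISEASES if d.lower() in sentence.lower()]
        for gene, disease in product(genes, diseases):
            # Fall back to co_discussed when no explicit verb links the pair.
            relation = "co_discussed"
            for name, verbs in RELATION_VERBS.items():
                pattern = (
                    rf"\b{re.escape(gene)}\b.{{0,60}}?"
                    rf"\b(?:{verbs})\b.{{0,60}}?{re.escape(disease)}"
                )
                if re.search(pattern, sentence, flags=re.IGNORECASE):
                    relation = name
                    break
            edges.add((gene, relation, disease))
    return edges
```

In a sketch like this, most pairs fall through to `co_discussed`, since same-sentence co-occurrence is far more common than an explicit verb pattern. That would be in line with the breakdowns recorded in the work log, where `co_discussed` dominates.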

Acceptance Criteria

☑ Concrete deliverables created
☑ Work log updated with timestamped entry

Work Log

2026-04-18 01:30 PT — Slot minimax:67

  • Task completed: committed script scripts/extract_more_edges.py to worktree branch
  • Script extracts gene-disease-pathway KG edges via NLP pattern matching on paper abstracts
  • Ran script: processed 16,195 papers with abstracts
  • Extracted 131,358 raw edges → 8,896 unique → 173 truly new (deduped against existing KG)
  • Current KG state: 7,631 nlp_batch2_extracted edges total (target 1,000+ MET)
  • Breakdown: co_discussed:7512, contributes_to:52, promotes:22, causes:17, mediates:8, protects_against:4, expressed_in:4, inhibits:3, activates:3, treats:2, targets_gene:2, participates_in:1, interacts_with:1
  • Note: Prior runs already populated most extractable edges; only 173 new edges added this run
  • Pushed script commit to branch for merge
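The raw → unique → truly new funnel reported in this entry can be pictured as two set operations: collapse duplicate triples, then subtract the triples already in the KG. The sketch below rests on assumptions. The normalization rule (trim whitespace, upper-case entities, lower-case relations) and the name `dedupe_edges` are hypothetical, and the actual script may key edges differently.

```python
def dedupe_edges(raw_edges, existing_edges):
    """Collapse raw triples to unique ones, then drop those already in the KG.

    Both arguments are iterables of (source, relation, target) string triples.
    Returns (unique, new), where new is the subset of unique not in existing_edges.
    """

    def key(edge):
        # Hypothetical normalization so trivial variants collapse to one edge.
        source, relation, target = edge
        return (source.strip().upper(), relation.strip().lower(), target.strip().upper())

    unique = {key(edge) for edge in raw_edges}
    existing = {key(edge) for edge in existing_edges}
    new = unique - existing
    return unique, new
```

For example, calling `dedupe_edges` with three raw triples, two of which differ only in case and whitespace, and one existing KG edge that matches the remaining triple yields two unique edges and one truly new edge.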

2026-04-16 23:05 PT — Slot minimax:71

  • Task completed: pushed 2 commits to branch orchestra/task/e8b9010e-extract-kg-edges-from-800-unmined-paper
  • Added script: scripts/extract_more_edges.py (NLP pattern matching for gene-disease-pathway KG edges)
  • Extracted 5,782 NEW edges from 16,241 paper abstracts (total unique: 9,525; deduped against KG: 5,782 truly new)
  • Prior agent (2026-04-02) added 1,676 nlp_batch2 edges; combined total: 7,458 nlp_batch2 edges in KG
  • Total KG now: 706,542 edges (up from 700,760)
  • Breakdown: co_discussed:7362, contributes_to:36, promotes:18, causes:15, mediates:8, expressed_in:4, protects_against:4, activates:3, treats:2, targets_gene:2, inhibits:2, interacts_with:1
  • Pushed via git push gh HEAD
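A per-relation breakdown line in the `relation:count` format used in this log could be produced with a simple tally, as in the sketch below. The function name `format_breakdown` and the triple layout are assumptions for illustration. The log does not say how the breakdown was actually generated.

```python
from collections import Counter


def format_breakdown(edges):
    """Format (source, relation, target) triples as 'relation:count, ...'.

    Relations are listed from most to least frequent.
    """
    counts = Counter(relation for _source, relation, _target in edges)
    return ", ".join(f"{relation}:{count}" for relation, count in counts.most_common())
```

Summing the counts in such a line and comparing the result with the reported edge total gives a quick consistency check on a log entry.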

2026-04-16 22:55 PT — Slot minimax:71

  • Task reopened by audit: prior agent's commit (4797cfcbe) was orphaned — work was done but push failed
  • Verified: 1,676 nlp_batch2_extracted edges already exist in DB (created 2026-04-02T11:40:13)
  • Verified: 5,767 additional truly new edges available from remaining ~15,813 unprocessed papers
  • Approach: copy script from archive to scripts/, run it to add remaining new edges, commit properly
  • Ran sampling check: 100-paper sample → 2 new edges; full run projects ~5,767 truly new edges
  • Target (1,000+ edges): MET by prior run; adding remaining edges for full extraction

File: e8b9010e_f1d_spec.md
Modified: 2026-05-01 20:13
Size: 2.6 KB