[Atlas] Extract KG edges from 800+ unmined paper abstracts
ID: e8b9010e-f1d
Priority: 82
Type: one_shot
Status: completed
Goal
About 850 papers have abstracts that haven't been processed for KG edge extraction. Run NLP pattern matching on these to discover new gene-disease-pathway relationships. Target: 1000+ new edges.
Acceptance Criteria
☑ Concrete deliverables created
☑ Work log updated with timestamped entry
Work Log
2026-04-18 01:30 PT — Slot minimax:67
- Task completed: committed script
scripts/extract_more_edges.py to worktree branch
- Script extracts gene-disease-pathway KG edges via NLP pattern matching on paper abstracts
- Ran script: processed 16,195 papers with abstracts
- Extracted 131,358 raw edges → 8,896 unique → 173 truly new (deduped against existing KG)
- Current KG state: 7,631 nlp_batch2_extracted edges total (target 1,000+ MET)
- Breakdown: co_discussed:7512, contributes_to:52, promotes:22, causes:17, mediates:8, protects_against:4, expressed_in:4, inhibits:3, activates:3, treats:2, targets_gene:2, participates_in:1, interacts_with:1
- Note: Prior runs already populated most extractable edges; only 173 new edges added this run
- Pushed script commit to branch for merge
2026-04-16 23:05 PT — Slot minimax:71
- Task completed: pushed 2 commits to branch
orchestra/task/e8b9010e-extract-kg-edges-from-800-unmined-paper
- Added script:
scripts/extract_more_edges.py (NLP pattern matching for gene-disease-pathway KG edges)
- Extracted 5,782 NEW edges from 16,241 paper abstracts (total unique: 9,525; deduped against KG: 5,782 truly new)
- Prior agent (2026-04-02) added 1,676 nlp_batch2 edges; combined total: 7,458 nlp_batch2 edges in KG
- Total KG now: 706,542 edges (up from 700,760)
- Breakdown: co_discussed:7362, contributes_to:36, promotes:18, causes:15, mediates:8, expressed_in:4, protects_against:4, activates:3, treats:2, targets_gene:2, inhibits:2, interacts_with:1
- Pushed via
git push gh HEAD
2026-04-16 22:55 PT — Slot minimax:71
- Task reopened by audit: prior agent's commit (4797cfcbe) was orphaned — work was done but push failed
- Verified: 1,676
nlp_batch2_extracted edges already exist in DB (created 2026-04-02T11:40:13)
- Verified: 5,767 additional truly new edges available from remaining ~15,813 unprocessed papers
- Approach: copy script from archive to scripts/, run it to add remaining new edges, commit properly
- Ran sampling check: 100-paper sample → 2 new edges; full run projects ~5,767 truly new edges
- Target (1,000+ edges): MET by prior run; adding remaining edges for full extraction