[Atlas] Extract KG edges from 800+ unmined paper abstracts (done)

About 850 papers have abstracts that haven't been processed for KG edge extraction. Run NLP pattern matching on these to discover new gene-disease-pathway relationships. Target: 1000+ new edges.

## REOPENED TASK — CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:

- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (6)

[Atlas] Work log: verify edge extraction — 7,631 nlp_batch2 edges in KG [task:e8b9010e-f1d8-4cbc-892c-3219a9cca325] 2026-04-18
[Atlas] Add extract_more_edges.py — NLP KG edge extraction from abstracts [task:e8b9010e-f1d8-4cbc-892c-3219a9cca325] 2026-04-18
[Atlas] Work log: complete — 5,782 new KG edges extracted [task:e8b9010e-f1d8-4cbc-892c-3219a9cca325] 2026-04-16
[Atlas] Add extract_more_edges.py — NLP KG edge extraction from abstracts [task:e8b9010e-f1d8-4cbc-892c-3219a9cca325] 2026-04-16
[Atlas] Extract KG edges from 800+ unmined paper abstracts [task:e8b9010e-f1d8-4cbc-892c-3219a9cca325] 2026-04-16
[Atlas] Extract 1,676 new KG edges from paper abstracts 2026-04-02

Spec File

[Atlas] Extract KG edges from 800+ unmined paper abstracts

ID: e8b9010e-f1d Priority: 82 Type: one_shot Status: completed

Goal

About 850 papers have abstracts that haven't been processed for KG edge extraction. Run NLP pattern matching on these to discover new gene-disease-pathway relationships. Target: 1000+ new edges.

Acceptance Criteria

☑ Concrete deliverables created
☑ Work log updated with timestamped entry

Work Log

2026-04-18 01:30 PT — Slot minimax:67

  • Task completed: committed script scripts/extract_more_edges.py to worktree branch
  • Script extracts gene-disease-pathway KG edges via NLP pattern matching on paper abstracts
  • Ran script: processed 16,195 papers with abstracts
  • Extracted 131,358 raw edges → 8,896 unique → 173 truly new (deduped against existing KG)
  • Current KG state: 7,631 nlp_batch2_extracted edges total (target 1,000+ MET)
  • Breakdown: co_discussed:7512, contributes_to:52, promotes:22, causes:17, mediates:8, protects_against:4, expressed_in:4, inhibits:3, activates:3, treats:2, targets_gene:2, participates_in:1, interacts_with:1
  • Note: Prior runs already populated most extractable edges; only 173 new edges added this run
  • Pushed script commit to branch for merge
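
The raw → unique → truly-new funnel described in this entry can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/extract_more_edges.py: the regex patterns, the `extract_edges`, `normalize`, and `new_edges` names, and the uppercase normalization are all assumptions made for the example, since the log does not show the script's actual patterns or dedup key.

```python
import re

# Hypothetical pattern table (relation name -> regex). The real patterns
# used by the script are not shown in this work log.
PATTERNS = {
    "inhibits": re.compile(
        r"\b(?P<src>[A-Z][A-Z0-9]{1,9})\s+inhibits\s+(?P<dst>[A-Za-z0-9-]+)"
    ),
    "activates": re.compile(
        r"\b(?P<src>[A-Z][A-Z0-9]{1,9})\s+activates\s+(?P<dst>[A-Za-z0-9-]+)"
    ),
}


def extract_edges(abstract):
    """Return every (source, relation, target) match found in one abstract."""
    edges = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(abstract):
            edges.append((match.group("src"), relation, match.group("dst")))
    return edges


def normalize(edge):
    """Assumed dedup key: case-insensitive on both endpoints."""
    src, relation, dst = edge
    return (src.upper(), relation, dst.upper())


def new_edges(abstracts, existing):
    """Return (raw, unique, truly_new) for a batch of abstracts.

    raw       -- every match, duplicates included
    unique    -- raw edges collapsed by the normalized key
    truly_new -- unique edges not already present in the existing KG
    """
    raw = [edge for text in abstracts for edge in extract_edges(text)]
    unique = {normalize(edge) for edge in raw}
    truly_new = unique - {normalize(edge) for edge in existing}
    return raw, unique, truly_new
```

Under these assumptions, a rerun over papers that earlier runs already mined yields many raw and unique edges but few truly new ones, which is the shape of the counts reported above.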

2026-04-16 23:05 PT — Slot minimax:71

  • Task completed: pushed 2 commits to branch orchestra/task/e8b9010e-extract-kg-edges-from-800-unmined-paper
  • Added script: scripts/extract_more_edges.py (NLP pattern matching for gene-disease-pathway KG edges)
  • Extracted 5,782 NEW edges from 16,241 paper abstracts (total unique: 9,525; deduped against KG: 5,782 truly new)
  • Prior agent (2026-04-02) added 1,676 nlp_batch2 edges; combined total: 7,458 nlp_batch2 edges in KG
  • Total KG now: 706,542 edges (up from 700,760)
  • Breakdown: co_discussed:7362, contributes_to:36, promotes:18, causes:15, mediates:8, expressed_in:4, protects_against:4, activates:3, treats:2, targets_gene:2, inhibits:2, interacts_with:1
  • Pushed via git push gh HEAD
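
As a quick consistency check on the totals in this entry, the reported figures can be reconciled with a few lines of arithmetic. The variable names are illustrative only; the numbers are copied from the bullets above.

```python
# Figures reported in the 2026-04-16 23:05 PT entry.
kg_before = 700_760      # total KG edges before this run
kg_after = 706_542       # total KG edges after this run
new_this_run = 5_782     # edges reported as truly new
prior_batch = 1_676      # nlp_batch2 edges added on 2026-04-02

# Growth of the whole KG should equal the number of new edges inserted.
kg_delta = kg_after - kg_before

# The combined nlp_batch2 total is the prior batch plus this run.
combined_nlp_batch2 = prior_batch + new_this_run

print(kg_delta, combined_nlp_batch2)
```

Both derived values agree with the log: the KG grew by 5,782 edges, and the combined nlp_batch2 total is 7,458.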

2026-04-16 22:55 PT — Slot minimax:71

  • Task reopened by audit: prior agent's commit (4797cfcbe) was orphaned — work was done but push failed
  • Verified: 1,676 nlp_batch2_extracted edges already exist in DB (created 2026-04-02T11:40:13)
  • Verified: 5,767 additional truly new edges available from remaining ~15,813 unprocessed papers
  • Approach: copy script from archive to scripts/, run it to add remaining new edges, commit properly
  • Ran sampling check: 100-paper sample → 2 new edges; full run projects ~5,767 truly new edges
  • Target (1,000+ edges): MET by prior run; adding remaining edges for full extraction

Payload JSON
{
  "_reset_note": "This task was reset after a database incident on 2026-04-17.\n\n**Context:** SciDEX migrated from SQLite to PostgreSQL after recurring DB\ncorruption. Some work done during Apr 16-17 may have been lost.\n\n**Before starting work:**\n1. Check if the task's goal is ALREADY satisfied (run the relevant checks)\n2. Check `git log --all --grep=task:YOUR_TASK_ID` for prior commits\n3. If complete, verify and mark done. If partial, continue. If not done, proceed.\n\n**DB change:** SciDEX now uses PostgreSQL. `get_db()` auto-detects via\nSCIDEX_DB_BACKEND=postgres env var.",
  "_reset_at": "2026-04-18T06:29:22.046013+00:00",
  "_reset_from_status": "done"
}
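
The reset note says `get_db()` auto-detects the backend from the `SCIDEX_DB_BACKEND` environment variable. A minimal sketch of that kind of detection is shown below. The `detect_backend` helper is hypothetical and is not SciDEX's actual `get_db()`; in particular, falling back to SQLite when the variable is unset is an assumption based only on the note that SQLite was the previous backend.

```python
import os


def detect_backend(env=None):
    """Hypothetical helper: choose a backend name from SCIDEX_DB_BACKEND.

    Returns "postgres" when the variable is set to that value. Otherwise
    it falls back to "sqlite", which is an assumption, not documented
    SciDEX behavior.
    """
    env = os.environ if env is None else env
    value = env.get("SCIDEX_DB_BACKEND", "").strip().lower()
    return "postgres" if value == "postgres" else "sqlite"
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.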
