[Forge] Build automated PubMed update pipeline for hypothesis evidence

← All Specs

[Forge] Build automated PubMed update pipeline for hypothesis evidence

Task ID: f4f82aa1-d80c-4199-844d-6fbfb44b15a0

Goal

Create a recurring pipeline that searches PubMed for new papers related to top hypotheses and updates evidence_for/evidence_against fields. The pipeline should deduplicate, track audit trails, and run as a daemon.

Acceptance Criteria

☑ Pipeline searches PubMed for recent papers matching hypothesis terms
☑ Deduplicates by PMID before adding new evidence
☑ Tracks per-hypothesis last-checked timestamps
☑ Audit trail in pubmed_updates table
☑ Papers saved to corpus (papers table)
☑ Runs as systemd daemon (scidex-pubmed-update.service)
☑ Skips recently-checked hypotheses to avoid API waste
☑ DB lock retry logic for concurrent access

Approach

Pipeline already existed (pubmed_update_pipeline.py) with daemon mode and systemd service. Improvements:

  • Added staleness check: get_hypotheses_to_update now LEFT JOINs pubmed_update_log to skip hypotheses checked within min_hours_since_check (default 12h)
  • Verified all DB writes use _db_execute_with_retry for lock resilience
  • Work Log

    2026-04-02 11:45 PT — Slot 21

    • Reviewed existing pipeline: 557 lines, daemon mode, systemd service already running
    • Found running service crashed on DB lock because it was using older code version where _update_log used raw conn.execute instead of _db_execute_with_retry
    • Current code already has the fix, service needs restart to pick it up
    • Added staleness optimization: get_hypotheses_to_update now skips hypotheses checked within 12h via LEFT JOIN on pubmed_update_log
    • Tested dry run: pipeline correctly found 5 stale hypotheses and processed them
    • DB stats: 149 hypotheses tracked, 904 papers in audit trail
    • Result: Done — pipeline operational, added staleness check optimization

    File: f4f82aa1_spec.md
    Modified: 2026-05-01 20:13
    Size: 1.8 KB