[Atlas] Preprint-to-publication tracker - link arXiv/bioRxiv to PMID done

Quest: Multi-Source Literature Search
Weekly poller links preprints to peer-reviewed versions via Crossref relations + EPMC + Semantic Scholar.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717) 2026-04-27
[Atlas] Preprint-to-publication tracker: link arXiv/bioRxiv to PMID [task:7e9cb60c-6b1b-4f35-9cc1-c07969101d52] (#661) 2026-04-27

Spec File

Goal

For every preprint cited in a hypothesis or analysis (bioRxiv, medRxiv,
arXiv), poll Crossref + Europe PMC weekly to detect when the peer-reviewed
version lands. When detected, link the two records, surface the upgrade on
the originating hypothesis page, and emit an event the Skeptic can subscribe
to ("the preprint you doubted is now in Cell").

Why this matters

Right now an agent can cite bioRxiv 2024.05.12.594321 and that citation
just stays a preprint reference forever, even if the same paper was published
in Nature six months later. We lose the credibility uplift, fail to update
inline citation freshness, and miss a major signal for hypothesis re-scoring.

Acceptance Criteria

☐ Migration preprint_publication_link(preprint_id, preprint_source,
preprint_doi, published_doi, published_pmid, published_journal,
detection_method, detected_at).
☐ scripts/preprint_publication_poller.py walks distinct preprint DOIs
cited in the last 12 months and queries:
- Crossref relation field (is-preprint-of),
- Europe PMC crossReferences for matching titles,
- Semantic Scholar externalIds.
☐ On a hit, write the link row, update each citing hypothesis's
evidence_version, and POST an event to event_bus of type
preprint_published with the (preprint, published) pair.
☐ /hypothesis/<id> page replaces preprint citations with
"preprint → Nature 2024" badges sourced from the new table.
☐ Skeptic persona consumes the event and re-issues only those debate
rounds where the upgraded paper materially changes the picture
(heuristic: the published version has more than 5× the citations of the preprint).
☐ Runs on scidex-preprint-tracker.timer weekly Saturday 04:00 UTC.
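
The link table from the first criterion could look roughly like the sketch below. It uses SQLite through Python's sqlite3 module purely for illustration. The column names come from the spec; the column types, the choice of preprint_id as primary key, and the record_link helper are assumptions, not the contents of the actual migration.

```python
import sqlite3
from datetime import datetime, timezone

# Column names are taken from the acceptance criteria.
COLUMNS = (
    "preprint_id", "preprint_source", "preprint_doi", "published_doi",
    "published_pmid", "published_journal", "detection_method", "detected_at",
)

# Types and constraints are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS preprint_publication_link (
    preprint_id       TEXT PRIMARY KEY,
    preprint_source   TEXT NOT NULL,
    preprint_doi      TEXT,
    published_doi     TEXT,
    published_pmid    TEXT,
    published_journal TEXT,
    detection_method  TEXT NOT NULL,
    detected_at       TEXT NOT NULL
)
"""


def record_link(conn, **fields):
    """Insert or overwrite one preprint-to-publication link (hypothetical helper)."""
    row = {col: fields.get(col) for col in COLUMNS}
    # Stamp the detection time in UTC when the caller does not supply one.
    row["detected_at"] = row["detected_at"] or datetime.now(timezone.utc).isoformat()
    placeholders = ", ".join(":" + col for col in COLUMNS)
    conn.execute(
        f"INSERT OR REPLACE INTO preprint_publication_link "
        f"({', '.join(COLUMNS)}) VALUES ({placeholders})",
        row,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Values below are the hit reported in the work log.
record_link(
    conn,
    preprint_id="arxiv.0711.0409",
    preprint_source="arxiv",
    published_pmid="18386982",
    detection_method="title_match",
)
```

Keying on preprint_id means a later, better detection (for example a Crossref relation replacing a title match) overwrites the earlier row instead of duplicating it; whether the real table does this is not stated in the spec.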

Approach

  • Crossref relation fields are the most reliable signal — start there.
  • Title-shingle (token Jaccard >0.85) as last-resort match.
  • Persist detection_method so we can audit false-positive linking.
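
The last-resort title match can be sketched as a Jaccard similarity over token sets. This is a minimal illustration, assuming single-word tokens after lowercasing and stripping punctuation; the poller may shingle titles differently, and the function names here are hypothetical.

```python
import re


def token_set(title):
    """Lowercase a title and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_jaccard(a, b):
    """Jaccard similarity of two titles: |A ∩ B| / |A ∪ B| over their token sets."""
    ta, tb = token_set(a), token_set(b)
    if not ta or not tb:
        return 0.0  # An empty title never matches anything.
    return len(ta & tb) / len(ta | tb)


def is_title_match(a, b, threshold=0.85):
    """Apply the >0.85 threshold named in the approach notes."""
    return title_jaccard(a, b) > threshold
```

Because a set-based score ignores word order and repeated words, two unrelated papers with near-identical vocabularies can still collide, which is one reason to persist detection_method and audit title_match rows separately.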
Dependencies

  • Existing crossref_paper_metadata and europe_pmc_search in
    scidex/forge/tools.py.
  • event_bus.py.

Work Log

2026-04-27 04:00 UTC - Slot 0

  • Migration created: scripts/migrations/003_add_preprint_publication_link.py
    - Table: preprint_publication_link with preprint_id, preprint_source, preprint_doi, published_doi, published_pmid, published_journal, detection_method, detected_at, created_at, updated_at
    - History table: preprint_publication_link_history with UPDATE/DELETE triggers
    - Verified: dry-run and real run both succeed
  • Poller script created: scripts/preprint_publication_poller.py
    - Queries preprint DOIs from papers table (biorxiv, medrxiv, arxiv)
    - Checks Crossref relations API for is-preprint-of links
    - Falls back to Semantic Scholar externalIds and Europe PMC title match (Jaccard >0.85)
    - Writes to preprint_publication_link table and emits preprint_published events
    - Verified: dry-run works, found arxiv.0711.0409 → PMID 18386982 via title_match
  • Event bus updated: Added preprint_published to EVENT_TYPES in scidex/core/event_bus.py
  • Hypothesis page updated: api.py:hypothesis_detail_page()
    - Collects preprint DOIs from evidence_for and evidence_against
    - Batch queries preprint_publication_link table
    - Renders "preprint → Journal Year" badges with PMID links
    - Syntax verified: python3 -m py_compile api.py passes
  • Testing: API status returns 200, hypothesis page /hypothesis/h-aging-h7-prs-aging-convergence returns 200
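
The Crossref step can be illustrated with a small parser for the relation field. The sketch assumes its input is the message object of a Crossref works record (as returned by GET https://api.crossref.org/works/{doi}), where relation maps relation names to lists of entries carrying id-type and id. The function name and the sample DOIs are made up for illustration, and nothing here reflects how crossref_paper_metadata is actually written.

```python
def published_dois_from_crossref(message):
    """Return the DOIs a Crossref work record lists under relation['is-preprint-of']."""
    relations = message.get("relation") or {}
    dois = []
    for entry in relations.get("is-preprint-of", []):
        # Keep only DOI-typed targets; other identifier types are skipped.
        if entry.get("id-type") == "doi" and entry.get("id"):
            doi = entry["id"].lower()  # DOIs are case-insensitive, so normalise.
            if doi not in dois:
                dois.append(doi)
    return dois


# Hypothetical record trimmed to the fields the parser reads.
sample_message = {
    "DOI": "10.1101/example-preprint",
    "relation": {
        "is-preprint-of": [
            {"id-type": "doi", "id": "10.1000/Example-Published", "asserted-by": "subject"}
        ]
    },
}

print(published_dois_from_crossref(sample_message))
```

An empty list means Crossref asserts no published version, at which point the poller would move on to the Semantic Scholar and Europe PMC fallbacks described above.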
