[Forge] Replication runner - re-run analysis on a held-out data slice done

Auto-replicate finalised analyses with temporal_holdout/cohort_swap/seed_perturb/model_swap; replication_status feeds epistemic_tiers.

Completion Notes

Implementation on branch; supervisor squash-merges via orchestra sync push. Files: scidex/atlas/replication_runner.py (replicate, compute_replication_status, get_replicas, queue_daily_replications), migrations/128_replication_tables.py, scidex/senate/epistemic_tiers.py (+replication ratchet), api.py (Replications panel), tests/atlas/test_replication_runner.py (24 tests passing). All acceptance criteria addressed.

Git Commits (2)

[Forge] Update replication runner spec work log [task:3b558828-86c3-4582-923a-6fd6344736d7] 2026-04-27
[Forge] Replication runner: replicate() analysis on held-out data slice [task:3b558828-86c3-4582-923a-6fd6344736d7] 2026-04-27
Spec File

Goal

Trust scores treat single-shot analyses as equally credible as
replicated ones. We have replication_clustering.py for clustering
existing replications but no automated generator of new ones. Build
a Replication Runner: take a finalised analysis, re-execute it
verbatim on a different data slice (different cohort, different
publication-year window, different gene-list seed), then auto-compare
verdicts and assign a replication_status ∈ {confirmed, partial,
contradicted, untestable} to the original artifact.

Acceptance Criteria

scidex/atlas/replication_runner.py::replicate(analysis_id, slice_strategy) returns a new analysis_id linked via replication_links(parent_id, replica_id, strategy, similarity, verdict_match).
☐ Slice strategies: temporal_holdout (papers published after 2024 only), cohort_swap (other disease cohort if available), seed_perturb (different starter gene list within the same domain), model_swap (Sonnet instead of Opus).
☐ When ≥ 2 replicas land, compute replication_status and write to analyses.replication_status + replication_history(analysis_id, status, n_replicas, verdict_consistency, computed_at).
epistemic_tiers.classify_* reads replication_status; confirmed ratchets tier toward T1, contradicted ratchets toward T4.
☐ Daily cron picks 5 highest-priority T2/T3 analyses lacking replicas and queues replicas (auction-priced).
/analysis/{id} shows a "Replications" panel with each replica's verdict + similarity heatmap.
☐ Test: fixture analysis A; replicate with temporal_holdout produces analysis B; verdict comparison correct; status set to confirmed when both verdicts agree.
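
A minimal sketch of how the status computation in the criteria above might look. The spec only names the four status values and the "≥ 2 replicas" trigger; the mapping used here (all testable replicas agree gives confirmed, none agree gives contradicted, mixed gives partial, no testable replica gives untestable) is an assumption, as are the ReplicaOutcome record and the function signature, which may differ from the one in scidex/atlas/replication_runner.py.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# From the acceptance criteria: status is computed once >= 2 replicas land.
MIN_REPLICAS = 2


@dataclass(frozen=True)
class ReplicaOutcome:
    """Hypothetical record of one replica.

    verdict_match is None when the replica produced no comparable verdict.
    """
    replica_id: str
    verdict_match: Optional[bool]


def compute_replication_status(outcomes: Sequence[ReplicaOutcome]) -> Optional[dict]:
    """Return a replication_history-style row, or None while too few replicas exist."""
    if len(outcomes) < MIN_REPLICAS:
        return None

    testable = [o for o in outcomes if o.verdict_match is not None]
    if not testable:
        status, consistency = "untestable", None
    else:
        # Fraction of testable replicas whose verdict matches the original.
        consistency = sum(o.verdict_match for o in testable) / len(testable)
        if consistency == 1.0:
            status = "confirmed"      # assumed rule: every replica agrees
        elif consistency == 0.0:
            status = "contradicted"   # assumed rule: no replica agrees
        else:
            status = "partial"        # assumed rule: mixed results

    return {
        "status": status,
        "n_replicas": len(outcomes),
        "verdict_consistency": consistency,
    }
```

The returned keys mirror the replication_history columns listed above; analysis_id and computed_at would be filled in by whatever writes the row.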

Approach

  • Reuse the agent.py debate-spawn entry point but pass an overridden seed corpus.
  • verdict_match = cosine similarity over verdict embeddings (use existing vector_search engine for embeddings) plus dimension-by-dimension agreement.
  • Slice strategy temporal_holdout: filter papers.published_year >= cutoff in the seed corpus; record the cutoff in replication_links.config_json.
  • Refuse to replicate if the original analysis is < 24 h old (prereg outcomes might still be open).
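
The Approach bullets above can be sketched in plain Python. This is an illustration under stated assumptions, not the code on the branch: papers are modelled as dicts with a published_year key, verdict embeddings as plain float lists (the real ones would come from the existing vector_search engine), and verdict dimensions as a name-to-label mapping. All helper names are hypothetical. The spec does not say how cosine similarity and dimension agreement are combined, so the two scores are returned separately.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional, Sequence

# From the Approach: refuse analyses younger than 24 h.
MIN_AGE = timedelta(hours=24)


def temporal_holdout(papers: Sequence[Mapping], cutoff: int):
    """Keep papers with published_year >= cutoff; also return the config to record."""
    kept = [p for p in papers if p["published_year"] >= cutoff]
    # Assumed shape of what would be stored in replication_links.config_json.
    config = {"strategy": "temporal_holdout", "cutoff": cutoff}
    return kept, config


def old_enough(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if the original analysis is at least 24 h old."""
    now = now or datetime.now(timezone.utc)
    return now - created_at >= MIN_AGE


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dimension_agreement(original: Mapping[str, str], replica: Mapping[str, str]) -> float:
    """Fraction of shared verdict dimensions on which both analyses give the same label."""
    shared = original.keys() & replica.keys()
    if not shared:
        return 0.0
    return sum(original[k] == replica[k] for k in shared) / len(shared)
```

Note that with a >= comparison, the "published after 2024" slice corresponds to cutoff=2025.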

Dependencies

  • q-er-preregistration (replicas inherit the original prereg).
  • epistemic_tiers.py (consumes replication_status).
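
One way epistemic_tiers.py might consume replication_status is a one-step ratchet, sketched below. The spec only says that confirmed moves the tier toward T1 and contradicted toward T4; the four-tier ordering, the single-step size, and the choice to leave partial and untestable unchanged are assumptions, and ratchet_tier is a hypothetical name rather than one of the classify_* functions.

```python
from typing import Optional

# Assumed ordering, most trusted first; only T1-T4 are named in the spec.
TIERS = ("T1", "T2", "T3", "T4")

# Assumed step sizes: one tier per computed status.
_STEP = {"confirmed": -1, "contradicted": +1}


def ratchet_tier(tier: str, replication_status: Optional[str]) -> str:
    """Move one tier toward T1 or T4 based on replication_status, clamped at the ends."""
    idx = TIERS.index(tier) + _STEP.get(replication_status, 0)
    return TIERS[max(0, min(idx, len(TIERS) - 1))]
```

For example, ratchet_tier("T2", "confirmed") yields "T1", while a partial, untestable, or missing status leaves the tier as it was.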

Work Log

Payload JSON
{
  "completion_shas": [
    "cc8cb612f",
    "4561ca06f"
  ],
  "completion_shas_checked_at": ""
}
