SciDEX — Task: [Forge] Replication runner

Auto-replicate finalised analyses with temporal_holdout/cohort_swap/seed_perturb/model_swap; replication_status feeds epistemic_tiers.

Completion Notes

Implementation is on the branch; the supervisor squash-merges it via orchestra sync push. All acceptance criteria are addressed.

Files:
scidex/atlas/replication_runner.py (replicate, compute_replication_status, get_replicas, queue_daily_replications)
migrations/128_replication_tables.py
scidex/senate/epistemic_tiers.py (adds the replication ratchet)
api.py (Replications panel)
tests/atlas/test_replication_runner.py (24 tests passing)

Git Commits (2)

[Forge] Update replication runner spec work log [task:3b558828-86c3-4582-923a-6fd6344736d7] 2026-04-27

[Forge] Replication runner: replicate() analysis on held-out data slice [task:3b558828-86c3-4582-923a-6fd6344736d7] 2026-04-27

Spec File

Goal

Trust scores currently treat single-shot analyses as just as credible as replicated ones. We have replication_clustering.py for clustering existing replications, but no automated generator of new ones. Build a Replication Runner: take a finalised analysis, re-execute it verbatim on a different data slice (different cohort, different publication-year window, different gene-list seed), then automatically compare the verdicts and assign a replication_status ∈ {confirmed, partial, contradicted, untestable} to the original artifact.

Acceptance Criteria

☐ scidex/atlas/replication_runner.py::replicate(analysis_id, slice_strategy) returns a new analysis_id linked via replication_links(parent_id, replica_id, strategy, similarity, verdict_match).

☐ Slice strategies: temporal_holdout (only papers published after 2024), cohort_swap (a different disease cohort, if available), seed_perturb (a different starter gene list in the same domain), model_swap (Sonnet instead of Opus).

☐ When ≥ 2 replicas land, compute replication_status and write to analyses.replication_status + replication_history(analysis_id, status, n_replicas, verdict_consistency, computed_at).

☐ epistemic_tiers.classify_* reads replication_status — confirmed ratchets tier toward T1, contradicted ratchets toward T4.

☐ Daily cron picks 5 highest-priority T2/T3 analyses lacking replicas and queues replicas (auction-priced).

☐ /analysis/{id} shows a "Replications" panel with each replica's verdict + similarity heatmap.

☐ Test: fixture analysis A; replicate with temporal_holdout produces analysis B; verdict comparison correct; status set to confirmed when both verdicts agree.
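
The status computation described in the criteria above could be sketched as follows. This is a minimal illustration, not the code in replication_runner.py: the ReplicaOutcome type, the convention that a verdict_match of None marks a replica with no comparable verdict, and the 0.8 / 0.2 thresholds are all assumptions. Only the four status values and the "≥ 2 replicas" rule come from the spec.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ReplicaOutcome:
    """Hypothetical record for one replica of a parent analysis."""
    replica_id: str
    # True / False: the replica verdict agrees / disagrees with the parent.
    # None: the replica produced no comparable verdict (assumed convention).
    verdict_match: Optional[bool]


def compute_replication_status(
    outcomes: Sequence[ReplicaOutcome],
    min_replicas: int = 2,       # from the spec: compute once >= 2 replicas land
    confirm_at: float = 0.8,     # assumed threshold, not from the spec
    contradict_at: float = 0.2,  # assumed threshold, not from the spec
) -> Optional[dict]:
    """Return a replication_history-style row, or None if too few replicas."""
    if len(outcomes) < min_replicas:
        return None

    testable = [o for o in outcomes if o.verdict_match is not None]
    if not testable:
        return {
            "status": "untestable",
            "n_replicas": len(outcomes),
            "verdict_consistency": None,
        }

    # Fraction of testable replicas whose verdict matches the parent.
    consistency = sum(1 for o in testable if o.verdict_match) / len(testable)
    if consistency >= confirm_at:
        status = "confirmed"
    elif consistency <= contradict_at:
        status = "contradicted"
    else:
        status = "partial"

    return {
        "status": status,
        "n_replicas": len(outcomes),
        "verdict_consistency": consistency,
    }
```

Returning None below the replica minimum keeps "not yet computed" distinct from "untestable", which here is reserved for the case where replicas ran but none yielded a comparable verdict.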

Approach

Reuse the agent.py debate-spawn entry point but pass an overridden seed corpus.

verdict_match = cosine similarity over verdict embeddings (use existing vector_search engine for embeddings) plus dimension-by-dimension agreement.
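
As a rough sketch of that comparison: the code below takes the embeddings as plain lists of floats (how they are fetched from the vector_search engine is not shown, since its API is not described here) and represents per-dimension verdicts as a dict of labels. The function names, the dict representation, and the two thresholds used to turn the scores into a boolean are illustrative assumptions. The spec only says the two signals are combined, not how.

```python
import math
from typing import Mapping, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dimension_agreement(parent: Mapping[str, str], replica: Mapping[str, str]) -> float:
    """Fraction of shared verdict dimensions on which both analyses agree."""
    shared = parent.keys() & replica.keys()
    if not shared:
        return 0.0
    return sum(1 for key in shared if parent[key] == replica[key]) / len(shared)


def verdict_match(
    parent_vec: Sequence[float],
    replica_vec: Sequence[float],
    parent_dims: Mapping[str, str],
    replica_dims: Mapping[str, str],
    sim_threshold: float = 0.8,    # assumed, not from the spec
    agree_threshold: float = 0.5,  # assumed, not from the spec
) -> Tuple[float, float, bool]:
    """Return (similarity, agreement, matched) for one parent/replica pair."""
    similarity = cosine_similarity(parent_vec, replica_vec)
    agreement = dimension_agreement(parent_dims, replica_dims)
    matched = similarity >= sim_threshold and agreement >= agree_threshold
    return similarity, agreement, matched
```

Keeping the two scores separate in the return value lets replication_links store the raw similarity while the boolean feeds the status computation.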

Slice strategy temporal_holdout: keep only the seed-corpus papers with papers.published_year >= cutoff, and record the cutoff in replication_links.config_json.
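
A minimal sketch of that filter, assuming the seed corpus is available as a list of dicts carrying a published_year key. The function name and the exact shape of the config payload are illustrative; the spec only requires that the cutoff is recorded in config_json.

```python
import json
from typing import Dict, List, Tuple


def temporal_holdout(papers: List[Dict], cutoff: int) -> Tuple[List[Dict], str]:
    """Keep papers published in or after `cutoff`; return them with a config JSON string."""
    held_out = [
        paper
        for paper in papers
        # Papers with no recorded year are dropped rather than guessed at.
        if paper.get("published_year") is not None
        and paper["published_year"] >= cutoff
    ]
    # Assumed payload shape for replication_links.config_json.
    config_json = json.dumps(
        {"strategy": "temporal_holdout", "cutoff": cutoff}, sort_keys=True
    )
    return held_out, config_json
```

With integer years, a "papers after 2024" slice corresponds to cutoff=2025 under this >= comparison.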

Refuse to replicate if the original analysis is < 24 h old (prereg outcomes might still be open).
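
The age guard might look like the sketch below. The name can_replicate and the choice of timestamp to measure age from (passed in here as analysis_time) are assumptions; the 24-hour minimum is from the spec.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_ANALYSIS_AGE = timedelta(hours=24)  # from the spec: refuse if < 24 h old


def can_replicate(analysis_time: datetime, now: Optional[datetime] = None) -> bool:
    """True once the analysis is at least 24 hours old.

    Both datetimes must be timezone-aware; mixing aware and naive
    values raises TypeError on subtraction.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - analysis_time >= MIN_ANALYSIS_AGE
```

Accepting `now` as a parameter keeps the check deterministic in tests, which matters for a fixture-based test like the one in the acceptance criteria.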

Dependencies

q-er-preregistration (replicas inherit the original prereg).
epistemic_tiers.py (consumes replication_status).
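
How epistemic_tiers.py might consume replication_status can be sketched as a one-step ratchet. The tier scale ("T1" through "T4"), the single-step movement, and leaving partial and untestable unchanged are assumptions for illustration; the spec only states that confirmed moves toward T1 and contradicted toward T4.

```python
TIERS = ("T1", "T2", "T3", "T4")  # assumed scale; T1 is the most trusted


def ratchet_tier(tier: str, replication_status: str) -> str:
    """Move one tier toward T1 on 'confirmed', one toward T4 on 'contradicted'."""
    index = TIERS.index(tier)  # raises ValueError for an unknown tier
    if replication_status == "confirmed":
        index = max(index - 1, 0)
    elif replication_status == "contradicted":
        index = min(index + 1, len(TIERS) - 1)
    # 'partial' and 'untestable' leave the tier unchanged (assumption).
    return TIERS[index]
```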

Work Log

Payload JSON

{
  "completion_shas": [
    "cc8cb612f",
    "4561ca06f"
  ],
  "completion_shas_checked_at": ""
}

Sibling Tasks in Quest (Epistemic Rigor)

✓ [Atlas] Hypothesis predictions table — explicit falsifiability P95

✓ [Forge] Experiment results and validation pipeline P93

✓ [Atlas] Evidence chain provenance — trace every claim to ground truth P92

✓ [Agora] epi-01-PRED: Add hypothesis_predictions table for falsifiable predictions P92

✓ [Senate] Pre-registration - write predictions before running any analysis P92

✓ [Atlas] Trust scores on knowledge graph edges P91

✓ [Atlas] Hypothesis and experiment dependency graph P90

✓ [Senate] Field-shift detector - auto-report when consensus moves P90

✓ [Senate] Per-agent calibration tracker - Brier scores from frozen predictions P89

✓ [Senate] Evidence versioning and audit trail P88

[Forge] Replication runner - re-run analysis on a held-out data slice done