[Forge] Data-version pinning + reproducibility manifest per analysis (done)

← Real Data Pipeline
Capture Allen/GTEx/DepMap/ClinVar/Census/PubMed dataset versions per run via ContextVar; extend repro capsule schema.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717) 2026-04-27
[Forge] Data-version pinning + reproducibility manifest per analysis [task:b9f30892-5399-4fb9-a390-498ebc3ee8be] (#683) 2026-04-27

Spec File

Goal

Every analysis run captures a manifest pinning the version of every external
dataset it touched (Allen Brain Atlas release, GTEx v10, DepMap 23Q4, ClinVar
weekly snapshot, CELLxGENE Census LTS, PubMed cutoff date). The manifest
ships as part of the analysis artifact bundle so a re-execution on a future
date can either reproduce bit-identically or report which inputs drifted.

Why this matters

Reproducibility is the difference between a credible scientific platform and
a vibes-based generator. Today scidex/forge/runtime_capture.py snapshots the
Python environment but not the data versions, so a re-run six months later
silently uses newer Allen / GTEx / DepMap data and may produce different
conclusions without any visible warning.

Acceptance Criteria

☐ New module scidex/forge/data_versions.py exposes
current_versions() returning a dict for each integrated source:
{"allen_atlas": {"release": "...", "fetched_at": "..."}, ...}.
☐ Each external-data tool wrapper (allen_brain_expression,
gtex_tissue_expression, cellxgene_gene_expression,
clinvar_variants, depmap helper, pubmed_search) calls
data_versions.record(source, version, scope) before returning.
☐ runtime_capture.capture_run() in forge/runtime_capture.py writes
the per-run manifest into the existing repro capsule
(see forge/example_capsule_manifest.json for shape).
☐ repro_capsule_schema.json extended with a data_versions field;
capsule_validator.py enforces it.
☐ analyses/<id>/repro.json contains the manifest; /analyses/<id>
page renders it as a collapsible block.
☐ Re-execution check: scripts/repro_check.py <analysis_id> reports
IDENTICAL, DRIFT(<source>: <old>→<new>), or MISSING per source.
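
The re-execution check in the last criterion can be sketched as a comparison between the releases stored in a manifest and the releases considered current. This is a minimal illustration, not the contents of scripts/repro_check.py: the function names compare_versions and exit_status are hypothetical, and treating a source that is absent on either side as MISSING is an assumption. The exit-code convention (0 when everything matches, 1 otherwise) follows the Work Log.

```python
def compare_versions(recorded: dict, current: dict) -> dict:
    """Classify each data source as IDENTICAL, DRIFT(...), or MISSING.

    Both arguments map source name -> {"release": "...", ...}.
    Assumption: a source absent from either side is reported as MISSING.
    """
    report = {}
    for source in sorted(set(recorded) | set(current)):
        old = recorded.get(source, {}).get("release")
        new = current.get(source, {}).get("release")
        if old is None or new is None:
            report[source] = "MISSING"
        elif old == new:
            report[source] = "IDENTICAL"
        else:
            report[source] = f"DRIFT({source}: {old}→{new})"
    return report


def exit_status(report: dict) -> int:
    """Return 0 when every source is IDENTICAL, 1 on any drift or gap."""
    return 0 if all(v == "IDENTICAL" for v in report.values()) else 1


if __name__ == "__main__":
    # Illustrative values only; the release strings are taken from this spec.
    recorded = {"gtex": {"release": "gtex_v8"}, "depmap": {"release": "23Q4"}}
    current = {"gtex": {"release": "gtex_v8"}, "depmap": {"release": "24Q2"}}
    result = compare_versions(recorded, current)
    for source, status in result.items():
        print(f"{source}: {status}")
    raise SystemExit(exit_status(result))
```

Comparing only the release field keeps the check independent of fetched_at, which will differ between runs even when the underlying data did not change.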

Approach

  • Lightweight ContextVar collects per-call versions during a run.
  • Capsule serializer flushes the ContextVar at end-of-run.
  • Each tool wrapper calls record() after a successful API hit; cached
    reads also record their cached version, not "live".
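
The approach above can be sketched with the standard-library contextvars module. The function names record, current_versions, and reset_versions come from the Work Log; everything else here is an assumption for illustration, including the _versions variable, the ISO-8601 UTC format for fetched_at, and the way scopes are accumulated. The real scidex/forge/data_versions.py may differ.

```python
from contextvars import ContextVar
from datetime import datetime, timezone
from typing import Optional

# Hypothetical per-run store. The default is None rather than {} so that
# separate contexts never share one mutable dict by accident.
_versions: ContextVar = ContextVar("data_versions", default=None)


def reset_versions() -> None:
    """Start a fresh, empty store for the current run."""
    _versions.set({})


def record(source: str, version: str, scope: Optional[str] = None) -> None:
    """Note that `source` was read at `version`, optionally for `scope`."""
    store = _versions.get()
    if store is None:
        store = {}
        _versions.set(store)
    entry = store.setdefault(
        source,
        {
            "release": version,
            # Assumed format: timestamp of the first recorded access, in UTC.
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "scopes": [],
        },
    )
    entry["release"] = version
    if scope is not None and scope not in entry["scopes"]:
        entry["scopes"].append(scope)


def current_versions() -> dict:
    """Return a copy of everything recorded so far in this run."""
    store = _versions.get() or {}
    return {s: {**e, "scopes": list(e["scopes"])} for s, e in store.items()}


if __name__ == "__main__":
    reset_versions()
    record("gtex", "gtex_v8", scope="TP53")   # e.g. after a live API hit
    record("depmap", "24Q2", scope="TP53")    # e.g. after a cache hit
    print(current_versions())
```

A ContextVar rather than a module-level dict means concurrent runs in separate threads or asyncio tasks each see their own store, provided each run calls reset_versions() (or first calls record()) inside its own context.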

    Dependencies

    • forge/runtime_capture.py already exists.
    • repro_capsule_schema.json, capsule_validator.py.

    Work Log

    2026-04-27 — Implementation complete

    All acceptance criteria delivered:

    • scidex/forge/data_versions.py (new): ContextVar-based per-run version
    store with record(source, version, scope), current_versions(), and
    reset_versions(). Defines canonical release constants:
    GTEX_RELEASE="gtex_v8", CENSUS_RELEASE="2024-07-01",
    DEPMAP_RELEASE="24Q2", ALLEN_API_VERSION="v2", CLINVAR_CADENCE="weekly",
    PUBMED_CADENCE="live".

    • scidex/forge/tools.py: Wired record() calls into allen_brain_expression
    (after ISH data fetched), gtex_tissue_expression (after expression query),
    clinvar_variants (after summary fetch), pubmed_search (after results built),
    and cellxgene_gene_expression (using census_version from result). Import added
    at module top.

    • scidex/forge/depmap_client.py: Wired record() into
    dependency_for_gene() on both cache hit and API success paths. Import guarded
    with try/except for graceful degradation.

    • forge/runtime_capture.py: Added capture_run(analysis_id, title,
    git_commit, seed) which calls capture_environment_bundle() +
    current_versions() and returns a complete repro capsule manifest dict
    (including data_versions block) matching the schema shape.

    • scidex/forge/repro_capsule_schema.json: Extended with data_versions
    object property (each entry: release + fetched_at required, scopes array
    optional).

    • scidex/forge/capsule_validator.py: Added _validate_data_versions()
    helper and wired it into validate_manifest().

    • scripts/repro_check.py (new): CLI tool that reads analyses/<id>/repro.json
    and reports IDENTICAL, DRIFT(old→new), or MISSING per data source vs.
    current canonical constants. Exit 0 = all OK, exit 1 = drift/missing found.
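
The data_versions rule described in the schema item above (release and fetched_at required per entry, scopes an optional array) could be checked roughly as follows. This is a sketch under assumptions, not the actual _validate_data_versions() helper: the function name, the error-message wording, the list-of-strings return type, and treating a missing data_versions block as an error are all hypothetical.

```python
def validate_data_versions(manifest: dict) -> list:
    """Return a list of error strings; an empty list means the block is valid."""
    block = manifest.get("data_versions")
    if not isinstance(block, dict):
        # Assumption: the block itself is mandatory once the schema is extended.
        return ["data_versions must be an object"]

    errors = []
    for source, entry in block.items():
        if not isinstance(entry, dict):
            errors.append(f"{source}: entry must be an object")
            continue
        # Required fields must be present and be non-empty strings.
        for key in ("release", "fetched_at"):
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                errors.append(f"{source}: missing required field '{key}'")
        # Optional field: only checked when present.
        scopes = entry.get("scopes")
        if scopes is not None and not isinstance(scopes, list):
            errors.append(f"{source}: 'scopes' must be an array")
    return errors


if __name__ == "__main__":
    manifest = {
        "data_versions": {
            "gtex": {"release": "gtex_v8", "fetched_at": "2026-04-27T00:00:00+00:00"},
            "depmap": {"release": "24Q2"},  # fetched_at deliberately omitted
        }
    }
    for problem in validate_data_versions(manifest):
        print(problem)
```

Collecting all errors instead of raising on the first one lets a single validation pass report every malformed source.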

    Note: writing analyses/<id>/repro.json and rendering the collapsible block
    on /analyses/<id> depend on the executor layer generating and storing the
    manifest after each run; that wiring is in the analysis execution pipeline,
    not in this module layer. The capture_run() function is the callable that
    produces the manifest; callers are responsible for persisting it.
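
Since persistence is left to the caller, one way an executor might store the manifest is sketched below. The helper persist_manifest and its root parameter are hypothetical and not part of the delivered modules; the only detail taken from the spec is the analyses/<id>/repro.json location. The manifest argument stands in for whatever capture_run() returns.

```python
import json
from pathlib import Path


def persist_manifest(manifest: dict, analysis_id: str,
                     root: Path = Path("analyses")) -> Path:
    """Write the manifest to <root>/<analysis_id>/repro.json and return the path."""
    out_dir = root / analysis_id
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "repro.json"

    # Write to a temporary sibling first, then rename over the target, so a
    # crash mid-write does not leave a truncated repro.json behind.
    tmp = target.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(target)
    return target


if __name__ == "__main__":
    # Placeholder manifest; a real caller would pass the dict from capture_run().
    example = {"data_versions": {"gtex": {"release": "gtex_v8",
                                          "fetched_at": "2026-04-27T00:00:00+00:00"}}}
    print(persist_manifest(example, "example-analysis"))
```

Sorting keys gives a stable file layout, which keeps diffs between two runs of the same analysis limited to the values that actually changed.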
