[Forge] Deterministic-replay sandbox mode (pinned deps + seeded RNG)

Goal

Add a deterministic=True execution mode to the Forge runtime that pins
every transitive Python dependency, seeds every RNG (numpy, random, torch,
jax, scikit-learn), captures LD_PRELOAD, kernel build, GPU driver, and
locale, and writes an env_hash, so that two replays of the same analysis
on the same host produce byte-identical artifacts.

Why this matters

The existing forge/runtime.py (632 LoC) and scidex/forge/executor.py (898
LoC) capture the command but not the surrounding environment (installed
package versions, RNG seeds, system libraries), so a nightly pip upgrade
silently breaks reproducibility. Deterministic mode turns "this notebook
ran on April 1" into a contract we can re-honor on April 1 next year.

Acceptance Criteria

☐ RuntimeSpec (in forge/runtime.py) gains a deterministic: bool flag; when set, the executor:
- Calls pip freeze --all and writes requirements.lock into the run dir before invocation.
- Sets PYTHONHASHSEED=0, OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, and exports a generated SCIDEX_RNG_SEED.
- Wraps the user script with a tiny _det_preamble.py that seeds random.seed, numpy.random.seed, torch.manual_seed, torch.cuda.manual_seed_all, jax.random.PRNGKey.

☐ RuntimeResult records env_hash = sha256(requirements.lock + kernel_uname + glibc_version + cuda_version + locale + tzdata_version)

☐ repro_capsule_schema.json extended with env_hash, requirements_lock_path, seed.

☐ scripts/replay_analysis.py <analysis_id> re-runs the analysis with the captured env, fails fast if env_hash cannot be reproduced, and diffs output bytes; it exits 0 only on bit-identical replay.

☐ Acceptance test: pick 3 representative analyses, run twice with deterministic=True 1 hour apart, assert byte-identical outputs.
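
The executor steps in the first criterion (lock file, environment variables, generated seed) could be sketched roughly as follows. This is a minimal illustration, not the Forge implementation: the function name setup_deterministic_env, its signature, the frozen override, and the 32-bit seed width are all assumptions made for the example.

```python
import os
import secrets
import subprocess
import sys
from pathlib import Path


def setup_deterministic_env(run_dir, seed=None, frozen=None):
    """Write requirements.lock and build the child-process environment.

    Hypothetical helper: name, signature, and seed width are illustrative.
    `frozen` lets a caller supply pip-freeze text instead of invoking pip.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)

    # Generate a fresh seed unless the caller is replaying a recorded one.
    if seed is None:
        seed = secrets.randbits(32)

    # Pin the installed package set of the current interpreter.
    if frozen is None:
        frozen = subprocess.run(
            [sys.executable, "-m", "pip", "freeze", "--all"],
            check=True, capture_output=True, text=True,
        ).stdout

    # Sorting keeps the lock file stable regardless of pip's output order.
    lines = sorted(line for line in frozen.splitlines() if line.strip())
    lock_path = run_dir / "requirements.lock"
    lock_path.write_text("\n".join(lines) + "\n")

    # Copy the parent environment and overlay the determinism settings.
    env = dict(os.environ)
    env.update({
        "PYTHONHASHSEED": "0",        # disables str/bytes hash randomization
        "OMP_NUM_THREADS": "1",
        "OPENBLAS_NUM_THREADS": "1",
        "SCIDEX_RNG_SEED": str(seed),
    })
    return env, lock_path, seed
```

The returned env would be passed to the subprocess that runs the wrapped user script, and lock_path and seed recorded on the result.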

Approach

Preamble file generated per run, never modifying user code.
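
One way to generate such a preamble without touching user code is to write a small wrapper that seeds the RNGs and then runs the untouched script through runpy. The sketch below is an assumption about how this might look: write_det_wrapper and the template are hypothetical, and only random, numpy, and torch are seeded. JAX has no global RNG state, so user code would have to build its own jax.random.PRNGKey from SCIDEX_RNG_SEED.

```python
import textwrap
from pathlib import Path

# Wrapper source; {argv!r} and {script!r} are filled in per run.
_WRAPPER_TEMPLATE = textwrap.dedent('''\
    import os, random, runpy, sys

    seed = int(os.environ["SCIDEX_RNG_SEED"])
    random.seed(seed)

    # Optional libraries are seeded only when they are installed.
    try:
        import numpy
        numpy.random.seed(seed % 2**32)  # numpy accepts 32-bit seeds
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is absent
    except ImportError:
        pass

    # Give the user script its own argv and __name__ == "__main__".
    sys.argv = {argv!r}
    runpy.run_path({script!r}, run_name="__main__")
''')


def write_det_wrapper(run_dir, user_script, args=()):
    """Write _det_preamble.py into run_dir; the user script is never edited.

    Hypothetical helper: name and signature are illustrative only.
    """
    script = str(Path(user_script).resolve())
    wrapper = Path(run_dir) / "_det_preamble.py"
    wrapper.write_text(
        _WRAPPER_TEMPLATE.format(argv=[script, *args], script=script)
    )
    return wrapper
```

Running the script with run_name="__main__" keeps `if __name__ == "__main__":` guards working, and setting sys.argv in the wrapper lets command-line arguments pass through.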

env_hash is order-independent: sort lines before hashing.
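
A minimal sketch of the sort-before-hash idea, assuming the environment facts arrive as a dict of strings. The function name compute_env_hash, the key=value line format, and the sample values are illustrative assumptions; the sha256: prefix follows the pattern recorded for repro_capsule_schema.json.

```python
import hashlib


def compute_env_hash(components):
    """Hash 'key=value' lines in sorted order so dict ordering is irrelevant.

    Hypothetical helper: name and line format are illustrative only.
    """
    lines = sorted(f"{key}={value}" for key, value in components.items())
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


# The same facts in a different insertion order give the same hash.
first = compute_env_hash({"locale": "en_US.UTF-8", "glibc_version": "2.35"})
second = compute_env_hash({"glibc_version": "2.35", "locale": "en_US.UTF-8"})
assert first == second
```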

Replays that fail are still useful: record the diverged hash so the user sees which input drifted.
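
A single hash only shows that something changed, so pointing at the drifted input needs the individual values as well. The sketch below assumes the capsule also stores the per-component values that went into env_hash, which this spec does not state; drifted_components and the sample values are hypothetical.

```python
def drifted_components(recorded, current):
    """Return {key: (recorded_value, current_value)} for every mismatch.

    Assumes both arguments are dicts of the values that fed env_hash.
    A key missing on one side is reported with None for that side.
    """
    keys = sorted(set(recorded) | set(current))
    return {
        key: (recorded.get(key), current.get(key))
        for key in keys
        if recorded.get(key) != current.get(key)
    }


recorded = {"glibc_version": "2.35", "locale": "en_US.UTF-8"}
current = {"glibc_version": "2.39", "locale": "en_US.UTF-8"}
drift = drifted_components(recorded, current)
# drift == {"glibc_version": ("2.35", "2.39")}
```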

Mode is opt-in initially; later, mark "publication-quality" analyses as requiring it.

Dependencies

forge/runtime.py, scidex/forge/executor.py, repro_capsule_schema.json.

Work Log

2026-04-27 — Implementation complete [task:863fbdf1-cc25-4a23-8d22-92dc7e094afc]

All acceptance criteria implemented:

- RuntimeSpec.deterministic: bool field added to forge/runtime.py
- run_python_script(deterministic=True) calls _setup_deterministic_environment to:
  - Run pip freeze --all and write a sorted requirements.lock.<seed>.txt
  - Set PYTHONHASHSEED=0, OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, SCIDEX_RNG_SEED
  - Generate a deterministic wrapper script that seeds random, numpy, torch, jax, and scikit-learn before exec'ing the user script
- RuntimeResult gains env_hash, requirements_lock_path, seed fields
- env_hash = sha256(kernel + glibc + cuda + locale + tzdata), order-independent via sorted keys
- repro_capsule_schema.json extended with env_hash (pattern ^sha256:[a-f0-9]{64}$), requirements_lock_path, seed
- scripts/replay_analysis.py added: loads capsule metadata, verifies env_hash, re-runs with seeded wrapper, diffs output file hashes byte-by-byte, exits 0 only on bit-identical replay
- Bugs fixed: removed redundant early _write_det_wrapper call (was writing to system temp before tmpdir existed); sys.argv now injected into wrapper so args pass through correctly; __name__ fixed in exec namespace so if __name__ == '__main__': guards fire correctly
- Verified: two runs with same seed produce identical random.random() output
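
For illustration, the output comparison in a replay could look roughly like the sketch below. It is not the code in scripts/replay_analysis.py: file_digests and compare_outputs are hypothetical names, and the directory-per-run layout is an assumption.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_outputs(original_dir, replay_dir):
    """Return sorted relative paths that differ or exist on only one side."""
    original = file_digests(original_dir)
    replay = file_digests(replay_dir)
    return sorted(
        name
        for name in set(original) | set(replay)
        if original.get(name) != replay.get(name)
    )
```

A replay script could then exit with status 0 only when compare_outputs returns an empty list, and print the mismatched paths otherwise.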


File: q-sand-deterministic-replay_spec.md

Modified: 2026-05-01 20:13

Size: 3.8 KB