[Forge] End-to-end MD-simulation analysis — input → OpenMM run → motion modes → artifact


Effort: extensive

Goal

Compose the molecular-dynamics skill (OpenMM + MDAnalysis) into a
production MD pipeline: given a structure (PDB id, AlphaFold
prediction, or docking-output complex), prepare the system with
appropriate force fields, run a configurable production simulation
(default 100 ns), analyze the trajectory for RMSF, dynamic
cross-correlation, and principal-component motion modes, and persist
all outputs as a versioned MD-run artifact. Used by Theorist to ground
"protein dynamics" claims in actual simulation, not literature
hand-waving.

Why this matters

The molecular-dynamics skill exists (.claude/skills/molecular-dynamics)
but no Forge tool composes it into a hypothesis-grounding workflow.
A debate over "this loop is conformationally flexible and that's why
the drug doesn't bind well" needs MD evidence to settle; today the
Skeptic has no way to demand it. A reproducible MD pipeline turns
this from a literature-citation contest into an experimental-evidence
exchange — a much higher epistemic bar.

Acceptance Criteria

☐ New module scidex/forge/md_pipeline.py (≤900 LoC):
  - prepare_system(structure, force_field='amber14',
    water_model='tip3p', ions='0.15 M NaCl'): uses OpenMM's
    Modeller to add hydrogens, solvate, and neutralize; returns
    system, topology, positions.
  - run_production(system, topology, positions, ns=100,
    timestep_fs=2.0): runs production MD with a Langevin
    integrator at 310 K; periodic checkpointing so partial
    runs are recoverable.
  - analyze_trajectory(traj_path): uses MDAnalysis to compute
    RMSF per residue, dynamic cross-correlation matrix, PCA on
    Cα coordinates yielding top-5 motion modes, hydrogen-bond
    occupancy, secondary-structure flips.
  - pipeline(structure_or_pdb_id, ns=100): composes the steps
    above; commits outputs to
    data/scidex-artifacts/md/<structure>/<run_id>/
    including the trajectory (xtc), the analysis JSON, and
    SVG plots for each metric.
☐ Migration md_run(run_id PRIMARY KEY, structure_source,
  pdb_id_or_artifact_id, force_field, water_model, ions, length_ns,
  timestep_fs, n_atoms, runtime_seconds, mean_rmsf, top_pca_variance,
  hardware_profile, started_at, finished_at, artifact_id).
☐ tools.py registers md_simulation_pipeline(pdb_id, ns=100)
  with @log_tool_call; default ns is 100; CPU runs hard-cap at
  10 ns to prevent runaway.
☐ /artifacts/<id> renders RMSF plot, top-3 PCA motion-mode
  animations (GIF), and a residue-flexibility heatmap.
☐ Skeptic persona prompt gains an md_evidence block when the
hypothesis names a protein with a completed MD run (mean RMSF +
top-PCA variance summary).
☐ Acceptance: python -m scidex.forge.md_pipeline --pdb 1AKI --ns 10
  (lysozyme, fast control) completes in <30 min on the build host;
  mean RMSF falls in the expected range (1-3 Å for the loop region);
  artifact is registered.
☐ Resource governance: get-available-resources queried before
run; refuses if estimated runtime > 24 h or RAM < 8 GB.
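
The trajectory metrics named in the criteria above (per-residue RMSF, dynamic cross-correlation, PCA on Cα coordinates) can be sketched with plain NumPy. This is a minimal illustration, not the md_pipeline implementation: it assumes the Cα coordinates have already been extracted and aligned to a reference as an array of shape (n_frames, n_residues, 3), and the function name ca_metrics and the synthetic input are hypothetical.

```python
import numpy as np


def ca_metrics(coords, n_modes=5):
    """Compute RMSF, DCCM and PCA modes from aligned C-alpha coordinates.

    coords: array of shape (n_frames, n_res, 3), already superposed on a
    reference (alignment is assumed to have happened upstream).
    """
    coords = np.asarray(coords, dtype=float)
    n_frames, n_res, _ = coords.shape
    disp = coords - coords.mean(axis=0)  # displacement from the mean structure

    # RMSF_i = sqrt(< |dr_i|^2 >), averaged over frames
    rmsf = np.sqrt((disp ** 2).sum(axis=2).mean(axis=0))

    # DCCM: C_ij = <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>), using all of x, y, z.
    # A residue with zero fluctuation would divide by zero; not handled here.
    cov = np.einsum("tix,tjx->ij", disp, disp) / n_frames
    norm = np.sqrt(np.diag(cov))
    dccm = cov / np.outer(norm, norm)

    # PCA: eigen-decomposition of the 3N x 3N coordinate covariance matrix
    flat = disp.reshape(n_frames, n_res * 3)
    cov3n = flat.T @ flat / (n_frames - 1)
    evals, evecs = np.linalg.eigh(cov3n)          # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_modes]     # keep the largest n_modes
    explained = evals[order] / evals.sum()
    modes = evecs[:, order].T.reshape(n_modes, n_res, 3)

    return {"rmsf": rmsf, "dccm": dccm, "pca_variance": explained, "modes": modes}


# Synthetic stand-in for a trajectory: 200 frames, 10 residues (hypothetical data)
rng = np.random.default_rng(0)
result = ca_metrics(rng.normal(size=(200, 10, 3)))
```

The returned pca_variance holds the fraction of total variance carried by each retained mode, which is the kind of quantity the top_pca_variance column would summarise.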

Approach

  • OpenMM force-field selection follows the standard Amber14 + TIP3P
    recipe for proteins; small-molecule ligands get openff parameters
    via openmmforcefields.
  • Production runs write trajectory in xtc (compressed) every 10 ps;
    checkpoint state every 1 ns.
  • Analysis is on-demand and idempotent: re-running on the same
    trajectory regenerates plots without rerunning MD.
  • Hardware profiling captured to enable cross-host reproducibility.
  • Skeptic-prompt injection mirrors GTEx pattern in
    scidex/senate/personas/skeptic.py (or wherever skeptic prompts
    are assembled).
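
The output cadence above translates into integrator step counts once the timestep is fixed. A minimal sketch of that arithmetic, assuming the default 2 fs timestep from the acceptance criteria; the helper name reporter_intervals is hypothetical and is not taken from md_pipeline.py.

```python
def reporter_intervals(ns=100.0, timestep_fs=2.0, frame_ps=10.0, checkpoint_ns=1.0):
    """Convert run length and output cadence into integrator step counts."""
    if timestep_fs <= 0:
        raise ValueError("timestep_fs must be positive")
    steps_per_ps = 1000.0 / timestep_fs                  # 1 ps = 1000 fs
    total_steps = round(ns * 1000.0 * steps_per_ps)      # 1 ns = 1000 ps
    frame_interval = round(frame_ps * steps_per_ps)
    checkpoint_interval = round(checkpoint_ns * 1000.0 * steps_per_ps)
    return {
        "total_steps": total_steps,
        "frame_interval": frame_interval,
        "checkpoint_interval": checkpoint_interval,
        "n_frames": total_steps // frame_interval,
    }


plan = reporter_intervals()
```

With the defaults this gives 50,000,000 steps, a frame every 5,000 steps (10,000 frames in total), and a checkpoint every 500,000 steps. These values would be the report intervals handed to the trajectory and checkpoint reporters.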

Dependencies

  • molecular-dynamics skill (OpenMM + MDAnalysis).
  • get-available-resources skill.
  • data/scidex-artifacts/ submodule.
  • q-tool-drug-docking-workflow: produces complexes that benefit
    from MD validation.

Work Log

2026-04-27: Implemented (commit a821d0177)

  • scidex/forge/md_pipeline.py (592 LoC):
    - prepare_system(structure, force_field, water_model, ions): resolves PDB ID / AlphaFold / local file via _resolve_structure, fixes PDB with PDBFixer, builds solvated system with OpenMM Modeller, serialises to XML.
    - run_production(prepared, ns, timestep_fs): LangevinMiddleIntegrator at 310 K, GPU platform auto-detection (CUDA/OpenCL/Metal), DCDReporter (10 ps frames), CheckpointReporter (1 ns), energy minimisation before MD. CPU capped at 10 ns.
    - analyze_trajectory(traj_path, topology_path, run_id): MDAnalysis: RMSF per residue, PCA variance components, vectorised DCCM on Cα z-coordinates, H-bond occupancy, DSSP secondary structure. Plots: RMSF bar chart, PCA variance bar, DCCM heatmap as SVG.
    - pipeline(structure_or_pdb_id, ns): full pipeline with resource governance, artifact registration, DB write.
    - get_recent_md_run(pdb_id_or_gene): DB lookup for Skeptic injection.
  • migrations/add_md_run_table.py: md_run table with run_id PK, indexes on pdb_id_or_artifact_id, finished_at, artifact_id.
  • scidex/forge/tools.py: md_simulation_pipeline(pdb_id, ns=100) registered with @require_preregistration @log_tool_call.
  • agent.py: _md_evidence_block injected into Skeptic prompt alongside _tissue_table_block; mirrors GTEx/dossier pattern, queries get_recent_md_run for top 2 genes.
  • Resource governance: _check_resources(ns) queries get-available-resources skill, refuses if RAM < 8 GB or estimated runtime > 24 h.

Not yet done: /artifacts/<id> rendering (RMSF plot, PCA GIF animations, flexibility heatmap). This requires api.py changes for the artifact viewer and is tracked as follow-on.
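
The governance rules recorded in this log (refuse when RAM < 8 GB or the estimated runtime exceeds 24 h; cap CPU runs at 10 ns) can be sketched as a pure function. This is an illustration under stated assumptions, not the body of _check_resources: the function name, the ns_per_day throughput input, the cap-before-estimate ordering, and the return shape are all hypothetical, and the sketch does not call the get-available-resources skill.

```python
def check_md_request(ns, ram_gb, ns_per_day, on_gpu,
                     min_ram_gb=8.0, max_hours=24.0, cpu_cap_ns=10.0):
    """Return (allowed, effective_ns, reason) for a requested MD run.

    ns_per_day is an assumed throughput estimate for the host; how the
    real pipeline estimates runtime is not stated in the spec.
    """
    if ns <= 0 or ns_per_day <= 0:
        raise ValueError("ns and ns_per_day must be positive")

    # CPU runs are shortened to the cap rather than refused
    # (assumed ordering: apply the cap first, then estimate runtime).
    effective_ns = ns if on_gpu else min(ns, cpu_cap_ns)

    if ram_gb < min_ram_gb:
        return False, effective_ns, f"RAM {ram_gb:g} GB < {min_ram_gb:g} GB"

    est_hours = effective_ns / ns_per_day * 24.0
    if est_hours > max_hours:
        return False, effective_ns, (
            f"estimated runtime {est_hours:.1f} h > {max_hours:g} h"
        )
    return True, effective_ns, "ok"


# Hypothetical hosts: a GPU box at 200 ns/day and a CPU box at 20 ns/day
gpu_decision = check_md_request(100, ram_gb=32, ns_per_day=200, on_gpu=True)
cpu_decision = check_md_request(100, ram_gb=32, ns_per_day=20, on_gpu=False)
```

Keeping the decision separate from the resource query makes the thresholds easy to unit-test without access to real hardware.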

Tasks using this spec (1)
[Forge] End-to-end MD-simulation analysis - input to OpenMM
Forge done P88
File: q-tool-md-simulation-analysis_spec.md
Modified: 2026-05-01 20:13
Size: 5.8 KB