> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data and would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.
Effort: deep
Background
The 2026-05-18 artifact-file recovery session covered figure, notebook,
analysis, and paper_figure types but never attempted file recovery for
three smaller artifact types:
| artifact_type | rows | files on disk |
|---|---|---|
| dataset | 132 | 0 |
| model | 9 | 0 |
| tabular_dataset | 4 | 0 |
145 rows total, zero files recovered. These were skipped because the larger
types ate the recovery window, not because they are known to be
unrecoverable. The storage layout, expected file extensions, and original
creator code paths differ enough between the three that a separate first
pass is needed.
Goal
Inventory whatever files actually exist for dataset, model, and
tabular_dataset across the local SciDEX-Artifacts checkout, all known
recovery worktrees, and the most recent two S3 backup tarballs, then bind
the DB rows to the located files. Produce a residual list of rows where no
file was findable anywhere.
Out of scope
- Regenerating datasets/models from scratch by re-running the original
notebook (separate quest if needed).
- Regenerating tabular datasets from external sources (e.g. re-pulling
GTEx); those should be tracked under versioned_datasets_spec.md.
Acceptance criteria
☐ An inventory artifact exists per type listing: DB rows, expected
basename(s), candidate files found, match confidence, and final
bind status.
☐ At least one of dataset/model/tabular_dataset moves above zero files
on disk; the goal is a real first pass, not zero progress.
☐ Every DB row that ends up bound has metadata.file_sha256 and
metadata.file_size_bytes populated (overlaps with the
file_sha256_backfill_spec; do this inline since you already have
the file in hand).
☐ A residual list names the rows where no file was located anywhere,
with the most plausible failure mode (e.g. "expected .parquet, none on
disk, last referenced 2026-03-15 in analysis SDA-…").
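The inventory and residual list described above could be captured with a small record type. This is only a sketch: the field names, the `exact | ambiguous | none` confidence scale, and the `bound | unbound` status values are assumptions made for illustration, not part of the existing schema.

```python
from __future__ import annotations

import csv
from dataclasses import dataclass, field


@dataclass
class InventoryRow:
    """One DB row's recovery outcome (all field names are illustrative)."""
    artifact_id: str
    artifact_type: str                 # dataset | model | tabular_dataset
    expected_basenames: list[str]
    candidate_files: list[str] = field(default_factory=list)
    match_confidence: str = "none"     # assumed scale: exact | ambiguous | none
    bind_status: str = "unbound"       # assumed values: bound | unbound
    failure_mode: str = ""             # free text, filled in for residual rows


def write_inventory(rows: list[InventoryRow], out) -> None:
    """Write the inventory as CSV to an already-open text file object."""
    writer = csv.writer(out)
    writer.writerow(["artifact_id", "artifact_type", "expected_basenames",
                     "candidate_files", "match_confidence", "bind_status",
                     "failure_mode"])
    for r in rows:
        writer.writerow([r.artifact_id, r.artifact_type,
                         ";".join(r.expected_basenames),
                         ";".join(r.candidate_files),
                         r.match_confidence, r.bind_status, r.failure_mode])


def residual(rows: list[InventoryRow]) -> list[InventoryRow]:
    """Rows for which no candidate file was located anywhere."""
    return [r for r in rows if not r.candidate_files]
```

Keeping the residual list as a filter over the same records means the inventory and the residual cannot drift apart.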
Plan
Query DB rows per type. Read the original creator code for each type
to learn where files were supposed to land:
- dataset: economics_drivers/datasets/cli.py and datasets/ paths in
scidex-artifacts.
- model: model export paths (models/) plus any forge_runs/ joblib dumps.
- tabular_dataset: datasets/tabular/ and CSV under
data/scidex-artifacts/datasets/.
Walk each candidate directory in the canonical artifacts checkout AND in
.orchestra-worktrees/*/data/scidex-artifacts/ for short-lived worktree
spillover (this is how the figure recovery found its missing files).
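A minimal sketch of that walk, building a basename index over the canonical checkout and every worktree copy. It assumes the canonical checkout sits at data/scidex-artifacts under the repository root; the function names and the choice of subdirectories passed in are illustrative.

```python
from __future__ import annotations

import os
from collections import defaultdict
from pathlib import Path


def search_roots(repo_root: Path) -> list[Path]:
    """Canonical artifacts checkout plus each worktree's artifacts directory.

    Assumption: the canonical checkout is <repo_root>/data/scidex-artifacts.
    """
    roots = [repo_root / "data" / "scidex-artifacts"]
    roots += sorted(repo_root.glob(".orchestra-worktrees/*/data/scidex-artifacts"))
    return [r for r in roots if r.is_dir()]


def index_by_basename(roots: list[Path],
                      subdirs: tuple[str, ...]) -> dict[str, list[Path]]:
    """Map each file basename to every path where it was found."""
    index: dict[str, list[Path]] = defaultdict(list)
    for root in roots:
        for sub in subdirs:
            base = root / sub
            if not base.is_dir():
                continue
            for dirpath, _dirnames, filenames in os.walk(base):
                for name in filenames:
                    index[name].append(Path(dirpath) / name)
    return dict(index)
```

For example, `index_by_basename(search_roots(Path(".")), ("datasets", "models", "forge_runs"))` would cover the directories named in the first step. Pass non-overlapping subdirectories (datasets already contains datasets/tabular), otherwise the same file is indexed twice.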
Probe the most recent two S3 backup tarballs at
s3://scidex/backups/${HOST}/... — list the tarball, grep for the
expected basenames, extract only the matches.
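One way to do the list, grep, and extract in a single pass is to read the tarball as a stream and copy out only members whose basename is expected. The sketch below uses Python's standard `tarfile` stream mode; `extract_matches` and the numbered output naming are assumptions for illustration, and how the byte stream is obtained from S3 is left to the caller.

```python
from __future__ import annotations

import os
import shutil
import tarfile
from pathlib import Path
from typing import BinaryIO


def extract_matches(stream: BinaryIO, wanted: set[str],
                    dest: Path) -> list[tuple[str, Path]]:
    """Stream through a tarball, writing out only members whose basename is wanted.

    Returns (member name inside the tarball, extracted path) pairs so the
    provenance of each recovered file is kept.
    """
    dest.mkdir(parents=True, exist_ok=True)
    written: list[tuple[str, Path]] = []
    # "r|*" reads sequentially with transparent compression detection,
    # so the tarball is never unpacked or seeked as a whole.
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            name = os.path.basename(member.name)
            if not member.isfile() or name not in wanted:
                continue
            src = tar.extractfile(member)
            if src is None:
                continue
            # Prefix with a counter so same-named members do not overwrite
            # each other, and never trust the member's own path for output.
            out = dest / f"{len(written):04d}_{name}"
            with src, open(out, "wb") as fh:
                shutil.copyfileobj(src, fh)
            written.append((member.name, out))
    return written
```

The stream can be any binary file object, for instance `sys.stdin.buffer` when the tarball is piped in from a download command, so only matched files ever touch local disk.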
Bind matches via the journaled helper. Compute sha256 + size as you go.
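The hash and size can be computed in one chunked read, which matters for multi-GB datasets. A minimal sketch follows; it only builds the two metadata values named in the acceptance criteria and deliberately does not call the journaled helper, whose interface is not shown in this spec. The function names are illustrative.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest_and_size(path: Path, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def bind_metadata(path: Path) -> dict[str, object]:
    """Metadata fields to record alongside a bind (keys taken from this spec)."""
    sha, size = file_digest_and_size(path)
    return {"file_sha256": sha, "file_size_bytes": size}
```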
Write the residual list to the spec's Work Log.
Risks
- Datasets are larger than figures; pulling many from S3 tarballs can move
tens of GB of intermediate data. Stream-extract only the matched files;
do not unpack whole tarballs.
- model rows may reference frozen weights that were never committed
(RAM-only artifacts). Treat those as "permanently lost" rather than
retrying: use the same style of marker as metadata.image_unavailable,
but set metadata.file_unrecoverable=true with a reason string.
- Resist generic-similarity binds. With only 145 rows, hand-verification
before each unusual bind is cheap and prevents the same silent-rebind
failure mode as the notebook rename map.
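A conservative matching rule consistent with these risks could look like the sketch below: only a single exact-basename hit is treated as bindable, anything else goes to hand-verification or the residual list. The labels `exact`, `ambiguous`, and `none`, and the `file_unrecoverable_reason` key, are hypothetical names chosen for illustration.

```python
from __future__ import annotations


def classify(expected_basenames: list[str],
             index: dict[str, list[str]]) -> tuple[str, list[str]]:
    """Classify a DB row against a basename -> paths index.

    No fuzzy or similarity matching is attempted: a row is 'exact' only
    when its expected basenames resolve to exactly one file.
    """
    hits = sorted({p for name in expected_basenames for p in index.get(name, [])})
    if not hits:
        return "none", []
    if len(hits) == 1:
        return "exact", hits
    return "ambiguous", hits      # needs hand-verification before any bind


def unrecoverable_marker(reason: str) -> dict[str, object]:
    """Metadata for rows judged permanently lost (reason key is hypothetical)."""
    return {"file_unrecoverable": True, "file_unrecoverable_reason": reason}
```

Under this rule, identical copies found in two worktrees still come back as `ambiguous`; comparing their sha256 values by hand is one cheap way to resolve that case.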
Owner
unassigned