[Atlas] First-pass file recovery for dataset / model / tabular_dataset artifacts


> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data and would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.

Effort: deep

Background

The 2026-05-18 artifact-file recovery session covered figure, notebook,
analysis, and paper_figure types but never attempted file recovery for
three smaller artifact types:

| artifact_type | rows | files on disk |
|---|---|---|
| `dataset` | 132 | 0 |
| `model` | 9 | 0 |
| `tabular_dataset` | 4 | 0 |

145 rows total, zero files recovered. These types were skipped because the
larger types consumed the recovery window, not because they are known to be
unrecoverable. The storage layout, expected file extensions, and original
creator code paths differ enough among the three that a separate first pass
is needed.

Goal

Inventory the files that actually exist for dataset, model, and
tabular_dataset across the local SciDEX-Artifacts checkout, all known
recovery worktrees, and the two most recent S3 backup tarballs, then bind
each DB row to its located file. Produce a residual list of rows for which
no file could be found anywhere.

Out of scope

- Regenerating datasets/models from scratch by re-running the original
  notebook (separate quest if needed).
- Regenerating tabular datasets from external sources (e.g. re-pulling
  GTEx); those should be tracked under versioned_datasets_spec.md.

Acceptance criteria

☐ An inventory artifact exists per type listing: DB rows, expected
  basename(s), candidate files found, match confidence, and final
  bind status.

☐ At least one of dataset/model/tabular_dataset moves above zero files
  on disk; the goal is a real first pass, not zero progress.

☐ Every DB row that ends up bound has metadata.file_sha256 and
  metadata.file_size_bytes populated (overlaps with the
  file_sha256_backfill_spec; do this inline since you already have
  the file in hand).

☐ A residual list names the rows where no file was located anywhere,
  with the most plausible failure mode (e.g. "expected .parquet,
  none on disk, last referenced 2026-03-15 in analysis SDA-…").
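The per-type inventory artifact could be kept as one JSON line per DB row. The sketch below is only an illustration of the fields named in the first criterion; the class name, the `artifact_id` field, and the status strings are assumptions, not an existing SciDEX schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InventoryRow:
    """One inventory entry per DB row (field names are hypothetical)."""
    artifact_id: str                      # assumed DB row identifier
    artifact_type: str                    # "dataset", "model", or "tabular_dataset"
    expected_basenames: list = field(default_factory=list)
    candidate_files: list = field(default_factory=list)
    match_confidence: str = "none"        # e.g. "exact", "ambiguous", "none"
    bind_status: str = "unbound"          # e.g. "bound", "unbound", "unrecoverable"


def to_jsonl(rows):
    """Serialize inventory rows as JSON Lines, one row per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in rows)
```

Keeping the inventory as JSON Lines makes it easy to diff between passes and to filter the unbound rows into the residual list.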

Plan

Query DB rows per type. Read the original creator code for each type
to learn where files were supposed to land:
- dataset: economics_drivers/datasets/cli.py and datasets/ paths in
  scidex-artifacts.
- model: model export paths (models/) plus any forge_runs/ joblib dumps.
- tabular_dataset: datasets/tabular/ and CSV under
  data/scidex-artifacts/datasets/.

Walk each candidate directory in the canonical artifacts checkout AND in
.orchestra-worktrees/*/data/scidex-artifacts/ for short-lived worktree
spillover (this is how the figure recovery found its missing files).
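A minimal sketch of that walk, indexing every file by basename so DB rows can be looked up later. It assumes the canonical checkout lives at `data/scidex-artifacts/` under the repo root and that `forge_runs/` sits inside the artifacts tree; both are guesses to confirm against the creator code. The function and constant names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Subdirectories named in the step above; their placement relative to the
# artifacts root is an assumption.
CANDIDATE_SUBDIRS = {
    "dataset": ["datasets"],
    "model": ["models", "forge_runs"],
    "tabular_dataset": ["datasets/tabular", "datasets"],
}


def artifact_roots(repo_root):
    """Return the canonical checkout plus any worktree copies that exist."""
    repo_root = Path(repo_root)
    roots = [repo_root / "data" / "scidex-artifacts"]
    roots += sorted(repo_root.glob(".orchestra-worktrees/*/data/scidex-artifacts"))
    return [r for r in roots if r.is_dir()]


def index_by_basename(roots, subdirs):
    """Map basename -> list of paths found under the given subdirectories."""
    index = defaultdict(list)
    seen = set()  # overlapping subdirs would otherwise list a file twice
    for root in roots:
        for sub in subdirs:
            base = root / sub
            if not base.is_dir():
                continue
            for path in base.rglob("*"):
                if path.is_file() and path not in seen:
                    seen.add(path)
                    index[path.name].append(path)
    return dict(index)
```

Looking up each row's expected basename(s) in the returned index yields the candidate files for the inventory.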

Probe the two most recent S3 backup tarballs at
s3://scidex/backups/${HOST}/...: list each tarball, grep for the
expected basenames, and extract only the matches.
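One way to do this in a single pass is Python's `tarfile` in streaming mode, which reads the archive forward-only and never unpacks unmatched members. This is a sketch under assumptions: `extract_matching` is a hypothetical helper, and the caller supplies the tarball as any readable binary stream.

```python
import os
import shutil
import tarfile


def extract_matching(fileobj, wanted_basenames, dest_dir):
    """Stream a tarball once, writing only members whose basename is wanted.

    Returns a list of (member_name, output_path) pairs.
    """
    wanted = set(wanted_basenames)
    extracted = []
    os.makedirs(dest_dir, exist_ok=True)
    # "r|*" opens a forward-only stream and detects compression automatically.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            name = os.path.basename(member.name)
            if not member.isfile() or name not in wanted:
                continue
            src = tar.extractfile(member)
            # Write under a flattened, numbered name so that same-named
            # members do not overwrite each other and archive paths are
            # never trusted for the output location.
            out_path = os.path.join(dest_dir, f"{len(extracted):04d}_{name}")
            with open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append((member.name, out_path))
    return extracted
```

The stream can be, for instance, `sys.stdin.buffer` when the tarball is piped in from the S3 download, so the full archive never has to land on local disk.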

Bind matches via the journaled helper. Compute sha256 + size as you go.
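The hash and size can be computed in one chunked read so large datasets are never loaded into memory. A minimal sketch: the function name is hypothetical, and the returned keys mirror the `metadata.file_sha256` and `metadata.file_size_bytes` fields from the acceptance criteria. How the result is passed to the journaled helper is not shown, since its interface is not specified here.

```python
import hashlib


def file_digest_metadata(path, chunk_size=1024 * 1024):
    """Return sha256 and byte size for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"file_sha256": digest.hexdigest(), "file_size_bytes": size}
```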

Write the residual list to the spec's Work Log.
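Residual entries and the unrecoverable marker could be produced by two small helpers. This is a sketch only: `file_unrecoverable` is the flag named under Risks, while the `file_unrecoverable_reason` key, the function names, and the line format are assumptions.

```python
def mark_unrecoverable(metadata, reason):
    """Return a copy of the row metadata flagged as permanently lost."""
    updated = dict(metadata)
    updated["file_unrecoverable"] = True
    # Hypothetical key name for the reason string.
    updated["file_unrecoverable_reason"] = reason
    return updated


def residual_line(artifact_id, artifact_type, failure_mode):
    """Format one Work Log entry for a row with no file located anywhere."""
    return f"- {artifact_id} ({artifact_type}): {failure_mode}"
```

Returning a copy rather than mutating in place keeps the original metadata available for the journal entry.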

Risks

- Datasets are larger than figures; pulling many from S3 tarballs can move
  tens of GB of intermediate data. Stream-extract only the matched files;
  do not unpack whole tarballs.
- model rows may reference frozen weights that were never committed
  (RAM-only artifacts). Treat those as "permanently lost" rather than
  retrying: use the same style of marker as metadata.image_unavailable,
  but set metadata.file_unrecoverable=true with a reason string.
- Resist generic-similarity binds. With only 145 rows, hand-verification
  before each unusual bind is cheap and prevents the same silent-rebind
  failure mode as the notebook rename map.
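To keep similarity-based binds out of the automatic path, candidates can be classified so that only a single exact basename match binds without review. This is an illustrative sketch; the function name and the confidence labels are assumptions.

```python
import os


def classify_candidates(expected_basenames, candidate_paths):
    """Return (confidence, paths); only "exact" should bind automatically."""
    expected = set(expected_basenames)
    exact = [p for p in candidate_paths if os.path.basename(p) in expected]
    if len(exact) == 1:
        return "exact", exact
    if len(exact) > 1:
        # Several files share the expected name: a human picks one.
        return "ambiguous", exact
    if candidate_paths:
        # Only look-alike names were found: never bind without review.
        return "needs_review", list(candidate_paths)
    return "none", []
```

Anything other than `"exact"` goes to hand verification or the residual list rather than being bound.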

Owner

unassigned

File: 2026-05-18_dataset_model_file_recovery_spec.md

Modified: 2026-05-19 20:53

Size: 4.3 KB