[Atlas] Backfill metadata.file_sha256 for all artifacts with a local file

> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data + would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.

Effort: standard

Background

scidex.artifacts.content_hash is a dedup key computed as
sha256(title + canonicalized_metadata). It is NOT a hash of the artifact's
file. The two were conflated for months, including during the 2026-05-18
recovery session. The conceptual correction made it into memory
(project_scidex_artifact_content_hash_semantics.md), but the corresponding
backfill of an actual file hash was never run.
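
To make the distinction concrete, here is a minimal sketch of the two hashes side by side. The canonicalization shown (compact, sorted-key JSON) is an assumption for illustration only; this spec does not say how canonicalized_metadata is actually produced, and the function names are hypothetical.

```python
import hashlib
import json


def dedup_key(title: str, metadata: dict) -> str:
    """Illustrative stand-in for content_hash: sha256(title + canonicalized metadata).

    Assumption: canonicalization is compact JSON with sorted keys. The real
    scheme used by scidex is not described in this spec.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((title + canonical).encode("utf-8")).hexdigest()


def file_hash(data: bytes) -> str:
    """What metadata.file_sha256 is meant to record: sha256 of the file's bytes."""
    return hashlib.sha256(data).hexdigest()
```

Two artifacts with the same title and metadata get the same dedup key even if their files differ, which is why content_hash cannot serve as an integrity check on the file.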

figure_generator already writes metadata.file_sha256 = sha256(file) on
new figures. Older figures, all notebooks pre-2026-04, analyses, datasets,
models, and most paper_figures have no recorded file hash at all. With
~7,700 artifact files now on disk across types, that's a non-trivial
gap for any future integrity verification.

Goal

For every artifact whose file_path resolves to a real file on disk,
compute the file's sha256 and write it to metadata.file_sha256. If the key
already holds the same value, leave it; if it holds a different value
(drift), emit a loud log line and overwrite it. Also record
metadata.file_size_bytes and metadata.file_sha256_computed_at.
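
The per-row decision can be sketched as a pure function that takes the existing metadata and the freshly computed values. The function name and the ISO 8601 UTC timestamp format are assumptions; the spec names the three metadata keys but not their encoding.

```python
from datetime import datetime, timezone


def plan_metadata_update(metadata: dict, new_sha256: str, size_bytes: int):
    """Return (updated_metadata, drift) without mutating the input dict.

    drift is None unless a previous file_sha256 existed and differs from the
    new one; in that case it carries both hashes so the caller can log them.
    """
    old = metadata.get("file_sha256")
    drift = None
    if old is not None and old != new_sha256:
        drift = {"old": old, "new": new_sha256}

    updated = dict(metadata)
    updated["file_sha256"] = new_sha256
    updated["file_size_bytes"] = size_bytes
    # Assumption: the marker is stored as an ISO 8601 UTC timestamp string.
    updated["file_sha256_computed_at"] = datetime.now(timezone.utc).isoformat()
    return updated, drift
```

The caller would add the row id and file mtime to the drift record before logging it, then persist the updated metadata.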

Out of scope

  • Recomputing artifacts.content_hash. It's the dedup key; do not touch.
  • Hashing files that don't exist (the recovery and rebind specs handle
the missing-file problem).
  • Cryptographic signing or transparency log (separate proposal).

Acceptance criteria

☐ Every artifact row where file_path resolves to a real file has
metadata.file_sha256 populated.
☐ When a stale hash already existed and disagrees with the recomputed
one, the discrepancy is logged with row id, old hash, new hash, and
file mtime, and the new hash is written.
☐ metadata.file_size_bytes matches the on-disk size for every
backfilled row.
☐ Backfill is resumable: re-running picks up where the previous run
stopped using the file_sha256_computed_at marker.
☐ A summary names: total rows scanned, rows updated, drift events,
rows skipped (no file), per-artifact-type counts.
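
One possible shape for the summary in the last criterion is a small counter object. The class name, field names, and outcome labels below are hypothetical; only the quantities being counted come from the criteria above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BackfillSummary:
    scanned: int = 0
    updated: int = 0
    drift_events: int = 0
    skipped_no_file: int = 0
    per_type: Counter = field(default_factory=Counter)

    def record(self, artifact_type: str, outcome: str) -> None:
        """Tally one row. outcome: 'updated', 'drift', 'no_file', or 'unchanged'."""
        self.scanned += 1
        self.per_type[artifact_type] += 1
        if outcome in ("updated", "drift"):
            # A drift row is also an update, because the new hash is written.
            self.updated += 1
        if outcome == "drift":
            self.drift_events += 1
        if outcome == "no_file":
            self.skipped_no_file += 1
```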

Plan

  • Query artifacts where file_path IS NOT NULL. Order by
    artifact_type, id so resumes are deterministic.
  • For each row, resolve file_path via scidex.core.paths. If the file
    doesn't exist, skip and record the row in a "no_file" residual list
    (cross-reference with the recovery spec).
  • Stream-hash the file (chunked, don't slurp; some notebooks are 200MB+)
    and write file_sha256, file_size_bytes, file_sha256_computed_at
    via the journaled metadata-update helper.
  • On hash drift, do NOT clobber silently: log the row id and previous
    hash for human review before overwriting. (Drift = the metadata had a
    file_sha256 already and it differs.)
  • Periodically commit so a crash mid-run doesn't waste hours of hashing.
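
The stream-hashing step could look like the following sketch, which reads the file in fixed-size chunks so memory use stays flat regardless of file size. The function name and the 1 MiB chunk size are assumptions, not values taken from the codebase.

```python
import hashlib
from pathlib import Path

CHUNK_BYTES = 1024 * 1024  # Assumption: 1 MiB per read; tune to the storage.


def sha256_file(path: Path, chunk_bytes: int = CHUNK_BYTES) -> tuple[str, int]:
    """Return (hex digest, size in bytes) for the file at path.

    The size is counted from the bytes actually hashed, so the recorded
    file_size_bytes always describes the same content as the digest.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # read() returns b"" at EOF, which ends the loop.
        while chunk := fh.read(chunk_bytes):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Counting the size during the read, rather than calling stat separately, avoids a mismatch if the file changes between the two operations.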

Risks

  • Hashing every file on disk is I/O-heavy. Run off-peak (after 02:00
    local) or rate-limit at e.g. 50 MB/s to avoid contending with the
    fleet's normal read traffic.
  • Drift events are the interesting findings of this backfill: they're
    signals of corruption or a race that overwrote a file without
    re-registering. Do not auto-suppress them.
  • Coordinate with the dataset_model_file_recovery_spec so we don't
    hash the same files twice; whichever runs first should mark
    file_sha256_computed_at and the other will skip.
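
If rate-limiting is chosen over an off-peak window, a simple cumulative throttle is one option. This is a sketch under assumptions: the class name is hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real delays.

```python
import time


class Throttle:
    """Sleep as needed so cumulative throughput stays at or below a cap."""

    def __init__(self, max_bytes_per_sec: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_bytes_per_sec = max_bytes_per_sec
        self._clock = clock
        self._sleep = sleep
        self._start = clock()
        self._bytes = 0

    def consume(self, n: int) -> float:
        """Account for n bytes just read; return the seconds slept."""
        self._bytes += n
        # Minimum elapsed time at which this many bytes is within the cap.
        earliest = self._bytes / self.max_bytes_per_sec
        elapsed = self._clock() - self._start
        delay = max(0.0, earliest - elapsed)
        if delay:
            self._sleep(delay)
        return delay
```

A hashing loop would call `consume(len(chunk))` after each read; for the 50 MB/s figure mentioned above, the cap would be `50 * 1000 * 1000` bytes per second.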

Owner

unassigned

File: 2026-05-18_artifact_file_sha256_backfill_spec.md
Modified: 2026-05-19 20:53
Size: 4.0 KB