[Atlas] Rebind notebook artifacts whose storage basename changed


> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data and would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.

Effort: deep

Background

The 2026-05-18 artifact-file recovery session found that 738 of 1,017
notebook artifacts had a file on disk. Of the roughly 280 rows without a
file, the inventory showed:

  • 357 DB basenames have no file anywhere across 600+ unique storage basenames.
  • 231 storage files have no DB row pointing at them.

This is too lopsided to be pure data loss — it looks like a rename pattern.
Example pairs observed during the recovery session:

DB row file_path           : nb-sda-2026-04-01-gap-001.ipynb
Orphan file in storage     : SDA-2026-04-01-gap-001.ipynb

i.e. an early casing/prefix convention (nb-sda-…) was replaced by the
SDA-id convention without updating the corresponding DB rows. A mechanical
exact-basename match misses these; a content-hash or prefix-strip pass would
recover most of them.
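
The prefix-strip idea can be illustrated with a small comparison key. This is a minimal sketch, not the recovery session's tooling: the function name normalize_basename and the LEGACY_PREFIXES tuple are hypothetical, and only the "nb-" prefix from the example pair above is assumed.

```python
from pathlib import PurePosixPath

# Assumption: "nb-" is the only legacy prefix; it is the one seen in the
# example pair above. Extend the tuple if the inventory shows others.
LEGACY_PREFIXES = ("nb-",)


def normalize_basename(basename: str) -> str:
    """Return a comparison key: case-folded stem with a legacy prefix removed."""
    stem = PurePosixPath(basename).stem.casefold()
    for prefix in LEGACY_PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    return stem


db_name = "nb-sda-2026-04-01-gap-001.ipynb"
storage_name = "SDA-2026-04-01-gap-001.ipynb"
# Both names collapse to the same key, so the pair becomes a rename candidate.
assert normalize_basename(db_name) == normalize_basename(storage_name)
```

A key like this only proposes candidates; whether a candidate is safe to rebind is a separate decision.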

Goal

Build a rename map by matching the 357 DB-orphan basenames against the 231
storage-orphan basenames using (a) prefix-strip normalization, (b) trailing-
suffix tolerance, and (c) content-similarity as a tiebreaker. For each
confirmed match, update the DB row's file_path to the surviving storage
basename, then re-verify file-on-disk presence.
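
The re-verification step might look like the following sketch. It assumes the notebook directory has already been resolved to a pathlib.Path (the spec resolves it via scidex.core.paths, which is not imported here); verify_on_disk is a hypothetical helper name.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def verify_on_disk(notebook_dir: Path, basenames: Iterable[str]) -> tuple[list[str], list[str]]:
    """Split basenames into (present, missing) by checking for a regular file."""
    present: list[str] = []
    missing: list[str] = []
    for name in basenames:
        if (notebook_dir / name).is_file():
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Running it before and after the rebind pass over the same set of DB rows gives the before/after on-disk counts directly.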

Out of scope

  • Rebinding rows whose file is genuinely lost (no storage candidate matches
even after normalization). Those stay missing; a separate cleanup spec
can decide deletion vs deprecation.
  • The 9 stale notebook artifacts on the pre-migration legacy schema (deferred
from the SQLite retirement on 2026-04-28 — different problem).

Acceptance criteria

☐ A notebook_rename_map.json exists in the recovery output dir
enumerating every (db_basename → storage_basename) pair with the
match-confidence label (prefix_strip / casing_only / suffix_only /
content_hash / manual_review).
☐ Every prefix_strip / casing_only match has been applied; DB row
file_path now points at a real file on disk.
☐ content_hash matches require a sha256 file-content match against
whatever artifact the DB row was originally generated against; no
blind content-similarity rebinds without that anchor.
☐ Notebook-on-disk count rises from 738 to ≥ 900 (target: pull most of
the 231 storage orphans into rows).
☐ A residual report names the rows still missing post-pass and their
probable cause (no storage candidate / ambiguous match / etc.).
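
One possible shape for notebook_rename_map.json is sketched below. The confidence labels come from the criteria above; the field names, the RenameEntry class, and the AUTO_APPLY set are assumptions made for illustration, not a schema this spec defines.

```python
import json
from dataclasses import asdict, dataclass

# Labels listed in the acceptance criteria.
CONFIDENCE_LABELS = {
    "prefix_strip", "casing_only", "suffix_only", "content_hash", "manual_review",
}
# Assumption: only the two labels the criteria say must be applied are
# applied automatically; the rest wait for their own checks.
AUTO_APPLY = {"prefix_strip", "casing_only"}


@dataclass(frozen=True)
class RenameEntry:
    db_basename: str
    storage_basename: str
    confidence: str

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LABELS:
            raise ValueError(f"unknown confidence label: {self.confidence!r}")


def dump_rename_map(entries: list) -> str:
    """Serialize entries as a JSON array of objects, one per rename pair."""
    return json.dumps([asdict(e) for e in entries], indent=2, sort_keys=True)


entry = RenameEntry(
    "nb-sda-2026-04-01-gap-001.ipynb", "SDA-2026-04-01-gap-001.ipynb", "prefix_strip"
)
print(dump_rename_map([entry]))
```

A list of objects rather than a flat dict keeps room for one storage basename to appear in several entries.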

Plan

  • Re-enumerate from authoritative sources: DB rows where artifact_type =
    'notebook', joined with on-disk inventory of
    data/scidex-artifacts/notebooks/ (resolve via scidex.core.paths).
  • Compute the rename-candidate map. For each DB-orphan basename, try in
    order: exact, case-insensitive, prefix-strip (nb-sda- → SDA-),
    suffix-tolerant (date-only or run-id stripped). Multi-candidate matches
    require a tiebreaker.
  • Where the DB row recorded a metadata.notebook_hash or
    metadata.file_sha256, recompute sha256 on each candidate storage file
    and require an exact hash match before rebinding.
  • Apply rebinds through db_writes.update_artifact_file_path() (or the
    equivalent journaled helper, adding one if it doesn't exist; do NOT
    issue a raw UPDATE) so we have an audit trail.
  • Reuse-check: after rebind, count notebooks-with-files. Write the residual
    list to the spec's Work Log.
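
The candidate cascade and the hash anchor from the plan could be sketched roughly as follows. All function names are hypothetical, the "nb-" prefix is taken from the Background example, and the suffix-tolerant stage is left out because the spec does not pin down the exact suffix formats.

```python
from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large notebooks are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def strip_prefix(name: str, prefix: str = "nb-") -> str:
    """Case-fold the name and drop the assumed legacy prefix if present."""
    lowered = name.casefold()
    return lowered[len(prefix):] if lowered.startswith(prefix) else lowered


def find_candidates(db_basename: str, storage_basenames: list[str]) -> tuple[str, list[str]]:
    """Return (label, hits) from the first stage that yields any hit."""
    stages = [
        ("exact", lambda s: s == db_basename),
        ("casing_only", lambda s: s.casefold() == db_basename.casefold()),
        ("prefix_strip", lambda s: strip_prefix(s) == strip_prefix(db_basename)),
    ]
    for label, predicate in stages:
        hits = [s for s in storage_basenames if predicate(s)]
        if hits:
            return label, hits
    return "unmatched", []


def pick_by_hash(candidates: list[Path], recorded_sha256: str) -> Optional[Path]:
    """Return the single candidate whose sha256 equals the recorded hash, else None."""
    wanted = recorded_sha256.strip().lower()
    matches = [p for p in candidates if sha256_of(p) == wanted]
    return matches[0] if len(matches) == 1 else None
```

Returning None when zero or several candidates match the hash keeps the ambiguous cases out of the automatic path, so they can be labelled manual_review instead.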

Risks

  • A wrong rebind silently links a row to a file that is similar-named but
    semantically different. Hash-anchored matches are safe; pure name-based
    matches must be conservative (e.g. only when no other candidate exists).
  • 231 storage orphans + 357 DB orphans is not 1:1. Some storage files belong
    to multiple DB rows (duplicate analyses re-run); the map must handle a
    storage basename serving more than one row.
  • The rebind worker may contend with other agents on the same row or file:
    acquire row-level locks (the db_writes helper does this) and run the
    pass in transaction batches of ~50 rows each.
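
Two of the risks above have a simple mechanical side, sketched here under assumptions: rows are identified by integer ids, the rename map is a plain dict, and in_batches and rows_per_storage_file are hypothetical helper names. Locking and the transaction itself are left to the db_writes helper and are not shown.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable, size: int = 50) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def rows_per_storage_file(rename_map: dict[int, str]) -> dict[str, list[int]]:
    """Invert row_id -> storage_basename so many-to-one bindings are visible."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for row_id, storage_basename in rename_map.items():
        grouped[storage_basename].append(row_id)
    return dict(grouped)


rename_map = {
    101: "SDA-2026-04-01-gap-001.ipynb",
    102: "SDA-2026-04-01-gap-001.ipynb",
}
shared = {name: ids for name, ids in rows_per_storage_file(rename_map).items() if len(ids) > 1}
for batch in in_batches(sorted(rename_map), size=50):
    pass  # open one transaction per batch and apply the rebinds for these row ids
```

Listing the shared storage basenames up front makes it easier to confirm that each many-to-one binding is a deliberate re-run rather than a matching error.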

Owner

unassigned

File: 2026-05-18_notebook_rename_rebind_spec.md
Modified: 2026-05-19 20:53
Size: 4.6 KB