> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data + would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.
Effort: deep
Background
The 2026-05-18 artifact-file recovery session found notebook artifacts
at 738 / 1,017 with a file on disk. Of the 279 missing rows, the inventory
showed:
- 357 DB basenames have no file anywhere across 600+ unique storage basenames.
- 231 storage files have no DB row pointing at them.
This is too lopsided to be pure data loss — it looks like a rename pattern.
Example pairs observed during the recovery session:
DB row file_path : nb-sda-2026-04-01-gap-001.ipynb
Orphan file in storage : SDA-2026-04-01-gap-001.ipynb
i.e. an early casing/prefix convention (nb-sda-…) was replaced by the
SDA-id convention without updating the corresponding DB rows. A mechanical
exact-basename match misses these; a content-hash or prefix-strip pass would
recover most of them.
Goal
Build a rename map by matching the 357 DB-orphan basenames against the 231
storage-orphan basenames using (a) prefix-strip normalization, (b) trailing-
suffix tolerance, and (c) content-similarity as a tiebreaker. For each
confirmed match, update the DB row's file_path to the surviving storage
basename, then re-verify file-on-disk presence.
Out of scope
- Rebinding rows whose file is genuinely lost (no storage candidate matches
even after normalization). Those stay missing; a separate cleanup spec
can decide deletion vs deprecation.
- The 9 stale notebook artifacts on the pre-migration legacy schema (deferred
from the SQLite retirement on 2026-04-28 — different problem).
Acceptance criteria
☐ A notebook_rename_map.json exists in the recovery output dir
enumerating every (db_basename → storage_basename) pair with its
match-confidence label (prefix_strip / casing_only / suffix_only /
content_hash / manual_review).
☐ Every prefix_strip / casing_only match has been applied; DB row
file_path now points at a real file on disk.
☐ content_hash matches require a sha256 file-content match against
whatever artifact the DB row was originally generated against; no
blind content-similarity rebinds without that anchor.
☐ Notebook-on-disk count rises from 738 to ≥ 900 (target: pull most of
the 231 storage orphans into rows).
☐ A residual report names the rows still missing post-pass and their
probable cause (no storage candidate / ambiguous match / etc.).
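A minimal sketch of how the rename map in the first criterion could be serialized. Only the file name and the five labels come from this spec; the JSON field names (`db_basename`, `storage_basename`, `confidence`) and the helper `write_rename_map` are assumptions made for illustration.

```python
import json
from pathlib import Path

# The five match-confidence labels named in the acceptance criteria.
CONFIDENCE_LABELS = {
    "prefix_strip",
    "casing_only",
    "suffix_only",
    "content_hash",
    "manual_review",
}


def write_rename_map(pairs, out_dir):
    """Write (db_basename, storage_basename, label) triples as JSON.

    The field names are hypothetical; the spec does not fix a schema.
    """
    entries = []
    for db_basename, storage_basename, label in pairs:
        if label not in CONFIDENCE_LABELS:
            raise ValueError(f"unknown confidence label: {label!r}")
        entries.append(
            {
                "db_basename": db_basename,
                "storage_basename": storage_basename,
                "confidence": label,
            }
        )
    out_path = Path(out_dir) / "notebook_rename_map.json"
    out_path.write_text(json.dumps(entries, indent=2) + "\n", encoding="utf-8")
    return out_path
```

Rejecting unknown labels at write time keeps the map auditable: every entry carries one of the labels the criteria above refer to.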
Plan
Re-enumerate from authoritative sources: DB rows where
artifact_type = 'notebook', joined with the on-disk inventory of
data/scidex-artifacts/notebooks/ (resolve via scidex.core.paths).
Compute the rename-candidate map. For each DB-orphan basename, try in
order: exact, case-insensitive, prefix-strip (nb-sda- → SDA-), and
suffix-tolerant (date-only or run-id stripped). Multi-candidate matches
require a tiebreaker.
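The matching cascade could look roughly like the sketch below. It assumes the legacy prefix is literally `nb-` and that comparison after stripping is case-insensitive, which fits the example pair in Background. The names `strip_legacy_prefix`, `STAGES`, and `match_orphans` are hypothetical. The suffix-tolerant stage is deliberately left out because the spec does not define the date-only or run-id suffix formats. The `exact` stage is kept only to mirror the order listed above; it is not one of the map's confidence labels.

```python
from collections import defaultdict


def strip_legacy_prefix(name):
    """Drop the early 'nb-' prefix, e.g. 'nb-sda-x' -> 'sda-x'."""
    return name[3:] if name.lower().startswith("nb-") else name


# Ordered stages: (label, normalization applied to both sides).
# A suffix-tolerant stage would be appended once its format is defined.
STAGES = [
    ("exact", lambda n: n),
    ("casing_only", lambda n: n.casefold()),
    ("prefix_strip", lambda n: strip_legacy_prefix(n).casefold()),
]


def match_orphans(db_orphans, storage_orphans, stages=STAGES):
    """Return {db_basename: (label, [storage candidates])}.

    The first stage that yields any candidate wins. More than one
    candidate is labeled 'manual_review' so a tiebreaker can decide.
    """
    result = {}
    for label, normalize in stages:
        # Index storage basenames by their normalized form for this stage.
        index = defaultdict(list)
        for storage_name in storage_orphans:
            index[normalize(storage_name)].append(storage_name)
        for db_name in db_orphans:
            if db_name in result:
                continue  # already matched by a stricter stage
            candidates = index.get(normalize(db_name), [])
            if candidates:
                final = label if len(candidates) == 1 else "manual_review"
                result[db_name] = (final, sorted(candidates))
    return result
```

DB basenames absent from the returned dict had no candidate at any stage and would feed the residual report.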
Where the DB row recorded a metadata.notebook_hash or
metadata.file_sha256, recompute sha256 on each candidate storage file
and require an exact hash match before rebinding.
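A minimal sketch of the hash anchor, assuming the recorded value is a hex sha256 digest of the raw file bytes (the spec does not say how metadata.notebook_hash or metadata.file_sha256 were computed). The helper names `file_sha256` and `hash_confirmed` are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex sha256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_confirmed(recorded_hash, candidate_paths):
    """Return the candidate paths whose content hash equals the recorded one.

    An empty or missing recorded hash confirms nothing, so a row without
    an anchor is never rebound on content alone.
    """
    if not recorded_hash:
        return []
    wanted = recorded_hash.strip().lower()
    return [p for p in candidate_paths if file_sha256(p) == wanted]
```

A rebind would proceed only when exactly one candidate is returned; zero or several confirmed candidates would fall through to manual_review.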
Apply rebinds through db_writes.update_artifact_file_path() (or the
equivalent journaled helper; add one if it doesn't exist). Do NOT issue a
raw UPDATE, so that every rebind leaves an audit trail.
Re-verify: after the rebind, recount notebooks with a file on disk. Write
the residual list to the spec's Work Log.
Risks
- A wrong rebind silently links a row to a file that is similar-named but
semantically different. Hash-anchored matches are safe; pure name-based
matches must be conservative (e.g. only when no other candidate exists).
- 231 storage orphans + 357 DB orphans is not 1:1. Some storage files belong
to multiple DB rows (duplicate analyses re-run); the map must handle a
storage basename serving more than one row.
- A worker may contend with other agents on the same file during the
rebind: acquire row-level locks (the db_writes helper does this) and run
the pass in transaction batches of ~50 rows.
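Two of the risks above lend themselves to a short sketch: grouping rows that share one storage basename, and slicing the pass into batches of about 50. Both helpers (`rows_by_storage`, `batched`) and the shape of the input mapping are assumptions for illustration. The transaction and locking would still go through the db_writes helper, which is not modeled here.

```python
from collections import defaultdict
from itertools import islice


def rows_by_storage(rebinds):
    """Invert {db_row_id: storage_basename} into {storage_basename: [row ids]}.

    A storage file serving several rows (duplicate analyses re-run) shows
    up as a list with more than one entry.
    """
    grouped = defaultdict(list)
    for row_id, storage_basename in rebinds.items():
        grouped[storage_basename].append(row_id)
    return dict(grouped)


def batched(items, size=50):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded batch would be applied inside one transaction, so a failure rolls back at most ~50 rebinds rather than the whole pass.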
Owner
unassigned