> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data + would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.
Effort: quick
The 2026-05-18 artifact-file recovery session leaned heavily on the sibling
.list.txt files that the backup pipeline emits next to each tarball under
s3://scidex/backups/${HOST}/.... Those listings let recovery code locate
a specific file inside a 10+ GB tarball without downloading the tarball
first — without them, recovery degrades to "fetch the tarball, untar, grep,
toss" which makes the cost of investigating any one missing file
prohibitive.
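As a rough illustration of why the listings matter, the lookup they enable can be sketched as a plain text search. This is only a sketch: it assumes each listing holds `tar -tf` or `tar -tvf` style output with the member path as the last whitespace-separated field (so paths without spaces and no symlink arrows), and the function name and sample paths are hypothetical.

```python
from typing import Iterable, List


def find_in_listing(listing_lines: Iterable[str], needle: str) -> List[str]:
    """Return member paths from a tarball listing that contain `needle`."""
    hits = []
    for raw in listing_lines:
        line = raw.strip()
        if not line:
            continue
        # Assumption: the member path is the last whitespace-separated field.
        path = line.split()[-1]
        if needle in path:
            hits.append(path)
    return hits


# Hypothetical listing content, in `tar -tvf` style.
sample = [
    "-rw-r--r-- scidex/scidex 1048576 2026-05-01 03:10 artifacts/figs/plot_0042.png\n",
    "-rw-r--r-- scidex/scidex 2048 2026-05-01 03:10 artifacts/notes/readme.md\n",
]
print(find_in_listing(sample, "plot_0042"))
```

Because the listing is a few kilobytes rather than 10+ GB, a search like this can run across many tarballs before any one of them is downloaded.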
Current coverage as of the recovery session: 1,924 tarballs have sibling
listings. New tarballs created since the listing pipeline landed (see
reference_backup_architecture.md, 2026-04-24 consolidation) are supposed
to emit them automatically — but there's no monitor confirming that, and
any future regression (a flag flipped, a worker swallowing the listing
step on error) won't be noticed until the next recovery session needs the
listings and finds them missing.
Add a periodic verifier job that lists every .tar.* object under
s3://scidex/backups/ and flags any tarball without a sibling .list.txt
of non-trivial size. The job emits a metric / event row that the fleet
health dashboard can display and fails (non-zero exit) above a configurable
threshold of missing listings.
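The core check described above could look roughly like the following. It is a minimal sketch under stated assumptions: the sibling is named `<tarball>.list.txt`, "non-trivial size" is modelled by a hypothetical `MIN_LISTING_BYTES` constant, the threshold by a hypothetical `max_missing` parameter, and the host and file names in the sample are invented. The real metric or event schema is not defined here, so the summary is just a JSON-serialisable dict.

```python
import json
import re
from typing import Dict, List

# Matches keys ending in .tar or .tar.<ext> (e.g. .tar.gz, .tar.zst).
TARBALL_RE = re.compile(r"\.tar(\.[A-Za-z0-9]+)?$")
LISTING_SUFFIX = ".list.txt"
MIN_LISTING_BYTES = 64  # hypothetical cutoff for "non-trivial size"


def find_missing_listings(objects: Dict[str, int],
                          min_bytes: int = MIN_LISTING_BYTES) -> List[str]:
    """Return tarball keys whose sibling listing is absent or too small.

    `objects` maps S3 key -> size in bytes.
    """
    missing = []
    for key in sorted(objects):
        if not TARBALL_RE.search(key):
            continue
        if objects.get(key + LISTING_SUFFIX, 0) < min_bytes:
            missing.append(key)
    return missing


def summarize(objects: Dict[str, int], max_missing: int = 0) -> dict:
    """Build a summary row; `ok` is False once missing exceeds the threshold."""
    missing = find_missing_listings(objects)
    tarballs = sum(1 for k in objects if TARBALL_RE.search(k))
    return {
        "tarballs": tarballs,
        "missing": len(missing),
        "missing_keys": missing,
        "ok": len(missing) <= max_missing,
    }


# Hypothetical bucket contents: one healthy tarball, one with no listing,
# one with an empty listing.
objects = {
    "backups/host-a/db-2026-05-01.tar.gz": 10_000_000_000,
    "backups/host-a/db-2026-05-01.tar.gz.list.txt": 48_213,
    "backups/host-a/db-2026-05-02.tar.gz": 10_000_000_000,
    "backups/host-a/db-2026-05-03.tar.gz": 10_000_000_000,
    "backups/host-a/db-2026-05-03.tar.gz.list.txt": 0,
}
summary = summarize(objects, max_missing=0)
exit_code = 0 if summary["ok"] else 1  # a real script would pass this to sys.exit
print(json.dumps(summary), exit_code)
```

Keeping the check as a pure function over a key-to-size mapping means it can be unit-tested without S3 access; only the object enumeration needs credentials.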
- scripts/verify_s3_backup_listings.py runs to […]
- […] s3://scidex/backups/ and emits a structured […]
- […] sync-full-s3.sh completes […]
- […] reference_backup_architecture.md). On the SciDEX hosts that means […]
- […] sync-full-s3.sh, not a new […]
- […] fleet_health_events (or whatever […] Orchestra uses for fleet metrics; confirm at implementation […]
- […] aws s3api list-objects-v2 (paginated) against the backups bucket […]
- […] <tarball>.list.txt exists and has size […]
- […] fleet_health_events.
- […] sync-full-s3.sh step so it runs once per day.

unassigned
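
The paginated `list-objects-v2` scan mentioned above could be done from Python with boto3's paginator instead of shelling out to `aws s3api`. This is a sketch, not the planned implementation: the bucket name `scidex` and prefix `backups/` are read off the `s3://scidex/backups/` path, the function names are hypothetical, and the client is passed in so the code can be exercised with a stub.

```python
from typing import Dict, Iterator, Tuple


def iter_objects(client, bucket: str, prefix: str) -> Iterator[Tuple[str, int]]:
    """Yield (key, size) for every object under `prefix`, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


def load_sizes(client, bucket: str = "scidex",
               prefix: str = "backups/") -> Dict[str, int]:
    """Collect the whole prefix into a key -> size mapping."""
    return dict(iter_objects(client, bucket, prefix))


# Usage against real S3 (requires boto3 and credentials):
#   import boto3
#   sizes = load_sizes(boto3.client("s3"))
```

The resulting key-to-size mapping is all the sibling-listing check needs, since both existence and size of each `<tarball>.list.txt` can be read from it without further requests.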