[Atlas] Periodic verifier that every S3 backup tarball has a sibling .list.txt

> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data + would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.

Effort: quick

Background

The 2026-05-18 artifact-file recovery session leaned heavily on the sibling
.list.txt files that the backup pipeline emits next to each tarball under
s3://scidex/backups/${HOST}/.... Those listings let recovery code locate
a specific file inside a 10+ GB tarball without downloading the tarball
first. Without them, recovery degrades to "fetch the tarball, untar, grep,
toss", which makes the cost of investigating any one missing file
prohibitive.

Current coverage as of the recovery session: 1,924 tarballs have sibling
listings. Tarballs created since the listing pipeline landed (see
reference_backup_architecture.md, 2026-04-24 consolidation) are supposed
to get a listing automatically, but no monitor confirms that. Any future
regression (a flag flipped, a worker silently skipping the listing step
on error) will go unnoticed until the next recovery session needs the
listings and finds them missing.

Goal

Add a periodic verifier job that lists every *.tar.* object under
s3://scidex/backups/ and flags any tarball without a sibling .list.txt
of non-trivial size. The job emits a metric / event row that the fleet
health dashboard can display, and it fails (non-zero exit) when the number
of missing listings exceeds a configurable threshold.
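
The core check described above can be sketched as a pure function over (key, size) pairs, which keeps it testable without touching S3. This is a minimal sketch, not the final script: the names is_tarball, find_gaps and min_bytes are hypothetical, and the tarball pattern (a basename ending in .tar or containing .tar.) is an assumption to confirm against the real key layout.

```python
from typing import Iterable, NamedTuple

LISTING_SUFFIX = ".list.txt"  # sibling listing is "<tarball>.list.txt"


class Gaps(NamedTuple):
    total: int             # tarballs seen
    missing: list[str]     # tarballs with no sibling listing at all
    too_small: list[str]   # tarballs whose listing is below min_bytes


def is_tarball(key: str) -> bool:
    """Assumed pattern: basename ends with .tar or contains .tar. (e.g. .tar.gz)."""
    name = key.rsplit("/", 1)[-1]
    if name.endswith(LISTING_SUFFIX):
        return False  # a listing's own name also contains ".tar."
    return name.endswith(".tar") or ".tar." in name


def find_gaps(objects: Iterable[tuple[str, int]], min_bytes: int = 1) -> Gaps:
    """Flag tarballs whose sibling listing is absent or smaller than min_bytes."""
    sizes = dict(objects)
    missing: list[str] = []
    too_small: list[str] = []
    total = 0
    for key in sorted(sizes):
        if not is_tarball(key):
            continue
        total += 1
        listing_size = sizes.get(key + LISTING_SUFFIX)
        if listing_size is None:
            missing.append(key)
        elif listing_size < min_bytes:
            too_small.append(key)
    return Gaps(total, missing, too_small)
```

Raising min_bytes above 1 is one way to express "non-trivial size"; the right floor depends on what an empty-but-valid listing looks like in practice.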

Out of scope

  • Generating missing listings retroactively. If the job finds gaps, that's
a separate follow-up — this spec is verification, not remediation.
  • Validating listing CONTENT (e.g. checksumming entries against the
tarball). That would be useful but is much more expensive; the first
version of the verifier only checks existence and size > 0.
  • Cross-host orchestration. The job scans whatever HOST partition it's
pointed at; running on multiple hosts is a deployment detail.

Acceptance criteria

☐ A new script scripts/verify_s3_backup_listings.py runs to
completion against s3://scidex/backups/ and emits a structured
summary (total tarballs, listings present, listings missing,
listings too small).
☐ A scheduled job runs the script after sync-full-s3.sh completes
each day (the periodic full-sync run is documented in
reference_backup_architecture.md). On the SciDEX hosts that means
hooking into the existing systemd timer that wraps sync-full-s3.sh,
not adding a new cron job.
☐ The job writes a row to fleet_health_events (or whatever
Orchestra uses for fleet metrics — confirm at implementation
time) so the watchdog can surface gaps.
☐ Exits non-zero when missing-listing count exceeds a threshold (e.g.
0 by default for new backups, with a one-off allowance for the
historical pre-listing backlog).
☐ No false positives on the pre-listing historical backlog — those
are excluded by cutoff date or by an explicit allowlist.
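
The threshold and backlog criteria can be illustrated with a small decision function. This is a sketch under stated assumptions: DEFAULT_CUTOFF uses the 2026-04-24 consolidation date mentioned in the Background as a stand-in for the real listing-pipeline land date, and exit_code_for, allowlist and the summary field names are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: stand-in for the listing-pipeline land date; confirm the real one.
DEFAULT_CUTOFF = datetime(2026, 4, 24, tzinfo=timezone.utc)


def exit_code_for(
    missing: dict[str, datetime],
    cutoff: datetime = DEFAULT_CUTOFF,
    allowlist: frozenset[str] = frozenset(),
    threshold: int = 0,
) -> tuple[int, dict]:
    """Return (exit_code, summary) for tarballs that lack a usable listing.

    `missing` maps each offending tarball key to its LastModified time.
    Tarballs older than `cutoff` or named in `allowlist` are exempt.
    """
    exempt = {k for k, ts in missing.items() if ts < cutoff or k in allowlist}
    counted = sorted(set(missing) - exempt)
    summary = {
        "missing_counted": len(counted),
        "missing_exempt": len(exempt),
        "threshold": threshold,
        "keys": counted,
    }
    # Non-zero exit only when the counted gaps exceed the threshold.
    return (1 if len(counted) > threshold else 0), summary
```

Keeping the exempt count in the summary means the historical backlog stays visible on the dashboard without tripping the failure condition.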

Plan

  • Use aws s3api list-objects-v2 (paginated) against the backups bucket
to enumerate tarballs and their would-be sibling listings.
  • For each tarball, check that <tarball>.list.txt exists and has size
> 0. Record missing/zero-size entries.
  • Apply the historical-backlog cutoff (any tarball older than the
listing-pipeline land date is exempt unless explicitly opted in).
  • Emit summary to stdout and structured event to fleet_health_events.
  • Wire into the post-sync-full-s3.sh step so it runs once per day.
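
The enumeration step might look like the following in Python, using the boto3 paginator for list_objects_v2 instead of shelling out to the aws CLI. That swap is an assumption, as are the function names and the bucket/prefix defaults (derived from s3://scidex/backups/). Each entry in a page's Contents already carries Key and Size, so the existence and size checks can run on this output without separate head-object calls.

```python
from typing import Iterable, Iterator


def flatten_pages(pages: Iterable[dict]) -> Iterator[tuple[str, int]]:
    """Yield (key, size) from ListObjectsV2 response pages.

    A page with no matching objects has no "Contents" entry, hence .get().
    """
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


def list_backup_objects(
    bucket: str = "scidex", prefix: str = "backups/"
) -> Iterator[tuple[str, int]]:
    """Enumerate every object under the backups prefix, one page at a time."""
    import boto3  # third-party; imported lazily so flatten_pages is testable offline

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    yield from flatten_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))
```

Pointing prefix at a single HOST partition (for example "backups/<host>/") keeps each run scoped to one host.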

Risks

  • Listing 1,900+ tarballs adds an S3 API cost; use the paginated list
endpoint and limit to one head-object per tarball (or fewer:
list-objects-v2 returns every key under the prefix with its size, so
sibling listings found in the same scan need no extra call).
  • The verifier itself becomes a regression risk: if the script fails
silently, the fleet would think coverage is fine. Wire it with a
no-output-means-failure signal in the watchdog.
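
One possible shape for the no-output-means-failure signal is a staleness check on the verifier's most recent event. This is only a sketch: how the watchdog reads that timestamp is not specified here, the function name is hypothetical, and the 36-hour default (a daily run plus slack) is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def verifier_is_stale(
    last_event_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=36),  # assumed: daily run plus slack
) -> bool:
    """Treat a missing or old verifier event as a failure, not as "all clear"."""
    if last_event_at is None:
        return True  # the verifier has never reported
    return now - last_event_at > max_age
```

With this in place, a verifier that crashes before writing its event shows up as stale on the next watchdog pass.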

Owner

unassigned

File: 2026-05-18_s3_listings_verifier_spec.md
Modified: 2026-05-19 20:53
Size: 4.4 KB