[Atlas] Decide and apply cleanup for 1,421 figure artifacts with no file

> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data and would land new scripts in v1, so it cannot be
> implemented in v1 by default. Two viable paths: (a) redirect the work
> into SciDEX-Substrate (the v2 backend) if and when Substrate has
> migrated the relevant data, or (b) request the narrow "data-corruption
> fix" carve-out from a human, with the new code framed as read-only
> repair against the v1 DB. Until one of those happens, this spec is
> captured for the record but is not actionable.

Effort: deep

Background

After the 2026-05-18 recovery session, 1,421 of 8,120 figure artifact
rows have no file anywhere: not on disk, not in any of the
.orchestra-worktrees/* candidate dirs, and not in the two most recent S3
backup tarballs that were sampled. The recovery session's working theory
is that these rows were generated inside short-lived worker worktrees
that got reaped before the file made it into the canonical
SciDEX-Artifacts submodule (the commit_artifact write-through path is
now mandatory, but some of the older figure generation code pre-dates it).

There are three plausible dispositions:

  • Hard-delete the rows. Smallest blast radius, but breaks any
    referrer (artifact_links, hypothesis evidence, analysis HTML that
    embeds the figure URL).
  • Mark deprecated_reason='no_file' so the lifecycle state
    reflects reality and downstream renderers can hide the dead embed.
  • Leave as-is. Risky: the figure URL keeps appearing in pages and
    returns 404 forever.

    The right answer depends on how heavily referenced the rows are, and we
    do not know that without checking. A blanket mass-delete would be the
    kind of catastrophic action the post-merge guard exists to prevent.

    Goal

    Decide and apply a disposition for the 1,421 orphan figure rows that is
    proportional to their reference count: hard-delete the truly orphaned,
    mark the linked-but-unreachable as deprecated, and leave alone any rows
    whose file a quick last-pass search finds after all, rebinding those
    rows to the recovered file.
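
The decision rule in the Goal can be sketched as a small pure function. This is an illustrative sketch only: the `Disposition` enum, the function name, and the `ref_count` / `file_found` parameters are hypothetical and do not refer to existing SciDEX code.

```python
from enum import Enum


class Disposition(Enum):
    REBIND = "rebind"            # file turned up in the last-pass search
    HARD_DELETE = "hard_delete"  # no file and no inbound references
    DEPRECATE = "deprecate"      # no file, but something still points at the row


def choose_disposition(ref_count: int, file_found: bool) -> Disposition:
    """Map one row's state to the disposition described in the Goal."""
    if ref_count < 0:
        raise ValueError("ref_count must be non-negative")
    if file_found:
        # A findable file wins regardless of reference count.
        return Disposition.REBIND
    if ref_count == 0:
        return Disposition.HARD_DELETE
    return Disposition.DEPRECATE
```

Keeping the rule free of database access makes it easy to review and to replay against the reference-count report before any row is touched.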

    Out of scope

    • The corresponding orphan storage files (figure files on disk with no
    DB row). These are rare for figures (unlike the notebook problem) and
    would be a smaller follow-up if any are found.
    • Re-generating the figures from source code. Most originating notebooks
    ran in ephemeral worktrees and are themselves missing.

    Acceptance criteria

    ☐ A reference-count report exists for each of the 1,421 rows showing
    inbound artifact_links, hypothesis_papers, analysis HTML embed
    count, and wiki {{artifact:ID}} marker count.
    ☐ Rows with zero references are hard-deleted (still via the journaled
    helper, never raw DELETE).
    ☐ Rows with non-zero references get
    lifecycle_state='deprecated' and
    deprecated_reason='no_file_after_2026_05_18_recovery'.
    ☐ No row is touched without an explicit human-readable disposition
    recorded in the spec's Work Log.
    ☐ A post-pass spot check confirms that no lifecycle_state='active'
    figure row lacks a file on disk after this pass.
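
The final spot check could look roughly like the sketch below. It assumes each row has already been fetched as a mapping with `id`, `lifecycle_state`, and a `file_path` relative to the artifacts root; those field names and the function name are assumptions for illustration, not the actual schema.

```python
from pathlib import Path
from typing import Iterable, Mapping


def active_rows_missing_file(rows: Iterable[Mapping], artifacts_root: Path) -> list:
    """Return ids of active figure rows whose file is absent under artifacts_root."""
    missing = []
    for row in rows:
        if row["lifecycle_state"] != "active":
            continue  # deprecated rows are allowed to have no file
        rel = row.get("file_path")  # hypothetical column name
        if not rel or not (artifacts_root / rel).is_file():
            missing.append(row["id"])
    return missing
```

An empty result is what the last acceptance criterion asks for; any id returned is a row that still needs a disposition.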

    Plan

  • Re-verify the 1,421 list against current state (rows may have been
    bound by other concurrent work since the recovery session ran).
  • For each row, count inbound references across artifact_links,
    hypothesis_evidence joins, wiki content_md {{artifact:ID}}
    markers, and analyses' rendered HTML embeds.
  • Partition into three buckets: zero-ref (0), low-ref (1 to 4), and
    referenced (≥5).
  • Walk the zero-ref bucket and hard-delete (journaled). Walk the low-ref
    and referenced buckets and mark deprecated. Record the partition
    counts and a sample of each into the Work Log.
  • Spot-check by re-running the recovery inventory afterward and
    confirming no active figure rows lack a file.
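
The counting and partition steps above might be sketched as follows. The marker syntax `{{artifact:ID}}` comes from this spec; the ID character set, the function names, and the idea of passing one counter per reference source are assumptions made for illustration.

```python
import re
from collections import Counter

# Marker syntax from the spec; the allowed ID characters are an assumption.
MARKER_RE = re.compile(r"\{\{artifact:([^}\s]+)\}\}")

LOW_REF_LIMIT = 5  # low-ref means 1 to 4 references; 5 or more is "referenced"


def count_wiki_markers(pages):
    """Count {{artifact:ID}} markers per artifact id across wiki page bodies."""
    counts = Counter()
    for content_md in pages:
        counts.update(MARKER_RE.findall(content_md))
    return counts


def bucket_for(total_refs):
    """Name the bucket for a total inbound reference count."""
    if total_refs == 0:
        return "zero-ref"
    if total_refs < LOW_REF_LIMIT:
        return "low-ref"
    return "referenced"


def partition(candidate_ids, *ref_counters):
    """Sum the per-source counters and group candidate ids by bucket."""
    buckets = {"zero-ref": [], "low-ref": [], "referenced": []}
    for artifact_id in candidate_ids:
        total = sum(c.get(artifact_id, 0) for c in ref_counters)
        buckets[bucket_for(total)].append(artifact_id)
    return buckets
```

Each reference source (link rows, evidence joins, wiki markers, HTML embeds) would contribute one id-to-count mapping, and the bucket sizes returned by `partition` are the partition counts the Work Log entry needs.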

    Risks

    • "Zero references" today may be "five references tomorrow" if some
    enrichment job is mid-flight. Run when fleet write traffic is low and
    re-verify just before delete.
    • A row may have its file restored by the figure URL re-download spec
    running in parallel. Coordinate by re-querying right before the
    delete batch.
    • This is the largest destructive action implied by the recovery work.
    Treat it as such — open a PR with the partition counts in the body
    and get human review before the delete bucket runs.
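
The "re-verify just before delete" mitigation can be sketched as a filter over the planned batch. Here `fetch_current_state` is a hypothetical callable standing in for a fresh query against the database and file inventory; its return shape is an assumption.

```python
def reverify_delete_batch(planned_ids, fetch_current_state):
    """Split a planned delete batch into ids still safe to delete and ids to skip.

    fetch_current_state(artifact_id) is a hypothetical callable returning a dict
    with 'ref_count' and 'has_file', or None if the row no longer exists.
    """
    safe, skipped = [], {}
    for artifact_id in planned_ids:
        state = fetch_current_state(artifact_id)
        if state is None:
            skipped[artifact_id] = "row already gone"
        elif state["has_file"]:
            skipped[artifact_id] = "file restored since the plan was made"
        elif state["ref_count"] > 0:
            skipped[artifact_id] = "gained references since the plan was made"
        else:
            safe.append(artifact_id)
    return safe, skipped
```

Only the `safe` list would go on to the journaled delete helper; the `skipped` reasons are worth recording in the Work Log so the reviewed PR counts and the executed counts can be reconciled.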

    Owner

    unassigned

    File: 2026-05-18_orphan_figure_cleanup_spec.md
    Modified: 2026-05-19 20:53