[Atlas] Re-download missing paper_figure files from image_url with 404 marker


> v1 freeze note (2026-05-13): SciDEX v1 is frozen for code changes
> (see AGENTS.md § "v1 FROZEN — No Code Changes"). This spec touches
> v1 PG data + would land new scripts in v1, so it cannot be implemented
> in v1 by default. Two viable paths: (a) redirect the work into
> SciDEX-Substrate (the v2 backend) if/when substrate has migrated
> the relevant data, or (b) request the narrow "data-corruption fix"
> carve-out from a human, with the new code framed as read-only repair
> against the v1 DB. Until one of those happens, this spec is captured
> for the record but not actionable.

Effort: standard

Background

The 2026-05-18 SciDEX artifact-file recovery session left only 15 of 4,302
paper_figure artifacts with a local file on disk (4,287 missing, a 99.7%
gap). Each row carries an image_url in metadata (Europe PMC / PMC OA /
publisher CDN), but a sampling pass during the recovery session showed Europe
PMC returns HTTP 404 for ~97% of those URLs. The remaining ~3% are still
fetchable and would close the gap to ~4,150 missing rows; the rest need a
durable "we tried and the upstream is gone" marker so downstream code
doesn't keep retrying.

This is distinct from quest_engine_paper_figure_extraction_backfill_spec.md,
which extracts NEW figures from papers that have no paper_figures row at
all. Here every row already exists; the file behind the URL is what is
missing.

Goal

Run a rate-limit-aware redownload pass over the 4,287 paper_figure rows
without a local file, persist successful fetches to the canonical figures
path, and stamp metadata.image_unavailable=true + a metadata.image_404_at
timestamp on rows whose URL is confirmed dead so we stop re-attempting them.
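
The two metadata stamps named above can be sketched as follows. This is a minimal illustration, assuming metadata is handled as a plain dict; the helper names mark_unavailable and is_retry_eligible are hypothetical and not part of SciDEX.

```python
from datetime import datetime, timezone
from typing import Optional


def mark_unavailable(metadata: dict, now: Optional[datetime] = None) -> dict:
    """Return a copy of `metadata` stamped as permanently unavailable.

    Hypothetical helper: only the two key names come from the spec.
    """
    now = now or datetime.now(timezone.utc)
    stamped = dict(metadata)  # copy so the caller's dict is not mutated
    stamped["image_unavailable"] = True
    stamped["image_404_at"] = now.isoformat()  # ISO 8601 timestamp, UTC
    return stamped


def is_retry_eligible(metadata: dict) -> bool:
    """A row stays retry-eligible until it carries the unavailable stamp."""
    return not metadata.get("image_unavailable", False)
```

Returning a copy rather than mutating in place keeps the stamp easy to diff against the stored row before it is written back.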

Out of scope

  • Re-deriving image_url for rows whose URL was never recorded (separate
problem, needs a re-extraction pass).
  • LLM-based figure description regeneration (figure_generator quest).
  • Files that exist locally but are corrupt (covered by the file_sha256
backfill spec).

Acceptance criteria

☐ All 4,287 paper_figure rows without a file have been attempted at
least once with the recorded image_url.
☐ Successful downloads land at the canonical figures path
(scidex.core.paths resolver, NOT a hardcoded site/figures/papers/)
and the file size + sha256 are recorded in metadata.file_sha256 and
metadata.file_size_bytes.
☐ Rows with a confirmed 404/410/451 carry
metadata.image_unavailable=true and metadata.image_404_at=<iso>.
☐ Rows that errored transiently (timeout, 5xx, connection reset) are
NOT marked unavailable; they remain retry-eligible.
☐ A summary report records: attempted, fetched, hard-404, transient,
per-host counts.
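
One way to read the criteria above is as a three-way classification of each attempt plus a tally. The sketch below assumes each attempt is reduced to an (image_url, status) pair, with status None when no HTTP response arrived (timeout, connection reset). The function names and outcome labels are illustrative, not existing SciDEX code; following the Plan's "else leave alone" rule, every status other than 200/404/410/451 is treated as retryable.

```python
import hashlib
from collections import Counter
from typing import Iterable, Optional, Tuple
from urllib.parse import urlparse

HARD_GONE = {404, 410, 451}  # statuses the spec treats as confirmed dead


def classify(status: Optional[int]) -> str:
    """Map an HTTP status (or None for no response) to an outcome label."""
    if status == 200:
        return "fetched"
    if status in HARD_GONE:
        return "hard_404"
    return "transient"  # timeouts, 5xx, 429, and anything else stay retryable


def file_metadata(data: bytes) -> dict:
    """Build the two file fields the spec asks to record on success."""
    return {
        "file_sha256": hashlib.sha256(data).hexdigest(),
        "file_size_bytes": len(data),
    }


def summarize(results: Iterable[Tuple[str, Optional[int]]]):
    """Tally overall and per-host outcome counts for the summary report."""
    totals, per_host = Counter(), Counter()
    for url, status in results:
        outcome = classify(status)
        totals["attempted"] += 1
        totals[outcome] += 1
        per_host[(urlparse(url).hostname, outcome)] += 1
    return totals, per_host
```

Keeping "transient" as the default branch means an unexpected status can never silently lock a row out of future retries.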

Plan

  • Query paper_figure artifact rows with no file_path (or a file_path
    that does not exist on disk per the recovery session's findings).
  • Group by URL host; throttle per host (Europe PMC ≤ 4 inflight,
    2 req/s; PMC OA ≤ 2 req/s; arbitrary publisher CDNs ≤ 1 req/s default).
  • For each row: HEAD-then-GET; if 200, store under the canonical path and
    compute sha256; if 404/410/451, mark image_unavailable; else leave
    alone and log the retryable reason.
  • Commit fetched files to the SciDEX-Artifacts submodule via
    scidex.atlas.artifact_commit.commit_artifact, never raw git add
    against the canonical checkout (see AGENTS.md "Artifacts" section).
  • Emit a final summary into the spec's Work Log.
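
The host grouping and per-host rate budget from the plan can be sketched as below. This is an illustration under assumptions: the hostnames in HOST_RATE are guesses at where the named services live, and HostThrottle is a hypothetical class. The sketch is sequential, so the inflight cap is trivially met; a concurrent version would add a per-host semaphore.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

# Requests per second per host. Hostnames are assumptions for illustration.
HOST_RATE = {"europepmc.org": 2.0, "www.ncbi.nlm.nih.gov": 2.0}
DEFAULT_RATE = 1.0  # arbitrary publisher CDNs


def group_by_host(urls):
    """Bucket URLs by hostname so each host can be paced independently."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname or ""].append(url)
    return dict(groups)


class HostThrottle:
    """Enforce a minimum spacing between requests to the same host."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock    # injectable for tests
        self._sleep = sleep
        self._next_ok = {}     # host -> earliest time the next request may start

    def wait(self, host: str) -> None:
        """Block until `host` may be contacted again, then reserve the slot."""
        rate = HOST_RATE.get(host, DEFAULT_RATE)
        now = self._clock()
        ready = self._next_ok.get(host, now)
        if ready > now:
            self._sleep(ready - now)
            now = ready
        self._next_ok[host] = now + 1.0 / rate
```

Calling wait(host) immediately before each HEAD or GET keeps the pacing logic out of the fetch code itself.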

Risks

  • Europe PMC has previously rate-limited bulk crawlers; honor Retry-After
    and back off on 429. Do NOT parallelize beyond the per-host budget above.
  • A blanket "mark all 404 as unavailable" without distinguishing transient
    failures locks the rows out of future retries. The transient list must be
    re-runnable.
  • In the worst case, 4,287 small image fetches with per-file submodule
    commits would create 4K+ commits in SciDEX-Artifacts. Batch commits
    (e.g. 100 files per commit) instead of committing one-per-file.
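
Two of the mitigations above lend themselves to a short sketch: turning a Retry-After header into a wait time, and chunking fetched files into fixed-size commit batches. Both helpers are hypothetical, and the 30-second fallback is an assumed default rather than a value from this spec. The actual commit call (scidex.atlas.artifact_commit.commit_artifact) is deliberately not shown because its signature is not given here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Iterator, Optional, Sequence


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 30.0) -> float:
    """Convert a Retry-After header (delta-seconds or HTTP-date) to seconds."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the assumed default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def batches(items: Sequence, size: int = 100) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items, one per commit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

At 100 files per batch, even a full run in which all 4,287 fetches succeed yields at most 43 commits.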

Owner

unassigned

File: 2026-05-18_paper_figure_url_redownload_spec.md
Modified: 2026-05-19 20:53
Size: 4.1 KB