[Substrate] Paper PDF + figure caching with per-paper artifact folders

Status: proposed (open design questions; see § Open decisions)
Layer: Atlas (paper cache) / Forge (figure extraction)
Owner: unassigned
Created: 2026-05-18
Effort: deep
Target repo: [SciDEX-Substrate](https://github.com/SciDEX-AI/SciDEX-Substrate). v1 is frozen (2026-05-13); ship there, not in v1.

Problem

SciDEX has cached paper metadata since day one (data/scidex-papers/<pmid>.json, ~2,875 files, ~21 MB), but has never cached PDFs or figure images. Concretely:

paper_cache.py:_write_file_cache() writes JSON only.
papers.local_path exists but is empty for every row.
papers.fulltext_cached flag was retrofitted by quest_engine_paper_fulltext_cache_backfill_spec to mean "PMC XML/plaintext stored inline in the JSON" — there is still no PDF on disk.
papers.figures_extracted is set by backfill_figures.py / extract_figures_for_batch.py, but figure files land in data/scidex-artifacts/figures/papers/<pmid>/ (separate from the paper folder) and most rows have only remote URLs in paper_figures.image_path.
The page_cache.db SQLite file in the repo root caches HTML page renders and PMC figure URL resolutions — it is not a paper PDF cache.
scidex/core/paths.py defines PAPER_PDF_DIR and PAPER_CLAIMS_DIR but nothing writes there.

Net effect: each time an agent reads a paper, it calls a remote API (PubMed, Unpaywall, PMC) for full text or figures. We lose reproducibility, rate-limit headroom, and offline-replay capability.

The flat layout (<pmid>.json at the root of the papers submodule) also has a scaling ceiling: at the current growth rate (recent batches add ~2,250 papers per push), we will exceed 10K entries in a single directory within ~6 months.

Goal

For every paper fetch in substrate v2, materialize the paper as a first-class artifact with its own per-paper folder under data/scidex-papers/ (or substrate's equivalent root). Each folder holds:

data/scidex-papers/<aa>/<paper_uuid>/
  paper.json          # metadata (today's payload)
  paper.pdf           # full text PDF — when an open-access source exists
  paper.fulltext.xml  # PMC XML when available (current quest_engine_paper_fulltext output)
  figures/<fig_id>.png|jpg
  figures/<fig_id>.caption.txt
  manifest.json       # artifact manifest (matches commit_artifact_to_folder convention)

<aa> = first 2 hex chars of paper_uuid, the same shard scheme as scidex.core.paths.artifact_dir(). Two hex chars give 256 shards, which keeps per-directory entry counts below 4K at a 1M-paper ceiling (1,000,000 / 256 ≈ 3,906), in the style of git's object store.

Path resolution lives in substrate's equivalent of scidex/core/paths.py via four new helpers:

paper_dir(paper_artifact_id) -> Path — mirrors artifact_dir().
paper_pdf_path(paper_artifact_id) -> Path
paper_json_path(paper_artifact_id) -> Path
paper_figures_dir(paper_artifact_id) -> Path
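
The four helpers could look roughly like the sketch below. It is illustrative only: the `PAPERS_ROOT` constant, the `root` parameter, and the UUID normalisation step are assumptions, not part of the existing `scidex/core/paths.py` API.

```python
from pathlib import Path
from uuid import UUID

# Assumption: the substrate papers root; the spec leaves the final location open.
PAPERS_ROOT = Path("data/scidex-papers")


def paper_dir(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    """Return <root>/<aa>/<paper_uuid>, where <aa> is the first 2 hex chars."""
    # Round-trip through UUID so upper-case or brace-wrapped ids map to one folder.
    canonical = str(UUID(paper_artifact_id))
    return root / canonical[:2] / canonical


def paper_json_path(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "paper.json"


def paper_pdf_path(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "paper.pdf"


def paper_figures_dir(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "figures"
```

Normalising the id before sharding is a design choice made here so that the shard prefix cannot differ by letter case; the real helper may simply trust its input.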

Read-time fallback: if <aa>/<uuid>/paper.json is missing, fall back to legacy flat <pmid>.json so existing v1-frozen artifacts remain readable until the migration is complete.
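
A minimal sketch of the read-time fallback, assuming a hypothetical `read_paper_json` function that receives both roots explicitly. The function name and signature are illustrative and do not describe the existing `_read_file_cache`.

```python
import json
from pathlib import Path
from typing import Optional


def read_paper_json(
    sharded_root: Path,
    legacy_root: Path,
    paper_uuid: Optional[str],
    pmid: Optional[str],
) -> Optional[dict]:
    """Try <aa>/<uuid>/paper.json first, then the legacy flat <pmid>.json."""
    candidates = []
    if paper_uuid:
        candidates.append(sharded_root / paper_uuid[:2] / paper_uuid / "paper.json")
    if pmid:
        candidates.append(legacy_root / f"{pmid}.json")
    for path in candidates:
        if path.is_file():
            return json.loads(path.read_text(encoding="utf-8"))
    # Neither layout has the paper; the caller decides whether to fetch remotely.
    return None
```

Ordering the candidates sharded-first means a paper that exists in both layouts is always served from the new folder.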

Final architecture

papers table (PG, substrate)
  paper_id  (existing, paper-<pmid> form)
  artifact_id UUID  ← NEW, FK to artifacts(artifact_id)
  pmid, doi, pmc_id, title, abstract, ...   ← unchanged
  fulltext_cached  (existing, 0|1) — true iff PDF or PMC XML on disk
  figures_extracted (existing, 0|1) — true iff figures/ dir non-empty

artifacts table (PG, substrate)
  artifact_id UUID  (primary)
  artifact_type = 'paper'  ← new value (joins existing enum)
  ...standard artifact columns (origin_type, version_number, ...)

data/scidex-papers/<aa>/<paper_uuid>/
  paper.json
  paper.pdf            (when fetchable)
  paper.fulltext.xml   (when fetchable)
  figures/<n>.png
  manifest.json

paper_figures table
  unchanged columns; image_path now points into the paper folder
  (data/scidex-papers/<aa>/<paper_uuid>/figures/<n>.png) instead of
  data/scidex-artifacts/figures/papers/<pmid>/...
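
The two flag definitions above ("true iff PDF or PMC XML on disk", "true iff figures/ dir non-empty") can be checked mechanically against a paper folder. The following sketch assumes a hypothetical `disk_flags` helper; it is not an existing function.

```python
from pathlib import Path


def disk_flags(folder: Path) -> dict:
    """Derive the two papers-table flags from what is actually on disk."""
    has_pdf = (folder / "paper.pdf").is_file()
    has_xml = (folder / "paper.fulltext.xml").is_file()
    figures = folder / "figures"
    # An existing but empty figures/ directory does not count as extracted.
    has_figures = figures.is_dir() and any(figures.iterdir())
    return {
        "fulltext_cached": int(has_pdf or has_xml),
        "figures_extracted": int(has_figures),
    }
```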

PDF fetch pipeline (substrate side, new module — proposed substrate/papers/pdf_fetcher.py):

1. PMC OA Service: for any paper with a pmc_id, try https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={pmc_id} and follow the FTP tarball. Highest yield for biomed (~30-40%).

2. Unpaywall: https://api.unpaywall.org/v2/{doi}?email=… → best_oa_location.url_for_pdf. Adds another ~30%.

3. Direct preprint URLs: arXiv (https://arxiv.org/pdf/<id>.pdf), bioRxiv/medRxiv (https://www.biorxiv.org/content/{doi}.full.pdf).

4. Paywalled / no OA source: write JSON only, set fulltext_cached=0, and log the skip with its reason in the manifest. Do not write empty PDFs.
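
The ordered fallback above can be sketched as a chain of per-source callables tried in priority order. Everything named here (`PaperIds`, `FetchResult`, `fetch_pdf`, the source labels) is hypothetical. The per-source resolvers, which would call PMC OA, Unpaywall, and the preprint servers, are deliberately left as injected functions so the chain logic stays testable without network access.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class PaperIds:
    pmc_id: Optional[str] = None
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


@dataclass
class FetchResult:
    source: Optional[str]    # which source produced the PDF, or None
    pdf: Optional[bytes]     # PDF bytes, or None when every source failed
    skip_reasons: List[str]  # one entry per source that did not yield a PDF


# A resolver takes the paper ids and returns PDF bytes, or None if unavailable.
Resolver = Callable[[PaperIds], Optional[bytes]]


def fetch_pdf(ids: PaperIds, sources: Sequence[Tuple[str, Resolver]]) -> FetchResult:
    """Try each source in order and return the first payload that looks like a PDF."""
    skipped: List[str] = []
    for name, resolve in sources:
        try:
            data = resolve(ids)
        except Exception as exc:  # a failing source must not abort the chain
            skipped.append(f"{name}: {exc}")
            continue
        # PDF files begin with the magic bytes "%PDF"; reject empty or HTML bodies.
        if data and data.startswith(b"%PDF"):
            return FetchResult(name, data, skipped)
        skipped.append(f"{name}: no PDF")
    # No source succeeded: the caller writes JSON only and leaves fulltext_cached=0.
    return FetchResult(None, None, skipped)
```

A caller would pass resolvers in the order PMC OA, Unpaywall, preprint, and copy `skip_reasons` into the manifest when `pdf` is `None`.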

Open decisions (need user sign-off before implementation)

| # | Decision | Recommended | Tradeoff |
|---|----------|-------------|----------|
| 1 | Folder name = `<paper_uuid>` (new UUID column) or reuse `paper-<pmid>` | UUID | UUID = consistent with `artifact_dir()`, clean join via `papers.artifact_id`. Costs a backfill migration for the 2,875 frozen-v1 papers if we ever want to migrate them, but v1 is frozen so we can leave v1's flat layout alone and start sharded in v2. |
| 2 | Extracted figures co-located with paper (`papers/<aa>/<uuid>/figures/`) or as standalone artifacts (`artifacts/<aa>/<fig_uuid>/`) | Co-located | Co-located = one tree per paper, simpler backup/delete. Standalone = each figure is a first-class artifact in its own shard, deduplicates reused figures, but you chase pointers to assemble a paper. `paper_figures` keeps the existing `artifact_id` row for KG linkage either way; only the file location differs. |
| 3 | Submodule transport: plain git vs git-LFS for `*.pdf` + figure binaries | git-LFS | At ~3-5 MB per PDF × current corpus (~15K cited papers) × growth rate, plain git balloons fast. LFS adds operational surface (LFS server quota, fetch costs) but is the right answer at this scale. Alternative: rely on substrate's blob-store and keep only manifests in the submodule. |
| 4 | PDF coverage tolerance for paywalled papers | Best-effort, leave `fulltext_cached=0` | We will only capture ~50-70% of cited papers across OA sources. Paywalled-only papers keep the metadata-only treatment. No fake PDFs. |

PR plan

| # | Layer | Title | Adds |
|---|-------|-------|------|
| 1 | Atlas | substrate: paper path helpers + `papers.artifact_id` migration | `paper_dir()/paper_pdf_path()/paper_json_path()/paper_figures_dir()`; migration adds nullable `artifact_id` UUID + index on `papers`. |
| 2 | Atlas | substrate: register every paper as an artifact row on first cache | `paper_cache.get_paper()` write path goes through `register_artifact(artifact_type='paper')` and assigns the UUID. |
| 3 | Atlas | substrate: new sharded JSON layout + back-compat read fallback | `_write_file_cache` writes to `paper_dir(uuid)/paper.json`; `_read_file_cache` tries sharded first, falls back to legacy flat name. |
| 4 | Atlas | substrate: PDF fetcher (PMC OA → Unpaywall → preprint) | New `substrate/papers/pdf_fetcher.py`; sets `fulltext_cached=1` on success. |
| 5 | Forge | substrate: figure extraction writes to per-paper `figures/` dir | Updates `_extract_figures_*` callers to use `paper_figures_dir(uuid)`; updates `paper_figures.image_path` writes accordingly. |
| 6 | Atlas | substrate: manifest.json per paper folder | Mirrors `commit_artifact_to_folder` manifest convention. |
| 7 | Senate | Compliance probe extension | `scripts/check_artifact_compliance.py` (substrate copy) flags papers with `fulltext_cached=1` but no PDF on disk, and vice versa. |
| 8 | Atlas | Optional: LFS pointers for `*.pdf` / `figures/*.{png,jpg}` | `.gitattributes` in substrate's papers root; only land if decision 3 = LFS. |

Acceptance criteria

☐ In substrate, calling get_paper(pmid) for a paper with PMC OA availability results in paper.pdf on disk at data/scidex-papers/<aa>/<uuid>/paper.pdf and fulltext_cached=1 in the papers table.

☐ paper_dir(uuid) returns a path under the substrate papers root with the 2-hex shard prefix.

☐ Each cached paper has a row in artifacts with artifact_type='paper' and a manifest.json in its folder.

☐ Figure files written by _extract_figures_* live under paper_figures_dir(uuid), and paper_figures.image_path references that path.

☐ Read fallback: a paper present in the legacy flat layout (v1 frozen submodule) is still readable through get_paper() during the migration window.

☐ scripts/check_artifact_compliance.py (substrate) reports zero fulltext_cached_but_no_pdf violations after a fresh fetch cycle.

☐ Paywalled-only papers cleanly write JSON, leave fulltext_cached=0, and log a structured pdf_fetch_skip event with the reason (paywalled, no_oa_record, unpaywall_404, etc.).
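
The compliance criterion above could be probed with a check along these lines. This is a sketch under stated assumptions: the function name, the row shape `(paper_id, fulltext_cached, folder)`, and the reverse violation label `pdf_but_not_fulltext_cached` are hypothetical. Only `fulltext_cached_but_no_pdf` comes from the spec.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def check_fulltext_compliance(
    rows: Iterable[Tuple[str, int, Path]],
) -> List[Tuple[str, str]]:
    """Compare each row's fulltext_cached flag with paper.pdf on disk.

    rows yields (paper_id, fulltext_cached, paper_folder).
    Returns (paper_id, violation_name) pairs; an empty list means compliant.
    """
    violations = []
    for paper_id, flag, folder in rows:
        pdf_on_disk = (folder / "paper.pdf").is_file()
        if flag == 1 and not pdf_on_disk:
            violations.append((paper_id, "fulltext_cached_but_no_pdf"))
        elif flag == 0 and pdf_on_disk:
            violations.append((paper_id, "pdf_but_not_fulltext_cached"))
    return violations
```

This follows the PR 7 wording literally and looks only at `paper.pdf`. If `fulltext_cached=1` is also meant to cover papers that have `paper.fulltext.xml` but no PDF, the first branch would need to accept the XML file as well.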

Out of scope

v1 backfill: the SciDEX v1 papers submodule is pinned at ac08e0ae (tag v1-frozen-2026-05-13) and will not be updated. Substrate starts with its own clean papers store.
Replacing page_cache.db (UI page render cache + PMC figure URL cache). Distinct mechanism, different lifecycle.
Full-text claim extraction (covered by extract_paper_claims_3ef88d85_spec).

Dependencies

Substrate v2 must have a papers table with the columns listed above. Today's PG papers schema is close — only artifact_id UUID is new.
Substrate artifacts enum must accept artifact_type='paper'.
Decision on git-LFS vs. blob-store transport (#3 above) before PR 4 lands.

Dependents

Claim extraction, figure extraction, replication audits, oncology-skeptic & infectious-skeptic literature checks (all of which currently re-fetch full text from remote APIs on each invocation).
Reproducibility audits (deterministic replay needs local PDFs).

v1 design reference: [docs/design/artifacts_commit_path.md](../../design/artifacts_commit_path.md) — § "Per-artifact-type handling" + § "Paper JSON" row.
v1 path module: scidex/core/paths.py:141-155 (PAPER_CACHE_DIR, PAPER_PDF_DIR, PAPER_CLAIMS_DIR, PAPER_FIGURES_DIR).
v1 cache impl (read-only history): paper_cache.py:229-253.
Related quest: quest_engine_paper_fulltext_cache_backfill_spec.md — earlier attempt that landed PMC XML inline in JSON rather than as a separate PDF.
Related quest: quest_engine_paper_figure_extraction_backfill_spec.md — figure extraction without a paper-folder home.
ADR: docs/planning/decisions/002-artifacts-separate-repo.md — submodule separation.

Work log

2026-05-18 — Spec created

Session investigated the current paper cache (JSON-only, no PDFs ever) and proposed the architecture above.
4 design decisions left open pending user sign-off (table above).
Workflow choice still pending: (a) wait for answers before implementation, or (b) start with recommended defaults and let review redirect.
Target deliberately set to substrate v2 because v1 is frozen as of 2026-05-13.

File: quest_paper_pdf_caching_spec.md

Modified: 2026-05-18 19:22

Size: 10.9 KB