Status: proposed (open design questions — see § Open decisions)
Layer: Atlas (paper cache) / Forge (figure extraction)
Owner: unassigned
Created: 2026-05-18
Effort: deep
Target repo: [SciDEX-Substrate](https://github.com/SciDEX-AI/SciDEX-Substrate) — v1 is frozen (2026-05-13); ship there, not in v1.
SciDEX has cached paper metadata since day one (data/scidex-papers/<pmid>.json, ~2,875 files, ~21 MB), but has never cached PDFs or figure images. Concretely:
- paper_cache.py:_write_file_cache() writes JSON only.
- papers.local_path exists but is empty for every row.
- papers.fulltext_cached flag was retrofitted by quest_engine_paper_fulltext_cache_backfill_spec to mean "PMC XML/plaintext stored inline in the JSON"; there is still no PDF on disk.
- papers.figures_extracted is set by backfill_figures.py / extract_figures_for_batch.py, but figure files land in data/scidex-artifacts/figures/papers/<pmid>/ (separate from the paper folder) and most rows have only remote URLs in paper_figures.image_path.
- page_cache.db SQLite file in the repo root caches HTML page renders and PMC figure URL resolutions; it is not a paper PDF cache.
- scidex/core/paths.py defines PAPER_PDF_DIR and PAPER_CLAIMS_DIR but nothing writes there.

The flat layout (<pmid>.json at the root of the papers submodule) also has a scaling ceiling: at the current growth rate (recent batches add ~2,250 papers per push) we will exceed 10K entries in a single directory within ~6 months.
For every paper fetch in substrate v2, materialize the paper as a first-class artifact with its own per-paper folder under data/scidex-papers/ (or substrate's equivalent root). Each folder holds:
data/scidex-papers/<aa>/<paper_uuid>/
paper.json # metadata (today's payload)
paper.pdf # full text PDF — when an open-access source exists
paper.fulltext.xml # PMC XML when available (current quest_engine_paper_fulltext output)
figures/<fig_id>.png|jpg
figures/<fig_id>.caption.txt
manifest.json # artifact manifest (matches commit_artifact_to_folder convention)

<aa> = first 2 hex chars of paper_uuid, the same shard scheme as scidex.core.paths.artifact_dir(). Two hex characters give 256 shards, which keeps per-directory entry counts below 4K at a 1M-paper ceiling (git-objects style).
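The shard scheme can be illustrated with a minimal sketch. It assumes paper_uuid is a standard hyphenated UUID string and that the shard is taken from its lowercase canonical form; whether scidex.core.paths.artifact_dir() normalizes case the same way is not stated in this spec, so treat that as an assumption. PAPERS_ROOT and the function body are illustrative, not the substrate implementation.

```python
import uuid
from pathlib import Path

# Root named in this spec; substrate's actual root may differ (assumption).
PAPERS_ROOT = Path("data/scidex-papers")

# Two hex characters give 16 * 16 possible shard directories.
SHARD_COUNT = 16 ** 2


def paper_dir(paper_uuid: str, root: Path = PAPERS_ROOT) -> Path:
    """Return <root>/<aa>/<paper_uuid>, where <aa> is the first two hex chars."""
    # uuid.UUID validates the input; str() yields the lowercase hyphenated form.
    canonical = str(uuid.UUID(paper_uuid))
    return root / canonical[:2] / canonical


if __name__ == "__main__":
    example = "3F2A9C1E-0B7D-4E5A-9C3D-1A2B3C4D5E6F"
    print(paper_dir(example).as_posix())
    # Average folders per shard at the 1M-paper ceiling.
    print(1_000_000 / SHARD_COUNT)
```

With 256 shards, a 1M-paper store averages about 3,906 folders per shard (assuming UUIDs are uniformly distributed), which is consistent with the sub-4K figure above.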
Path resolution lives in substrate's equivalent of scidex/core/paths.py via four new helpers:
- paper_dir(paper_artifact_id) -> Path (mirrors artifact_dir())
- paper_pdf_path(paper_artifact_id) -> Path
- paper_json_path(paper_artifact_id) -> Path
- paper_figures_dir(paper_artifact_id) -> Path

If <aa>/<uuid>/paper.json is missing, fall back to the legacy flat <pmid>.json so existing v1-frozen artifacts remain readable until the migration is complete.

papers table (PG, substrate)
paper_id (existing, paper-<pmid> form)
artifact_id UUID ← NEW, FK to artifacts(artifact_id)
pmid, doi, pmc_id, title, abstract, ... ← unchanged
fulltext_cached (existing, 0|1) — true iff PDF or PMC XML on disk
figures_extracted (existing, 0|1) — true iff figures/ dir non-empty
artifacts table (PG, substrate)
artifact_id UUID (primary)
artifact_type = 'paper' ← new value (joins existing enum)
...standard artifact columns (origin_type, version_number, ...)
data/scidex-papers/<aa>/<paper_uuid>/
paper.json
paper.pdf (when fetchable)
paper.fulltext.xml (when fetchable)
figures/<n>.png
manifest.json
paper_figures table
unchanged columns; image_path now points into the paper folder
(data/scidex-papers/<aa>/<paper_uuid>/figures/<n>.png) instead of
data/scidex-artifacts/figures/papers/<pmid>/...

PDF fetch pipeline (substrate side, new module; proposed substrate/papers/pdf_fetcher.py):
1. If the paper has a pmc_id, try https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={pmc_id} and follow the FTP tarball. Highest yield for biomed (~30-40%).
2. Unpaywall: https://api.unpaywall.org/v2/{doi}?email=… → best_oa_location.url_for_pdf. Adds another ~30%.
3. Preprint servers: arXiv (https://arxiv.org/pdf/<id>.pdf), bioRxiv/medRxiv (https://www.biorxiv.org/content/{doi}.full.pdf).
4. Otherwise: fulltext_cached=0, log the skip with reason in the manifest. Do not write empty PDFs.

- get_paper(pmid) for a paper with PMC OA availability results in paper.pdf on disk at data/scidex-papers/<aa>/<uuid>/paper.pdf and fulltext_cached=1 in the papers table.
- paper_dir(uuid) returns a path under the substrate papers root with the 2-hex shard prefix.
- Each paper has a row in artifacts with artifact_type='paper' and a manifest.json in its folder.
- Figures produced by _extract_figures_* live under paper_figures_dir(uuid), and paper_figures.image_path references that path.
- Legacy flat <pmid>.json files remain readable through get_paper() during the migration window.
- scripts/check_artifact_compliance.py (substrate) reports zero fulltext_cached_but_no_pdf violations after a fresh fetch cycle.

- No open-access PDF: fulltext_cached=0, and log a structured pdf_fetch_skip event with the reason (paywalled, no_oa_record, unpaywall_404, etc.).

- v1 is frozen at ac08e0ae (tag v1-frozen-2026-05-13) and is not getting updated. Substrate starts with its own clean papers store.
- page_cache.db (UI page render cache + PMC figure URL cache). Distinct mechanism, different lifecycle.
- Claim extraction (extract_paper_claims_3ef88d85_spec).

- papers table with the columns listed above.
  Today's PG papers schema is close; only artifact_id UUID is new.
- artifacts enum must accept artifact_type='paper'.

- [docs/design/artifacts_commit_path.md](../../design/artifacts_commit_path.md): § "Per-artifact-type handling" + § "Paper JSON" row.
- scidex/core/paths.py:141-155 (PAPER_CACHE_DIR, PAPER_PDF_DIR, PAPER_CLAIMS_DIR, PAPER_FIGURES_DIR).
- paper_cache.py:229-253.
- quest_engine_paper_fulltext_cache_backfill_spec.md: earlier attempt that landed PMC XML inline in JSON rather than as a separate PDF.
- quest_engine_paper_figure_extraction_backfill_spec.md: figure extraction without a paper-folder home.
- docs/planning/decisions/002-artifacts-separate-repo.md: submodule separation.
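
The ordered fallback in the PDF fetch pipeline described earlier (PMC OA, then Unpaywall, then preprint servers, then skip) could be sketched as follows. This is a hypothetical outline, not the proposed pdf_fetcher.py: PaperIds, FetchResult, Resolver, and fetch_pdf are illustrative names, each resolver is injected so no network call is made here, and the single fallback reason no_oa_record is a simplification of the reason list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PaperIds:
    """Identifiers a resolver may use (hypothetical container)."""
    pmc_id: Optional[str] = None
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None


@dataclass
class FetchResult:
    pdf: Optional[bytes]        # PDF bytes, or None when every source missed
    source: Optional[str]       # name of the source that produced the PDF
    skip_reason: Optional[str]  # reason to log when no PDF was found


# A resolver returns PDF bytes, or None when its source has nothing.
Resolver = Callable[[PaperIds], Optional[bytes]]


def fetch_pdf(ids: PaperIds, resolvers: list[tuple[str, Resolver]]) -> FetchResult:
    """Try each (name, resolver) pair in order and stop at the first hit."""
    for name, resolve in resolvers:
        pdf = resolve(ids)
        # Empty bytes count as a miss, so an empty PDF is never written.
        if pdf:
            return FetchResult(pdf=pdf, source=name, skip_reason=None)
    # Assumption: one generic reason; a real fetcher would record why each source failed.
    return FetchResult(pdf=None, source=None, skip_reason="no_oa_record")


if __name__ == "__main__":
    # Stub resolvers standing in for the PMC OA, Unpaywall, and preprint lookups.
    def pmc_miss(ids: PaperIds) -> Optional[bytes]:
        return None

    def preprint_hit(ids: PaperIds) -> Optional[bytes]:
        return b"%PDF-1.7 stub"

    result = fetch_pdf(
        PaperIds(doi="10.1101/example"),
        [("pmc_oa", pmc_miss), ("preprint", preprint_hit)],
    )
    print(result.source, result.skip_reason)
```

A caller would write paper.pdf and set fulltext_cached=1 only when result.pdf is not None, and otherwise record result.skip_reason with fulltext_cached=0.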