[Substrate] Paper PDF + figure caching with per-paper artifact folders

Status: proposed (open design questions; see § Open decisions)
Layer: Atlas (paper cache) / Forge (figure extraction)
Owner: unassigned
Created: 2026-05-18
Effort: deep
Target repo: [SciDEX-Substrate](https://github.com/SciDEX-AI/SciDEX-Substrate). v1 is frozen (2026-05-13); ship there, not in v1.

Problem

SciDEX has cached paper metadata since day one (data/scidex-papers/<pmid>.json, ~2,875 files, ~21 MB), but has never cached PDFs or figure images. Concretely:

  • paper_cache.py:_write_file_cache() writes JSON only.
  • papers.local_path exists but is empty for every row.
  • papers.fulltext_cached flag was retrofitted by quest_engine_paper_fulltext_cache_backfill_spec to mean "PMC XML/plaintext stored inline in the JSON" — there is still no PDF on disk.
  • papers.figures_extracted is set by backfill_figures.py / extract_figures_for_batch.py, but figure files land in data/scidex-artifacts/figures/papers/<pmid>/ (separate from the paper folder) and most rows have only remote URLs in paper_figures.image_path.
  • The page_cache.db SQLite file in the repo root caches HTML page renders and PMC figure URL resolutions — it is not a paper PDF cache.
  • scidex/core/paths.py defines PAPER_PDF_DIR and PAPER_CLAIMS_DIR but nothing writes there.

Net effect: each time an agent reads a paper, it calls a remote API (PubMed, Unpaywall, PMC) for full text or figures. We lose reproducibility, rate-limit headroom, and offline-replay capability.

The flat layout (<pmid>.json at the root of the papers submodule) also has a scaling ceiling — at the current growth rate (recent batches add ~2,250 papers per push) we will exceed 10K entries in a single directory inside ~6 months.

Goal

For every paper fetch in substrate v2, materialize the paper as a first-class artifact with its own per-paper folder under data/scidex-papers/ (or substrate's equivalent root). Each folder holds:

data/scidex-papers/<aa>/<paper_uuid>/
  paper.json          # metadata (today's payload)
  paper.pdf           # full text PDF — when an open-access source exists
  paper.fulltext.xml  # PMC XML when available (current quest_engine_paper_fulltext output)
  figures/<fig_id>.png|jpg
  figures/<fig_id>.caption.txt
  manifest.json       # artifact manifest (matches commit_artifact_to_folder convention)

<aa> = first 2 hex chars of paper_uuid, the same shard scheme as scidex.core.paths.artifact_dir(). Two hex characters give 256 shards, which keeps per-directory entry counts at roughly 3,900 (below 4K) at the 1M-paper ceiling, in the style of git's object store.
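
The shard arithmetic can be checked in a few lines. This is only an illustrative calculation; it assumes paper UUIDs spread evenly across shards, which holds on average for random UUIDs but is not guaranteed for any particular corpus.

```python
# Two hex characters give 16 * 16 possible shard directories.
SHARD_HEX_CHARS = 2
shard_count = 16 ** SHARD_HEX_CHARS

# Average number of paper folders per shard at the 1M-paper ceiling,
# assuming an even spread of UUIDs across shards.
paper_ceiling = 1_000_000
avg_per_shard = paper_ceiling / shard_count

print(shard_count)           # 256
print(round(avg_per_shard))  # 3906
```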

Path resolution lives in substrate's equivalent of scidex/core/paths.py via four new helpers:

  • paper_dir(paper_artifact_id) -> Path — mirrors artifact_dir().
  • paper_pdf_path(paper_artifact_id) -> Path
  • paper_json_path(paper_artifact_id) -> Path
  • paper_figures_dir(paper_artifact_id) -> Path
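
A minimal sketch of how these helpers might look, assuming the papers root is data/scidex-papers and that IDs are standard UUID strings. The PAPERS_ROOT constant and the optional root parameter are illustrative assumptions, not part of the proposed signatures; the real helpers would mirror whatever artifact_dir() does in substrate.

```python
import uuid
from pathlib import Path

# Assumption: the substrate papers root. The spec leaves the real root open.
PAPERS_ROOT = Path("data/scidex-papers")


def paper_dir(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    """Return <root>/<aa>/<paper_uuid>, where <aa> is the first 2 hex chars."""
    # Parsing through uuid.UUID rejects malformed IDs and normalises the
    # string to lowercase, so the shard prefix is always lowercase hex.
    canonical = str(uuid.UUID(paper_artifact_id))
    return root / canonical[:2] / canonical


def paper_json_path(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "paper.json"


def paper_pdf_path(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "paper.pdf"


def paper_figures_dir(paper_artifact_id: str, root: Path = PAPERS_ROOT) -> Path:
    return paper_dir(paper_artifact_id, root) / "figures"
```

Normalising the ID before slicing avoids two shard directories (for example 3F and 3f) for the same paper on case-sensitive filesystems.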

Read-time fallback: if <aa>/<uuid>/paper.json is missing, fall back to legacy flat <pmid>.json so existing v1-frozen artifacts remain readable until the migration is complete.
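
The fallback order can be sketched as follows. The function name read_paper_json and its parameters are hypothetical; the actual read path would live in _read_file_cache, and how the UUID is looked up from the PMID is not specified here.

```python
import json
from pathlib import Path
from typing import Optional


def read_paper_json(root: Path, paper_uuid: Optional[str], pmid: str) -> Optional[dict]:
    """Try the sharded layout first, then the legacy flat <pmid>.json."""
    candidates = []
    if paper_uuid:
        # New layout: <root>/<aa>/<paper_uuid>/paper.json
        candidates.append(root / paper_uuid[:2] / paper_uuid / "paper.json")
    # Legacy v1 layout: <root>/<pmid>.json
    candidates.append(root / f"{pmid}.json")

    for path in candidates:
        if path.is_file():
            with path.open(encoding="utf-8") as fh:
                return json.load(fh)
    return None  # not cached in either layout
```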

Final architecture

papers table (PG, substrate)
  paper_id  (existing, paper-<pmid> form)
  artifact_id UUID  ← NEW, FK to artifacts(artifact_id)
  pmid, doi, pmc_id, title, abstract, ...   ← unchanged
  fulltext_cached  (existing, 0|1) — true iff PDF or PMC XML on disk
  figures_extracted (existing, 0|1) — true iff figures/ dir non-empty

artifacts table (PG, substrate)
  artifact_id UUID  (primary)
  artifact_type = 'paper'  ← new value (joins existing enum)
  ...standard artifact columns (origin_type, version_number, ...)

data/scidex-papers/<aa>/<paper_uuid>/
  paper.json
  paper.pdf            (when fetchable)
  paper.fulltext.xml   (when fetchable)
  figures/<n>.png
  manifest.json

paper_figures table
  unchanged columns; image_path now points into the paper folder
  (data/scidex-papers/<aa>/<paper_uuid>/figures/<n>.png) instead of
  data/scidex-artifacts/figures/papers/<pmid>/...

PDF fetch pipeline (substrate side, new module — proposed substrate/papers/pdf_fetcher.py):

  • PMC OA Service — for any paper with a pmc_id, try https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={pmc_id} and follow the FTP tarball. Highest yield for biomed (~30-40%).
  • Unpaywall: query https://api.unpaywall.org/v2/{doi}?email=… and read best_oa_location.url_for_pdf. Adds another ~30%.
  • Direct preprint URLs — arXiv (https://arxiv.org/pdf/<id>.pdf), bioRxiv/medRxiv (https://www.biorxiv.org/content/{doi}.full.pdf).
  • Paywalled / no OA source — write JSON only, set fulltext_cached=0, log the skip with reason in the manifest. Do not write empty PDFs.
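
The source ordering above could be wired together roughly like this. It is a sketch under stated assumptions: PaperIds, FetchResult, and fetch_pdf are hypothetical names, each fetcher is an injected callable that does its own HTTP work, and the "no_pdf" reason string is a placeholder. The only check performed is the standard %PDF- file signature, which guards against writing empty or non-PDF payloads.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PaperIds:
    pmc_id: Optional[str] = None
    doi: Optional[str] = None


@dataclass
class FetchResult:
    pdf: Optional[bytes]
    source: Optional[str]
    # Per-source reason for every source that did not yield a PDF.
    skip_reasons: dict[str, str] = field(default_factory=dict)


# A fetcher takes the paper's IDs and returns PDF bytes or None.
Fetcher = Callable[[PaperIds], Optional[bytes]]


def fetch_pdf(ids: PaperIds, fetchers: list[tuple[str, Fetcher]]) -> FetchResult:
    """Try each (name, fetcher) in order; stop at the first real PDF."""
    skip_reasons: dict[str, str] = {}
    for name, fetcher in fetchers:
        try:
            data = fetcher(ids)
        except Exception as exc:  # a failing source must not abort the chain
            skip_reasons[name] = f"error: {exc}"
            continue
        # Accept only payloads that start with the PDF file signature.
        if data and data.startswith(b"%PDF-"):
            return FetchResult(pdf=data, source=name, skip_reasons=skip_reasons)
        skip_reasons[name] = "no_pdf"
    # No source produced a PDF: the caller writes JSON only and records
    # skip_reasons in the manifest, leaving fulltext_cached=0.
    return FetchResult(pdf=None, source=None, skip_reasons=skip_reasons)
```

Injecting the fetchers keeps the ordering logic testable without network access, for example `fetch_pdf(ids, [("pmc_oa", fake_pmc), ("unpaywall", fake_unpaywall)])`.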

Open decisions (need user sign-off before implementation)

| # | Decision | Recommended | Tradeoff |
|---|----------|-------------|----------|
| 1 | Folder name = <paper_uuid> (new UUID column) or reuse paper-<pmid> | UUID | UUID = consistent with artifact_dir(), clean join via papers.artifact_id. Costs a backfill migration for the 2,875 frozen-v1 papers if we ever want to migrate them, but v1 is frozen, so we can leave v1's flat layout alone and start sharded in v2. |
| 2 | Extracted figures co-located with paper (papers/<aa>/<uuid>/figures/) or as standalone artifacts (artifacts/<aa>/<fig_uuid>/) | Co-located | Co-located = one tree per paper, simpler backup/delete. Standalone = each figure is a first-class artifact in its own shard, deduplicates reused figures, but you chase pointers to assemble a paper. paper_figures keeps the existing artifact_id row for KG linkage either way; only the file location differs. |
| 3 | Submodule transport: plain git vs git-LFS for *.pdf + figure binaries | git-LFS | At ~3-5 MB per PDF × current corpus (~15K cited papers) × growth rate, plain git balloons fast. LFS adds operational surface (LFS server quota, fetch costs) but is the right answer at this scale. Alternative: rely on substrate's blob-store and keep only manifests in the submodule. |
| 4 | PDF coverage tolerance for paywalled papers | Best-effort, leave fulltext_cached=0 | We will only capture ~50-70% of cited papers across OA sources. Paywalled-only papers keep the metadata-only treatment. No fake PDFs. |

PR plan

| # | Layer | Title | Adds |
|---|-------|-------|------|
| 1 | Atlas | substrate: paper path helpers + papers.artifact_id migration | paper_dir()/paper_pdf_path()/paper_json_path()/paper_figures_dir(); migration adds nullable artifact_id UUID + index on papers. |
| 2 | Atlas | substrate: register every paper as an artifact row on first cache | paper_cache.get_paper() write path goes through register_artifact(artifact_type='paper') and assigns the UUID. |
| 3 | Atlas | substrate: new sharded JSON layout + back-compat read fallback | _write_file_cache writes to paper_dir(uuid)/paper.json; _read_file_cache tries sharded first, falls back to legacy flat name. |
| 4 | Atlas | substrate: PDF fetcher (PMC OA → Unpaywall → preprint) | New substrate/papers/pdf_fetcher.py; sets fulltext_cached=1 on success. |
| 5 | Forge | substrate: figure extraction writes to per-paper figures/ dir | Updates _extract_figures_* callers to use paper_figures_dir(uuid); updates paper_figures.image_path writes accordingly. |
| 6 | Atlas | substrate: manifest.json per paper folder | Mirrors commit_artifact_to_folder manifest convention. |
| 7 | Senate | Compliance probe extension | scripts/check_artifact_compliance.py (substrate copy) flags papers with fulltext_cached=1 but no PDF on disk, and vice versa. |
| 8 | Atlas | Optional: LFS pointers for *.pdf / figures/*.{png,jpg} | .gitattributes in substrate's papers root; only land if decision 3 = LFS. |

Acceptance criteria

  ☐ In substrate, calling get_paper(pmid) for a paper with PMC OA availability results in paper.pdf on disk at data/scidex-papers/<aa>/<uuid>/paper.pdf and fulltext_cached=1 in the papers table.
  ☐ paper_dir(uuid) returns a path under the substrate papers root with the 2-hex shard prefix.
  ☐ Each cached paper has a row in artifacts with artifact_type='paper' and a manifest.json in its folder.
  ☐ Figure files written by _extract_figures_* live under paper_figures_dir(uuid), and paper_figures.image_path references that path.
  ☐ Read fallback: a paper present in the legacy flat layout (v1 frozen submodule) is still readable through get_paper() during the migration window.
  ☐ scripts/check_artifact_compliance.py (substrate) reports zero fulltext_cached_but_no_pdf violations after a fresh fetch cycle.
  ☐ Paywalled-only papers cleanly write JSON, leave fulltext_cached=0, and log a structured pdf_fetch_skip event with the reason (paywalled, no_oa_record, unpaywall_404, etc.).
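
The compliance criterion could be probed along these lines. This is a hedged sketch, not the actual check_artifact_compliance.py logic: check_fulltext_flags is a hypothetical name, rows stands in for a query over the papers table, and the reverse violation label pdf_but_not_fulltext_cached is invented for illustration. It checks only paper.pdf, as PR 7 describes; if fulltext_cached is also meant to cover PMC XML, the check would need to look at paper.fulltext.xml as well.

```python
from pathlib import Path
from typing import Iterable


def check_fulltext_flags(
    rows: Iterable[tuple[str, int]], root: Path
) -> list[tuple[str, str]]:
    """Compare each paper's fulltext_cached flag with paper.pdf on disk.

    rows yields (paper_uuid, fulltext_cached) pairs.
    Returns (paper_uuid, violation) pairs; an empty list means compliant.
    """
    violations = []
    for paper_uuid, fulltext_cached in rows:
        pdf = root / paper_uuid[:2] / paper_uuid / "paper.pdf"
        # A zero-byte file does not count as a cached PDF.
        has_pdf = pdf.is_file() and pdf.stat().st_size > 0
        if fulltext_cached and not has_pdf:
            violations.append((paper_uuid, "fulltext_cached_but_no_pdf"))
        elif has_pdf and not fulltext_cached:
            violations.append((paper_uuid, "pdf_but_not_fulltext_cached"))
    return violations
```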

Out of scope

  • v1 backfill: the SciDEX v1 papers submodule is pinned at ac08e0ae (tag v1-frozen-2026-05-13) and is not getting updated. Substrate starts with its own clean papers store.
  • Replacing page_cache.db (UI page render cache + PMC figure URL cache). Distinct mechanism, different lifecycle.
  • Full-text claim extraction (covered by extract_paper_claims_3ef88d85_spec).

Dependencies

  • Substrate v2 must have a papers table with the columns listed above. Today's PG papers schema is close; only artifact_id UUID is new.
  • Substrate artifacts enum must accept artifact_type='paper'.
  • Decision on git-LFS vs. blob-store transport (#3 above) before PR 4 lands.

Dependents

  • Claim extraction, figure extraction, replication audits, oncology-skeptic & infectious-skeptic literature checks (all currently re-fetch full text from remote APIs on each invocation).
  • Reproducibility audits (deterministic replay needs local PDFs).

Related

  • v1 design reference: [docs/design/artifacts_commit_path.md](../../design/artifacts_commit_path.md), § "Per-artifact-type handling" + § "Paper JSON" row.
  • v1 path module: scidex/core/paths.py:141-155 (PAPER_CACHE_DIR, PAPER_PDF_DIR, PAPER_CLAIMS_DIR, PAPER_FIGURES_DIR).
  • v1 cache impl (read-only history): paper_cache.py:229-253.
  • Related quest: quest_engine_paper_fulltext_cache_backfill_spec.md (earlier attempt that landed PMC XML inline in JSON rather than as a separate PDF).
  • Related quest: quest_engine_paper_figure_extraction_backfill_spec.md (figure extraction without a paper-folder home).
  • ADR: docs/planning/decisions/002-artifacts-separate-repo.md (submodule separation).

Work log

2026-05-18: Spec created

  • Session investigated the current paper cache (JSON-only, no PDFs ever) and proposed the architecture above.
  • 4 design decisions left open pending user sign-off (table above).
  • Workflow choice still pending: (a) wait for answers before implementation, or (b) start with recommended defaults and let review redirect.
  • Target deliberately set to substrate v2 because v1 is frozen as of 2026-05-13.

File: quest_paper_pdf_caching_spec.md
Modified: 2026-05-18 19:22
Size: 10.9 KB