[Atlas] Processing step lineage — track transforms in provenance chains

Goal

Extend the provenance system to capture not just parent-child artifact relationships but
the processing steps between them. When an experiment is extracted from a paper, the
provenance should record: "Paper 12345 was processed by extraction-agent using
llm_structured_extraction method with schema v2, producing experiment artifact X."

This creates a full audit trail of how every artifact was constructed.

Current State

- artifact_links captures derives_from, cites, and extends relationships
- provenance_chain JSON in artifacts captures parent artifacts
- Neither captures the transform applied (which method, which agent, which parameters)

Acceptance Criteria

☑ processing_steps table or extended artifact_links metadata:

- source_artifact_id — input artifact
- target_artifact_id — output artifact
- step_type — extraction, analysis, aggregation, transformation, validation, debate
- agent_id — which agent performed the step
- method — what method/tool was used
- parameters — JSON of method parameters
- started_at, completed_at — timing (populated in _upsert_processing_step)
- input_hash, output_hash — for reproducibility verification (auto-derived from artifact content_hashes)
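
As an illustration of the field list above, here is a minimal SQLite sketch of what such a table could look like. The column names come from the list; the column types, the surrogate id column, and the CHECK constraint on step_type are assumptions made for this example and are not taken from the actual ensure_processing_steps_schema.

```python
import sqlite3

# Hypothetical DDL: column names follow the field list above; the types,
# the id column, and the CHECK constraint are assumptions for illustration.
PROCESSING_STEPS_DDL = """
CREATE TABLE IF NOT EXISTS processing_steps (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    source_artifact_id TEXT NOT NULL,
    target_artifact_id TEXT NOT NULL,
    step_type          TEXT NOT NULL CHECK (step_type IN (
        'extraction', 'analysis', 'aggregation',
        'transformation', 'validation', 'debate')),
    agent_id           TEXT,
    method             TEXT,
    parameters         TEXT,   -- JSON-encoded method parameters
    started_at         TEXT,
    completed_at       TEXT,
    input_hash         TEXT,
    output_hash        TEXT
)
"""


def create_processing_steps(conn: sqlite3.Connection) -> list[str]:
    """Create the table and return its column names in declaration order."""
    conn.execute(PROCESSING_STEPS_DDL)
    return [row[1] for row in conn.execute("PRAGMA table_info(processing_steps)")]


columns = create_processing_steps(sqlite3.connect(":memory:"))
```

Constraining step_type in the schema is one possible design choice; validating it in application code would work as well and is easier to extend when new step types are added.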

☑ record_processing_step() function in scidex/atlas/artifact_registry.py — public wrapper around _upsert_processing_step

☑ Processing steps shown in provenance graph visualization — get_provenance_graph enriches edges with processing step metadata via LEFT JOIN

☑ Reproducibility check: check_reproducibility(source_id, method, params) in artifact_registry; exposed via GET /api/artifact/{id}/reproducibility-check

☑ API: GET /api/artifact/{id}/processing-history — delegates to get_processing_lineage
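
The edge enrichment named in the provenance graph criterion above can be sketched with a LEFT JOIN, so that links without a recorded step still appear, with empty step fields. The table layouts here are simplified assumptions (in particular the artifact_links columns and their orientation), not the real schema, and the query is not the one used by get_provenance_graph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Simplified, hypothetical tables. Orientation assumed for both:
# source is the input artifact, target is the output artifact.
conn.executescript("""
CREATE TABLE artifact_links (
    source_artifact_id TEXT, target_artifact_id TEXT, link_type TEXT);
CREATE TABLE processing_steps (
    source_artifact_id TEXT, target_artifact_id TEXT,
    step_type TEXT, agent_id TEXT, method TEXT);
INSERT INTO artifact_links VALUES
    ('paper-12345', 'experiment-X', 'derives_from'),
    ('paper-12345', 'review-Y', 'cites');
INSERT INTO processing_steps VALUES
    ('paper-12345', 'experiment-X', 'extraction',
     'extraction-agent', 'llm_structured_extraction');
""")

# LEFT JOIN keeps every link; step columns are NULL where no step was recorded.
rows = conn.execute("""
    SELECT l.source_artifact_id, l.target_artifact_id, l.link_type,
           s.step_type, s.agent_id, s.method
    FROM artifact_links AS l
    LEFT JOIN processing_steps AS s
      ON s.source_artifact_id = l.source_artifact_id
     AND s.target_artifact_id = l.target_artifact_id
    ORDER BY l.target_artifact_id
""").fetchall()

edges = [dict(row) for row in rows]
```

In this sketch the edge to experiment-X carries the extraction metadata, while the edge to review-Y has None for step_type, agent_id, and method.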

Dependencies

None (parallel with schema governance, integrates with provenance system)

Dependents

a17-24-REPR0001 — Reproducible analysis chains use processing steps
d16-24-PROV0001 — Provenance demo showcases processing lineage

Work Log

2026-04-25 — Implementation (task:sen-sg-06-PROC)

Context: the processing_steps table and _upsert_processing_step already existed from a prior agent
(task 7ba524d5). The input_hash/output_hash/started_at/completed_at columns were in the
DB schema but were not being populated; step recording existed only as the private
_upsert_processing_step, with no public record_processing_step() as named in the spec; the
provenance graph did not join with processing_steps; and no /processing-history or
/reproducibility-check endpoints existed.

Changes made:

scidex/atlas/artifact_registry.py:

- ensure_processing_steps_schema: extended SQLite and PG branches to add
input_hash, output_hash, started_at, completed_at columns (CREATE TABLE
and ALTER TABLE migration guards)
- Added _fetch_artifact_content_hash() and _compute_step_input_hash() helpers
- _upsert_processing_step: now auto-derives input_hash from
SHA256(source_content_hash | method | sorted_params) and output_hash from
target artifact's content_hash; populates started_at/completed_at
- Added public record_processing_step() — the named function the spec requires
- Added check_reproducibility() — compares output_hashes across prior runs with
the same input fingerprint
- get_provenance_graph: LEFT JOINs processing_steps to enrich each edge with
step_type, agent_id, method, parameters, input_hash, output_hash,
timing fields when available
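
The hash derivation and the reproducibility comparison described in the list above might look roughly like the following. This is a sketch under assumptions: the "|" separator and the canonical JSON encoding of the parameters are guesses at what "sorted_params" means, and both function bodies are illustrative stand-ins, not the code in artifact_registry.py.

```python
import hashlib
import json


def compute_step_input_hash(source_content_hash, method, parameters):
    """Fingerprint a step's inputs as SHA256(source_hash | method | sorted params)."""
    # Canonical JSON (sorted keys, fixed separators) so that dict ordering
    # does not change the fingerprint. The exact encoding is an assumption.
    canonical_params = json.dumps(parameters or {}, sort_keys=True, separators=(",", ":"))
    payload = "|".join([source_content_hash, method, canonical_params])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def outputs_agree(prior_steps, input_hash):
    """Compare output hashes of prior runs that share one input fingerprint.

    prior_steps is an iterable of (input_hash, output_hash) pairs.
    Returns None when fewer than two matching runs exist (nothing to compare),
    otherwise True if every matching run produced the same output hash.
    """
    outputs = [out for inp, out in prior_steps if inp == input_hash]
    if len(outputs) < 2:
        return None
    return len(set(outputs)) == 1


# Parameter order does not matter because the JSON encoding sorts keys.
fingerprint = compute_step_input_hash(
    "abc123", "llm_structured_extraction", {"schema": "v2", "temperature": 0}
)
same_fingerprint = compute_step_input_hash(
    "abc123", "llm_structured_extraction", {"temperature": 0, "schema": "v2"}
)
```

Returning None for a single run, instead of True, is a choice made for this sketch; it avoids reporting a step as reproducible when it has never been repeated.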

api.py:

- GET /api/artifact/{artifact_id}/processing-history — full transform chain
- GET /api/artifact/{artifact_id}/reproducibility-check?method=...&parameters=...
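
A client calling the reproducibility-check route needs to pass structured parameters in a query string. The sketch below assumes the parameters value is sent as JSON text; the spec only shows "parameters=...", so that encoding, the helper name, and the base URL are all hypothetical.

```python
import json
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def reproducibility_check_url(base_url, artifact_id, method, parameters):
    """Build the request URL, JSON-encoding parameters (an assumed convention)."""
    query = urlencode({
        "method": method,
        "parameters": json.dumps(parameters, sort_keys=True),
    })
    path = f"/api/artifact/{quote(artifact_id, safe='')}/reproducibility-check"
    return f"{base_url.rstrip('/')}{path}?{query}"


# Hypothetical host and artifact id, for illustration only.
url = reproducibility_check_url(
    "http://localhost:8000", "paper-12345",
    "llm_structured_extraction", {"schema": "v2"},
)

# Round trip: what a server would recover from the query string.
decoded = json.loads(parse_qs(urlsplit(url).query)["parameters"][0])
```

Percent-encoding the artifact id keeps ids that contain slashes or spaces from altering the route.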

tests/test_processing_lineage_api_contracts.py: added 9 new AST-based tests
covering the two new routes, record_processing_step, check_reproducibility,
the provenance graph JOIN, and hash population

All 13 contract tests + 4 functional tests pass.

File: sen-sg-06-PROC_processing_step_lineage_spec.md

Modified: 2026-05-01 20:13

Size: 4.0 KB