[Atlas] Processing step lineage — track transforms in provenance chains


Goal

Extend the provenance system to capture not just parent-child artifact relationships but
the processing steps between them. When an experiment is extracted from a paper, the
provenance should record: "Paper 12345 was processed by extraction-agent using
llm_structured_extraction method with schema v2, producing experiment artifact X."

This creates a full audit trail of how every artifact was constructed.
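
As an illustration, the example sentence above could map onto a record shaped like the one below. This is only a sketch: the field names mirror the acceptance criteria in this spec, but the `ProcessingStep` class, the `describe` helper, the artifact ID strings, and the `schema_version` parameter key are assumptions made for the example, not the actual implementation.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ProcessingStep:
    """Hypothetical in-memory shape of one processing step (not the real model)."""

    source_artifact_id: str
    target_artifact_id: str
    step_type: str
    agent_id: str
    method: str
    parameters: Dict[str, Any] = field(default_factory=dict)


def describe(step: ProcessingStep) -> str:
    """Render a step as a human-readable audit sentence."""
    return (
        f"{step.source_artifact_id} was processed by {step.agent_id} "
        f"using {step.method} with parameters {step.parameters}, "
        f"producing {step.target_artifact_id}."
    )


# Placeholder IDs; "schema_version" is an assumed parameter key.
step = ProcessingStep(
    source_artifact_id="paper-12345",
    target_artifact_id="experiment-X",
    step_type="extraction",
    agent_id="extraction-agent",
    method="llm_structured_extraction",
    parameters={"schema_version": "v2"},
)

print(describe(step))
print(asdict(step))
```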

Current State

  • artifact_links captures derives_from, cites, extends relationships
  • provenance_chain JSON in artifacts captures parent artifacts
  • Neither captures the transform applied (what method, what agent, what parameters)

Acceptance Criteria

processing_steps table or extended artifact_links metadata:
- source_artifact_id — input artifact
- target_artifact_id — output artifact
- step_type — extraction, analysis, aggregation, transformation, validation, debate
- agent_id — which agent performed the step
- method — what method/tool was used
- parameters — JSON of method parameters
- started_at, completed_at — timing (populated in _upsert_processing_step)
- input_hash, output_hash — for reproducibility verification (auto-derived from artifact content_hashes)
record_processing_step() function in scidex/atlas/artifact_registry.py — public wrapper around _upsert_processing_step
☑ Processing steps shown in provenance graph visualization — get_provenance_graph enriches edges with processing step metadata via LEFT JOIN
☑ Reproducibility check: check_reproducibility(source_id, method, params) in artifact_registry; exposed via GET /api/artifact/{id}/reproducibility-check
☑ API: GET /api/artifact/{id}/processing-history — delegates to get_processing_lineage
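
The table and the LEFT JOIN enrichment described in these criteria can be sketched with an in-memory SQLite database. The `processing_steps` column names come from the list above; everything else is assumed for illustration, including the column types, the simplified `artifact_links` layout (`parent_id`, `child_id`, `link_type`), the join condition, and the sample IDs. The real schema and the query inside get_provenance_graph may differ.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Assumed layouts: only the processing_steps column names come from the spec.
conn.executescript(
    """
    CREATE TABLE artifact_links (
        parent_id TEXT NOT NULL,
        child_id  TEXT NOT NULL,
        link_type TEXT NOT NULL
    );
    CREATE TABLE processing_steps (
        id                 INTEGER PRIMARY KEY,
        source_artifact_id TEXT NOT NULL,
        target_artifact_id TEXT NOT NULL,
        step_type          TEXT NOT NULL,
        agent_id           TEXT,
        method             TEXT,
        parameters         TEXT,  -- JSON-encoded method parameters
        started_at         TEXT,
        completed_at       TEXT,
        input_hash         TEXT,
        output_hash        TEXT
    );
    """
)

conn.executemany(
    "INSERT INTO artifact_links VALUES (?, ?, ?)",
    [
        ("paper-12345", "experiment-X", "derives_from"),
        ("paper-999", "experiment-X", "cites"),  # no recorded step for this edge
    ],
)
conn.execute(
    "INSERT INTO processing_steps "
    "(source_artifact_id, target_artifact_id, step_type, agent_id, method, parameters) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (
        "paper-12345",
        "experiment-X",
        "extraction",
        "extraction-agent",
        "llm_structured_extraction",
        json.dumps({"schema_version": "v2"}),
    ),
)

# LEFT JOIN keeps every edge; step columns are NULL where no step was recorded.
rows = conn.execute(
    """
    SELECT l.parent_id, l.child_id, l.link_type,
           s.step_type, s.agent_id, s.method, s.parameters
    FROM artifact_links AS l
    LEFT JOIN processing_steps AS s
           ON s.source_artifact_id = l.parent_id
          AND s.target_artifact_id = l.child_id
    ORDER BY l.parent_id
    """
).fetchall()
edges = [dict(row) for row in rows]

for edge in edges:
    print(edge)
```

Using a LEFT JOIN rather than an inner join means that edges such as plain citations, which have no processing step, still appear in the graph.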

Dependencies

  • None (parallel with schema governance, integrates with provenance system)

Dependents

  • a17-24-REPR0001 — Reproducible analysis chains use processing steps
  • d16-24-PROV0001 — Provenance demo showcases processing lineage

Work Log

2026-04-25 — Implementation (task:sen-sg-06-PROC)

Context: the processing_steps table and _upsert_processing_step already existed from a prior agent
(task 7ba524d5). The input_hash/output_hash/started_at/completed_at columns were in the
DB schema but were not being populated; only the private _upsert_processing_step existed, with no
public record_processing_step() as named in the spec; the provenance graph did not join with
processing_steps; and no /processing-history or /reproducibility-check endpoints existed.

Changes made:

  • scidex/atlas/artifact_registry.py:
- ensure_processing_steps_schema: extended SQLite and PG branches to add
input_hash, output_hash, started_at, completed_at columns (CREATE TABLE
and ALTER TABLE migration guards)
- Added _fetch_artifact_content_hash() and _compute_step_input_hash() helpers
- _upsert_processing_step: now auto-derives input_hash from
SHA256(source_content_hash | method | sorted_params) and output_hash from
target artifact's content_hash; populates started_at/completed_at
- Added public record_processing_step() — the named function the spec requires
- Added check_reproducibility() — compares output_hashes across prior runs with
the same input fingerprint
- get_provenance_graph: LEFT JOINs processing_steps to enrich each edge with
step_type, agent_id, method, parameters, input_hash, output_hash,
timing fields when available

  • api.py:
- GET /api/artifact/{artifact_id}/processing-history — full transform chain
- GET /api/artifact/{artifact_id}/reproducibility-check?method=...&parameters=...

  • tests/test_processing_lineage_api_contracts.py: added 9 new AST-based tests
covering the two new routes, record_processing_step, check_reproducibility,
provenance graph JOIN, and hash population
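
The input fingerprint and the reproducibility comparison listed above can be sketched as follows. The formula SHA256(source_content_hash | method | sorted_params) is taken from this work log, but the exact delimiter, the JSON canonicalization, and the function names `compute_input_hash` and `summarize_reproducibility` are assumptions for illustration. The rule that at least two prior runs are needed before calling a step reproducible is also an assumption, and the real check_reproducibility reads prior runs from the database rather than from a list.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Optional


def compute_input_hash(
    source_content_hash: str, method: str, parameters: Optional[Dict[str, Any]]
) -> str:
    """Fingerprint a step's inputs; key order in parameters must not matter."""
    canonical_params = json.dumps(
        parameters or {}, sort_keys=True, separators=(",", ":")
    )
    payload = "|".join([source_content_hash, method, canonical_params])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def summarize_reproducibility(
    prior_steps: Iterable[Dict[str, Any]], input_hash: str
) -> Dict[str, Any]:
    """Compare output hashes across prior runs that share one input fingerprint."""
    runs = [s for s in prior_steps if s.get("input_hash") == input_hash]
    outputs = {s["output_hash"] for s in runs if s.get("output_hash")}
    return {
        "input_hash": input_hash,
        "prior_runs": len(runs),
        "distinct_output_hashes": sorted(outputs),
        # Assumed rule: reproducible means 2+ runs that all agree on the output.
        "reproducible": len(runs) >= 2 and len(outputs) == 1,
    }


# Placeholder content hash and parameters.
fingerprint = compute_input_hash(
    "abc123", "llm_structured_extraction", {"schema_version": "v2", "temperature": 0}
)
history = [
    {"input_hash": fingerprint, "output_hash": "out-1"},
    {"input_hash": fingerprint, "output_hash": "out-1"},
]
stable = summarize_reproducibility(history, fingerprint)

# A third run with a different output breaks reproducibility.
history.append({"input_hash": fingerprint, "output_hash": "out-2"})
unstable = summarize_reproducibility(history, fingerprint)

print(stable["reproducible"], unstable["reproducible"])
```

Serializing the parameters with sorted keys is what makes the fingerprint stable when the same parameters arrive in a different order.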

All 13 contract tests + 4 functional tests pass.
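
For readers unfamiliar with AST-based contract tests, the idea is to parse the source with Python's `ast` module and assert on its structure without importing or running it. The sketch below is not taken from tests/test_processing_lineage_api_contracts.py: the `SAMPLE_SOURCE` string and the helpers `defined_functions` and `calls_made_by` are hypothetical, and only the two function names come from this spec.

```python
import ast
from typing import Set

# Hypothetical stand-in for the module under test.
SAMPLE_SOURCE = '''
def _upsert_processing_step(conn, **fields):
    return fields


def record_processing_step(conn, **fields):
    return _upsert_processing_step(conn, **fields)
'''

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def defined_functions(source: str) -> Set[str]:
    """Names of all functions defined anywhere in the source."""
    tree = ast.parse(source)
    return {node.name for node in ast.walk(tree) if isinstance(node, FUNCTION_NODES)}


def calls_made_by(source: str, func_name: str) -> Set[str]:
    """Names of plain functions called inside the body of func_name."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FUNCTION_NODES) and node.name == func_name:
            return {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return set()


def test_public_wrapper_exists_and_delegates() -> None:
    assert "record_processing_step" in defined_functions(SAMPLE_SOURCE)
    assert "_upsert_processing_step" in calls_made_by(
        SAMPLE_SOURCE, "record_processing_step"
    )


test_public_wrapper_exists_and_delegates()
```

A check like this verifies that the contract holds at the source level (the function exists and delegates) but says nothing about runtime behavior, which is what the functional tests cover.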

File: sen-sg-06-PROC_processing_step_lineage_spec.md
Modified: 2026-05-01 20:13
Size: 4.0 KB