[Atlas] Processing step lineage — track transforms in provenance chains (status: done)

Quest: Schema Governance
processing_steps table capturing agent, method, parameters, timing, hashes for reproducibility

Completion Notes

Auto-release: work already on origin/main

Git Commits (17)

Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (117 commits) (#179) 2026-04-26
Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (116 commits) (#177) 2026-04-26
Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (80 commits) (#143) 2026-04-26
Squash merge: orchestra/task/sen-sg-0-auto-migration-generation-from-approved (4 commits) (#90) 2026-04-26
[Senate] Add schema proposal create/vote API routes; emit approval event [task:sen-sg-02-PROP] 2026-04-25
[Atlas] Processing step lineage: add record_processing_step, reproducibility check, provenance graph enrichment, and /processing-history API (api.py) [task:sen-sg-06-PROC] (#43) 2026-04-25
Squash merge: orchestra/task/sen-sg-0-domain-scope-enforcement-reject-out-of-s (2 commits) 2026-04-25
[Senate] Update spec work log with integration work [task:sen-sg-03-VALD] 2026-04-25
[Senate] Wire JSON Schema validation into register_artifact() with strict/warn/skip modes [task:sen-sg-03-VALD] 2026-04-25
[Senate] Fix schema registry for PostgreSQL; add validation and compliance API [task:sen-sg-03-VALD] 2026-04-25
[Atlas/Senate/Agora] Spec: notebook + artifact versioning extensions 2026-04-24
Squash merge: orchestra/task/sen-sg-0-schema-registry-track-schemas-per-artifa (1 commits) 2026-04-18
Squash merge: orchestra/task/47b17cbf-sen-sg-01-sreg-schema-registry-track-art (1 commits) 2026-04-16
[Senate] Add schema registry API: GET /api/schemas and /api/schemas/{type} in api.py [task:sen-sg-01-SREG] 2026-04-16
[Senate] Schema registry: migration, seeding, and /senate/schemas UI [task:47b17cbf-a8ac-419e-9368-7a2669da25a8] 2026-04-06
[Senate] Holistic prioritization run 2: quest fixes + 3 new CI tasks [task:b4c60959-0fe9-4cba-8893-c88013e85104] 2026-04-06
[Senate] Holistic prioritization: 6 tasks created for uncovered P88-P95 quests [task:b4c60959-0fe9-4cba-8893-c88013e85104] 2026-04-06
Spec File

Goal

Extend the provenance system to capture not just parent-child artifact relationships but
the processing steps between them. When an experiment is extracted from a paper, the
provenance should record: "Paper 12345 was processed by extraction-agent using
llm_structured_extraction method with schema v2, producing experiment artifact X."

This creates a full audit trail of how every artifact was constructed.
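
As an illustration of the record described in the goal, the sketch below models one processing step as a small Python dataclass and renders it back into the audit sentence. The class name `ProcessingStep`, the helper `describe`, and the artifact IDs are hypothetical; only the field names come from the acceptance criteria in this spec, and the real storage lives in the processing_steps table rather than in a dataclass.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ProcessingStep:
    """Hypothetical in-memory shape of one transform between two artifacts."""
    source_artifact_id: str
    target_artifact_id: str
    step_type: str
    agent_id: str
    method: str
    parameters: Dict[str, Any] = field(default_factory=dict)


def describe(step: ProcessingStep) -> str:
    """Render a step as a one-line audit sentence (illustrative only)."""
    params = ", ".join(f"{k}={v}" for k, v in sorted(step.parameters.items()))
    return (
        f"{step.source_artifact_id} was processed by {step.agent_id} "
        f"using {step.method} with {params}, "
        f"producing {step.target_artifact_id}"
    )


# Placeholder IDs mirroring the example sentence in the goal.
step = ProcessingStep(
    source_artifact_id="paper-12345",
    target_artifact_id="experiment-X",
    step_type="extraction",
    agent_id="extraction-agent",
    method="llm_structured_extraction",
    parameters={"schema_version": "v2"},
)
summary = describe(step)
```

Sorting the parameters before rendering keeps the sentence stable regardless of the order in which the caller supplied them.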

Current State

  • artifact_links captures derives_from, cites, extends relationships
  • provenance_chain JSON in artifacts captures parent artifacts
  • Neither captures the transform applied (what method, what agent, what parameters)

Acceptance Criteria

processing_steps table or extended artifact_links metadata:
- source_artifact_id — input artifact
- target_artifact_id — output artifact
- step_type — extraction, analysis, aggregation, transformation, validation, debate
- agent_id — which agent performed the step
- method — what method/tool was used
- parameters — JSON of method parameters
- started_at, completed_at — timing (populated in _upsert_processing_step)
- input_hash, output_hash — for reproducibility verification (auto-derived from artifact content_hashes)
record_processing_step() function in scidex/atlas/artifact_registry.py — public wrapper around _upsert_processing_step
☑ Processing steps shown in provenance graph visualization — get_provenance_graph enriches edges with processing step metadata via LEFT JOIN
☑ Reproducibility check: check_reproducibility(source_id, method, params) in artifact_registry; exposed via GET /api/artifact/{id}/reproducibility-check
☑ API: GET /api/artifact/{id}/processing-history — delegates to get_processing_lineage
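
The criteria above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the schema shipped in scidex/atlas/artifact_registry.py: the column types, the `id` primary key, the artifact_links columns, and the join condition on source/target IDs are all assumptions. It shows how a LEFT JOIN keeps every provenance edge while attaching step metadata only where a processing step was recorded.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row

# Assumed table shapes; only the processing_steps column names come from the spec.
conn.executescript("""
CREATE TABLE artifact_links (
    source_artifact_id TEXT NOT NULL,
    target_artifact_id TEXT NOT NULL,
    link_type          TEXT NOT NULL
);
CREATE TABLE processing_steps (
    id                 INTEGER PRIMARY KEY,
    source_artifact_id TEXT NOT NULL,
    target_artifact_id TEXT NOT NULL,
    step_type          TEXT NOT NULL,
    agent_id           TEXT,
    method             TEXT,
    parameters         TEXT,  -- JSON-encoded method parameters
    started_at         TEXT,
    completed_at       TEXT,
    input_hash         TEXT,
    output_hash        TEXT
);
""")

# Two edges from the same paper; only the first has a recorded step.
conn.execute("INSERT INTO artifact_links VALUES (?, ?, ?)",
             ("paper-12345", "experiment-X", "derives_from"))
conn.execute("INSERT INTO artifact_links VALUES (?, ?, ?)",
             ("paper-12345", "review-Y", "cites"))
conn.execute(
    "INSERT INTO processing_steps "
    "(source_artifact_id, target_artifact_id, step_type, agent_id, method, parameters) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("paper-12345", "experiment-X", "extraction", "extraction-agent",
     "llm_structured_extraction", json.dumps({"schema_version": "v2"})),
)

# LEFT JOIN: edges without a step still appear, with NULL step columns.
edges = conn.execute("""
SELECT l.source_artifact_id, l.target_artifact_id, l.link_type,
       s.step_type, s.agent_id, s.method, s.parameters
FROM artifact_links AS l
LEFT JOIN processing_steps AS s
  ON s.source_artifact_id = l.source_artifact_id
 AND s.target_artifact_id = l.target_artifact_id
ORDER BY l.target_artifact_id
""").fetchall()
```

An inner join would silently drop the `cites` edge here, which is why a LEFT JOIN suits graph enrichment: step metadata is optional decoration on an edge, not a condition for the edge to exist.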

Dependencies

  • None (parallel with schema governance, integrates with provenance system)

Dependents

  • a17-24-REPR0001 — Reproducible analysis chains use processing steps
  • d16-24-PROV0001 — Provenance demo showcases processing lineage

Work Log

2026-04-25 — Implementation (task:sen-sg-06-PROC)

Context: the processing_steps table and _upsert_processing_step already existed from a prior agent
(task 7ba524d5). The input_hash, output_hash, started_at, and completed_at columns were in the
DB schema but were not being populated; the record_processing_step() function named in the spec
did not exist as a public function (only the private _upsert_processing_step did); the provenance
graph did not join with processing_steps; and no /processing-history or /reproducibility-check
endpoints existed.

Changes made:

  • scidex/atlas/artifact_registry.py:
- ensure_processing_steps_schema: extended SQLite and PG branches to add
input_hash, output_hash, started_at, completed_at columns (CREATE TABLE
and ALTER TABLE migration guards)
- Added _fetch_artifact_content_hash() and _compute_step_input_hash() helpers
- _upsert_processing_step: now auto-derives input_hash from
SHA256(source_content_hash | method | sorted_params) and output_hash from
target artifact's content_hash; populates started_at/completed_at
- Added public record_processing_step() — the named function the spec requires
- Added check_reproducibility() — compares output_hashes across prior runs with
the same input fingerprint
- get_provenance_graph: LEFT JOINs processing_steps to enrich each edge with
step_type, agent_id, method, parameters, input_hash, output_hash,
timing fields when available

  • api.py:
- GET /api/artifact/{artifact_id}/processing-history — full transform chain
- GET /api/artifact/{artifact_id}/reproducibility-check?method=...&parameters=...

  • tests/test_processing_lineage_api_contracts.py: added 9 new AST-based tests
covering the two new routes, record_processing_step, check_reproducibility,
provenance graph JOIN, and hash population

All 13 contract tests + 4 functional tests pass.
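
The input fingerprint and the reproducibility comparison described in the changes above can be sketched as follows. This is a hedged approximation, not the code in artifact_registry.py: the work log only states SHA256(source_content_hash | method | sorted_params), so the `|` separator, the JSON canonicalization of parameters, and the function names `compute_step_input_hash` and `outputs_agree` are assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Optional


def compute_step_input_hash(
    source_content_hash: str, method: str, parameters: Dict[str, Any]
) -> str:
    """Fingerprint a step's inputs; the serialization details are assumed."""
    # sort_keys makes the hash independent of parameter ordering.
    canonical_params = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    payload = "|".join([source_content_hash, method, canonical_params])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def outputs_agree(
    prior_steps: Iterable[Dict[str, Any]], input_hash: str
) -> Optional[bool]:
    """Compare output hashes across prior runs with the same input fingerprint.

    Returns None when no prior run matches, True when every matching run
    produced the same output_hash, and False when the outputs diverge.
    """
    outputs = {
        step["output_hash"]
        for step in prior_steps
        if step.get("input_hash") == input_hash and step.get("output_hash")
    }
    if not outputs:
        return None
    return len(outputs) == 1


# Placeholder content hash and parameters for illustration.
fingerprint = compute_step_input_hash(
    "abc123", "llm_structured_extraction", {"schema_version": "v2", "temperature": 0}
)
history = [
    {"input_hash": fingerprint, "output_hash": "out-1"},
    {"input_hash": fingerprint, "output_hash": "out-1"},
]
verdict = outputs_agree(history, fingerprint)
```

Returning None for "no prior runs" keeps that case distinct from a confirmed mismatch, so a caller can report "unverified" rather than "not reproducible".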
