[Forge] Artifact enrichment quest — evaluation consumer

Task

ID: fbb838fb-e5aa-4515-8f30-00959622ce98
Type: recurring
Frequency: every-6h
Layer: Forge

Goal

Keep model, notebook, and benchmark artifacts connected to the evaluation
context they need in order to be trusted: which dataset or benchmark was used
to evaluate them, what other artifacts share their entities, and where their
provenance chain starts. Naked artifacts (no eval dataset, no provenance) are
unusable for downstream claims.

What it does

Operates in three phases per cycle:

Evaluation context: For each artifact missing an
evaluation_dataset_id / evaluation_benchmark_id, attempts to infer
and attach the correct reference by:
- scanning the artifact's README / notebook cells for
  dataset: / benchmark: mentions,
- matching against the dataset_registry and benchmarks tables,
- writing back the inferred link with confidence_score.
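
The scan-and-match step above can be sketched as follows. This is a minimal illustration, not the driver's actual logic: the regular expression, the `InferredLink` type, the name-to-ID dictionaries standing in for the dataset_registry and benchmarks tables, and the "share of matched mentions" confidence heuristic are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for "dataset: <name>" / "benchmark: <name>" mentions.
MENTION_RE = re.compile(r"\b(dataset|benchmark):\s*([A-Za-z0-9_.\-]+)", re.IGNORECASE)


@dataclass
class InferredLink:
    kind: str                # "dataset" or "benchmark"
    ref_id: str              # real ID taken from the registry, never a placeholder
    confidence_score: float  # assumed heuristic: share of matched mentions


def infer_evaluation_link(text, dataset_registry, benchmarks) -> Optional[InferredLink]:
    """Return the best-supported link, or None if no mention matches a registry.

    dataset_registry / benchmarks are assumed to map lowercased names to IDs.
    """
    tables = {"dataset": dataset_registry, "benchmark": benchmarks}
    counts = {}
    for kind, name in MENTION_RE.findall(text):
        kind = kind.lower()
        ref_id = tables[kind].get(name.rstrip(".").lower())
        if ref_id is not None:
            counts[(kind, ref_id)] = counts.get((kind, ref_id), 0) + 1
    if not counts:
        return None  # nothing to write back; no stub link is produced
    total = sum(counts.values())
    (kind, ref_id), hits = max(counts.items(), key=lambda kv: kv[1])
    return InferredLink(kind, ref_id, hits / total)
```

Returning `None` instead of a low-confidence guess keeps the "no stubs" requirement intact: a link is only written when a registered ID was actually found.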

Cross-links: For pairs of artifacts that share ≥2 entities (same
gene, drug, pathway), inserts a row in artifact_links with edge type
shares_entities and the list of shared entity IDs.
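
A minimal sketch of the pairing rule, assuming each artifact's entities are already available as a set of IDs. The row keys `source_id`, `target_id`, and `shared_entity_ids` are hypothetical; only the edge type `shares_entities` comes from this spec.

```python
from itertools import combinations


def shared_entity_links(artifact_entities, min_shared=2):
    """Build one link row per artifact pair sharing at least `min_shared` entities.

    artifact_entities: dict mapping artifact ID -> iterable of entity IDs.
    """
    links = []
    # Sorting the IDs gives each unordered pair a stable (source, target) order,
    # so re-running the phase yields the same rows.
    for a, b in combinations(sorted(artifact_entities), 2):
        shared = sorted(set(artifact_entities[a]) & set(artifact_entities[b]))
        if len(shared) >= min_shared:
            links.append({
                "source_id": a,               # hypothetical column name
                "target_id": b,               # hypothetical column name
                "edge_type": "shares_entities",
                "shared_entity_ids": shared,  # hypothetical column name
            })
    return links
```

The all-pairs loop is quadratic in the number of artifacts; for a large catalogue an inverted index from entity to artifacts would likely be preferable.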

Provenance backfill: For each artifact with an empty
provenance_chain, walks upstream dependencies (input datasets,
parent notebooks, generating hypotheses) and materialises the chain
as a JSON column.
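
The upstream walk could look like the sketch below. It assumes the dependencies are available as a plain mapping from an artifact ID to its direct parents (a hypothetical shape) and uses a breadth-first traversal, which is one reasonable choice rather than the documented one.

```python
import json


def build_provenance_chain(artifact_id, upstream):
    """Walk upstream dependencies breadth-first and return a JSON array string.

    upstream: dict mapping an ID to the list of its direct parent IDs
    (input datasets, parent notebooks, generating hypotheses).
    """
    chain = []
    seen = {artifact_id}  # guards against cycles and repeated parents
    frontier = [artifact_id]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in upstream.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    chain.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    # An artifact with no upstream serialises to '[]', the explicit
    # "no upstream" marker named in the success criteria.
    return json.dumps(chain)
```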

Idempotent: each artifact in each phase is skipped if its state is already
filled for that phase this cycle.

Releases as a no-op when all artifacts are fully enriched.
Emits agent_contributions (type=artifact_enrichment) per write.

Success criteria

Among artifacts that claim to be evaluated, the fraction with a non-null
evaluation_dataset_id or evaluation_benchmark_id trends to 100%.
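
The coverage figure in this criterion can be computed as in the sketch below. The `claims_evaluation` flag is a hypothetical stand-in for however the driver decides that an artifact claims to be evaluated; the two ID fields are the ones named in this spec.

```python
def evaluation_link_coverage(artifacts):
    """Fraction of evaluation-claiming artifacts that carry a dataset or benchmark link.

    artifacts: iterable of dicts. Returns 1.0 when no artifact claims
    evaluation, since there is nothing left to enrich.
    """
    evaluated = [a for a in artifacts if a.get("claims_evaluation")]
    if not evaluated:
        return 1.0
    linked = sum(
        1
        for a in evaluated
        if a.get("evaluation_dataset_id") is not None
        or a.get("evaluation_benchmark_id") is not None
    )
    return linked / len(evaluated)
```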

Cross-link graph is non-empty: every artifact that shares ≥2 entities with
another artifact has at least one shares_entities row.

Every artifact has a populated provenance_chain or is explicitly marked
as having no upstream (provenance_chain = '[]').

Run log: artifacts scanned per phase, links inferred, cross-links created,
provenance filled, retries.

Quality requirements

No stubs: inferred evaluation links must include a real dataset/benchmark
ID and a non-zero confidence score, not a placeholder (see meta-quest
quest_quality_standards_spec.md).

When processing ≥10 artifacts per cycle, use 3–5 parallel agents (one per
phase or one per slice) so the three phases don't serialise.

Log total items processed and retries so we can detect busywork: the same
artifact revisited without a state change means either an idempotency bug
or genuinely ambiguous inference. In the latter case, escalate to a Senate
task.
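
One way to surface busywork from the run log is sketched below. The log record shape `(artifact_id, phase, state_before, state_after)` and the threshold of two unchanged visits are assumptions made for illustration; the spec only says that revisits without a state change should be detectable.

```python
from collections import defaultdict


def find_busywork(run_log, min_visits=2):
    """Return {(artifact_id, phase): count} for repeated visits with no state change.

    run_log: iterable of (artifact_id, phase, state_before, state_after) tuples.
    """
    stalled = defaultdict(int)
    for artifact_id, phase, before, after in run_log:
        if before == after:
            stalled[(artifact_id, phase)] += 1
    # A single unchanged visit is not yet a revisit; flag only repeats.
    return {key: n for key, n in stalled.items() if n >= min_visits}
```

Flagged entries still need a human or follow-up check to decide between the two causes: an idempotency bug, or inference that is ambiguous enough to warrant a Senate task.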

INFERRED: artifact_links and provenance_chain column names follow the
pattern in artifact_enrichment_quest_spec.md; the driver reads schema
at runtime.

File: fbb838fb-e5aa-4515-8f30-00959622ce98_artifact_enrichment_eval_consumer_spec.md

Modified: 2026-05-01 20:13

Size: 3.1 KB