SciDEX — Task: [Forge] Skill versioning + drift detector

Pin invocations to bundle_sha; raise drift events (KS + z-test) when a new SHA's success rate, citation rate, or latency diverges from the prior SHA's.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/5d6c710c-skill-versioning-drift-detector-quality (3 commits) (#726) 2026-04-27

Spec File

Effort: deep

Goal

migrations/114_skill_registry_canonical.py already gives every skill a version and bundle_sha, but agent_skill_invocations does not record
which version produced each output. When a bundle author edits the SKILL.md
prompt, the citation/error rate can shift silently. Build a drift detector
that pins each invocation to a (skill_name, bundle_sha) pair, computes
per-version quality distributions, and raises an alert whenever a new
version's success or citation rate diverges meaningfully from the prior
version on a matched workload.

Acceptance Criteria

☑ Migration migrations/20260428_skill_invocations_version.sql adds
bundle_sha TEXT and bundle_version TEXT columns to
agent_skill_invocations, plus partial index
idx_asi_skill_sha ON agent_skill_invocations(skill_name, bundle_sha)
WHERE bundle_sha != ''.

☑ scidex/agora/skill_evidence.py:_log_invocation reads the current
bundle_sha + version from the registry (cached in-process for
30 s) and writes both fields. Backfill SQL in the migration sets
historical rows to ('', '') so queries can filter with
WHERE bundle_sha != ''.

☑ Drift module scidex/forge/skill_drift.py exposes

compute_drift(skill_name: str, current_sha: str, prior_sha: str,
       window_days: int = 30) -> DriftResult

It uses Welch's t-test on latency, a two-proportion z-test on
success_rate and citation_rate, and a Wilson 95% CI on each rate.
It returns is_drifted: bool (any p-value < 0.01 with an absolute
effect size ≥ 0.05), the full statistic block, and a severity enum
(none/low/high).

☑ On every successful auto-register-as-drifted event from
q-skills-bundle-auto-discovery, compute_drift() is invoked
automatically; results are persisted to a new table:

skill_drift_events(id, skill_name, prior_sha, new_sha, severity,
       stats_json, raised_at, ack_at, ack_by)

☑ HTML page /forge/skills/drift lists all unacknowledged events,
sortable by severity + raised_at. POST /api/forge/skills/drift/{id}/ack
acknowledges an event with an optional note.

☑ Alert hook: severity = 'high' writes a row to senate_alerts,
mirroring the integrity-sweeper pattern; severity = 'low' only logs.

☑ Tests tests/test_skill_drift.py:
(a) two SHAs, identical distributions → is_drifted=False.
(b) success_rate flips 0.95 → 0.50, n=200 each → severity=high.
(c) latency p50 doubles, success unchanged → severity reflects
latency-only drift, citation-rate p-value not significant.
(d) low n (5/5) → CI too wide → is_drifted=False even with a big
point shift (avoids false alarms on quiet skills).

☑ Smoke: insert two synthetic SHAs of pubmed_search with 100
invocations each, force the second to have 50% success; a drift
event row is written with severity=high.
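The rate checks named in the criteria above (two-proportion z-test, Wilson 95% CI, and the p < 0.01 with absolute effect ≥ 0.05 rule) can be sketched as follows. This is a minimal illustration, not the contents of scidex/forge/skill_drift.py: the names wilson_ci, two_proportion_z, rate_drift, and RateDrift are hypothetical, and the normal p-value is computed with math.erfc rather than scipy.stats so the block has no third-party imports.

```python
import math
from dataclasses import dataclass

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_ci(successes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:  # both samples are all-success or all-failure
        return (0.0, 1.0)
    z = (s1 / n1 - s2 / n2) / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)).
    return (z, math.erfc(abs(z) / math.sqrt(2)))


@dataclass
class RateDrift:  # hypothetical result type, not the spec's DriftResult
    p_value: float
    effect: float
    prior_ci: tuple[float, float]
    new_ci: tuple[float, float]
    is_drifted: bool


def rate_drift(prior_ok: int, prior_n: int, new_ok: int, new_n: int,
               alpha: float = 0.01, min_effect: float = 0.05) -> RateDrift:
    """Apply the 'p < alpha and absolute effect >= min_effect' rule to one rate."""
    _, p = two_proportion_z(prior_ok, prior_n, new_ok, new_n)
    effect = abs(new_ok / new_n - prior_ok / prior_n)
    return RateDrift(
        p_value=p,
        effect=effect,
        prior_ci=wilson_ci(prior_ok, prior_n),
        new_ci=wilson_ci(new_ok, new_n),
        is_drifted=(p < alpha and effect >= min_effect),
    )
```

Calling rate_drift(190, 200, 100, 200) corresponds to test case (b), a 0.95 → 0.50 flip at n=200, and flags drift; identical counts as in case (a) do not. The severity mapping and the low-n guard of case (d) are not described in enough detail above to reproduce, so the sketch leaves them out.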

Approach

Migration first; backfill SHAs as empty strings so historical rows
hold no NULLs.

Update _log_invocation with a tiny in-process LRU
(functools.lru_cache(maxsize=128)) keyed on skill_name and
refreshed every 30 s.

The statistics live in scipy.stats (already a dependency via the
pydeseq2 skill). Wilson CI is a hand-rolled helper to avoid pulling in
statsmodels solely for it.

Wire the alert + HTML; reuse templates/forge/base.html.

Tests use numpy.random.RandomState(42) for determinism.

Dependencies

q-skills-bundle-auto-discovery — surfaces the new SHA.
q-skills-usage-telemetry — share the latency/citation-rate query
primitives.

Dependents

q-skills-cost-rationality — drift severity feeds the routing-cost
model (do not route hot traffic to a drifted skill).

Work Log

2026-04-27 — Implemented full spec. Migration 20260428_skill_invocations_version.sql adds
bundle_sha TEXT + bundle_version TEXT to agent_skill_invocations with
partial index idx_asi_skill_sha WHERE bundle_sha != ''. Backfill SQL sets
historical rows to ('', ''). _log_invocation updated with 30-s lru_cache on
_get_bundle_info(skill_name). All 5 call sites pass bundle_sha/bundle_version.
scidex/forge/skill_drift.py implements compute_drift() with Welch's t-test on
latency, two-proportion z-test on success/cited rates, Wilson 95% CI.
skill_drift_events extended with prior_sha, new_sha, stats_json,
ack_at, ack_by. HTML /forge/skills/drift + POST ack endpoint added.
senate_alerts table created (high-severity hook). 7 tests pass including
smoke test that confirms severity=high drift event is written to DB.
Commit d7e1671cb.
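The migration steps recorded in this log (add the two columns, backfill historical rows to empty strings, create the partial index) can be exercised end to end with a small sketch. It uses an in-memory SQLite database purely as a stand-in, since the production database engine is not stated here; the pre-migration table layout, the pinned_counts helper, and the sample SHA 'abc123' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical minimal pre-migration table; the real one has more columns.
conn.execute(
    "CREATE TABLE agent_skill_invocations ("
    "id INTEGER PRIMARY KEY, skill_name TEXT NOT NULL, success INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO agent_skill_invocations (skill_name, success) VALUES (?, ?)",
    [("pubmed_search", 1), ("pubmed_search", 0)],
)

# Migration: add the columns, backfill history to '', add the partial index.
conn.executescript(
    """
    ALTER TABLE agent_skill_invocations ADD COLUMN bundle_sha TEXT;
    ALTER TABLE agent_skill_invocations ADD COLUMN bundle_version TEXT;
    UPDATE agent_skill_invocations
       SET bundle_sha = '', bundle_version = ''
     WHERE bundle_sha IS NULL;
    CREATE INDEX idx_asi_skill_sha
        ON agent_skill_invocations(skill_name, bundle_sha)
     WHERE bundle_sha != '';
    """
)

# A post-migration invocation pinned to a (made-up) bundle SHA.
conn.execute(
    "INSERT INTO agent_skill_invocations "
    "(skill_name, success, bundle_sha, bundle_version) VALUES (?, ?, ?, ?)",
    ("pubmed_search", 1, "abc123", "1.1.0"),
)


def pinned_counts(db: sqlite3.Connection, skill_name: str) -> dict[str, tuple[int, int]]:
    """Per-SHA (invocations, successes), skipping unpinned historical rows."""
    rows = db.execute(
        "SELECT bundle_sha, COUNT(*), SUM(success) "
        "FROM agent_skill_invocations "
        "WHERE skill_name = ? AND bundle_sha != '' "
        "GROUP BY bundle_sha",
        (skill_name,),
    ).fetchall()
    return {sha: (n, ok) for sha, n, ok in rows}
```

Because the query repeats the index predicate bundle_sha != '', the two backfilled rows are excluded and only pinned invocations are counted per version.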