[Forge] Skill versioning + drift detector - quality-stability per bundle SHA done

Pin invocations to bundle_sha; raise drift events (Welch's t-test + two-proportion z-test) when a new SHA's success rate, citation rate, or latency diverges from the prior SHA's.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/5d6c710c-skill-versioning-drift-detector-quality (3 commits) (#726), 2026-04-27

Spec File

Effort: deep

Goal

migrations/114_skill_registry_canonical.py already gives every skill a version and bundle_sha, but agent_skill_invocations does not record
which version produced each output. When a bundle author edits the SKILL.md
prompt, the citation/error rate can shift silently. Build a drift detector
that pins each invocation to a (skill_name, bundle_sha) pair, computes
per-version quality distributions, and raises an alert whenever a new
version's success or citation rate diverges meaningfully from the prior
version on a matched workload.
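
As a rough illustration of what pinning makes possible, the sketch below stores bundle_sha on each invocation row and then groups by it to get a per-version success rate. It is not the project's code. The use of SQLite, the `success` column, and the helper names `make_demo_db`, `log_invocation`, and `success_rate_by_sha` are all assumptions made for the example. Only the table name and the bundle_sha / bundle_version columns come from the spec.

```python
import sqlite3


def make_demo_db():
    """In-memory stand-in for agent_skill_invocations (SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agent_skill_invocations (
            id INTEGER PRIMARY KEY,
            skill_name TEXT NOT NULL,
            success INTEGER NOT NULL,             -- hypothetical column: 1 = succeeded
            bundle_sha TEXT NOT NULL DEFAULT '',  -- '' marks unpinned legacy rows
            bundle_version TEXT NOT NULL DEFAULT ''
        );
        """
    )
    return conn


def log_invocation(conn, skill_name, success, bundle_sha="", bundle_version=""):
    """Record one invocation together with the bundle that produced it."""
    conn.execute(
        "INSERT INTO agent_skill_invocations "
        "(skill_name, success, bundle_sha, bundle_version) VALUES (?, ?, ?, ?)",
        (skill_name, int(success), bundle_sha, bundle_version),
    )


def success_rate_by_sha(conn, skill_name):
    """Return {bundle_sha: (n, success_rate)}, skipping unpinned rows."""
    rows = conn.execute(
        """
        SELECT bundle_sha, COUNT(*), AVG(success)
        FROM agent_skill_invocations
        WHERE skill_name = ? AND bundle_sha != ''
        GROUP BY bundle_sha
        """,
        (skill_name,),
    ).fetchall()
    return {sha: (n, rate) for sha, n, rate in rows}
```

Without the bundle_sha column, the same query could only return one blended rate per skill, which is why a prompt edit could shift quality without anyone noticing.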

Acceptance Criteria

☑ Migration migrations/20260428_skill_invocations_version.sql adds
bundle_sha TEXT and bundle_version TEXT columns to
agent_skill_invocations, plus partial index
idx_asi_skill_sha ON agent_skill_invocations(skill_name, bundle_sha).
scidex/agora/skill_evidence.py:_log_invocation reads the current
bundle_sha + version from the registry (cached in-process for
30 s) and writes both fields. Backfill SQL in the migration sets
historical rows to ('', '') so queries can WHERE bundle_sha != ''.
☑ Drift module scidex/forge/skill_drift.py exposes
compute_drift(skill_name: str, current_sha: str, prior_sha: str,
window_days: int = 30) -> DriftResult.
It uses Welch's t-test on latency, a two-proportion z-test on
success_rate + citation_rate, and a Wilson 95% CI on each rate.
It returns is_drifted: bool (any p-value < 0.01 with effect size
≥ 0.05 absolute), the full statistic block, and a severity enum
(none/low/high).
☑ On every successful auto-register-as-drifted event from
q-skills-bundle-auto-discovery, compute_drift() is invoked
automatically; results are persisted to a new
skill_drift_events(id, skill_name, prior_sha, new_sha, severity,
stats_json, raised_at, ack_at, ack_by)
table.
☑ HTML page /forge/skills/drift lists all unacknowledged events,
sortable by severity + raised_at. POST /api/forge/skills/drift/{id}/ack
acknowledges an event with optional note.
☑ Alert hook: severity = 'high' writes a row to senate_alerts
mirroring the integrity-sweeper pattern; severity low only logs.
☑ Tests tests/test_skill_drift.py:
(a) two SHAs, identical distributions → is_drifted=False.
(b) success_rate flips 0.95 → 0.50, n=200 each → severity=high.
(c) latency p50 doubles, success unchanged → severity reflects
latency-only drift, citation-rate p-value not significant.
(d) low n (5/5) → CI too wide → is_drifted=False even with big
point shift (avoids false alarms on quiet skills).
☑ Smoke: insert two synthetic SHAs of pubmed_search with 100
invocations each, force the second to have 50% success — drift
event row written with severity=high.
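
The rate checks named in the criteria above can be sketched with the standard library alone. This is an illustrative sketch, not the contents of skill_drift.py. The function names `two_proportion_z_test`, `wilson_ci`, and `rate_drifted` are hypothetical, the z-test uses the usual pooled standard error, and only the p < 0.01 and ≥ 0.05 absolute thresholds are taken from the spec. The latency check is not shown. In SciPy, Welch's t-test is available as `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided pooled z-test for H0: p1 == p2. Returns (z, p_value)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are all-success or all-failure: no evidence of a shift.
        return 0.0, 1.0
    z = (k2 / n2 - k1 / n1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


def wilson_ci(k, n, z=1.959964):
    """Wilson score interval for k successes in n trials (95% by default)."""
    if n == 0:
        return 0.0, 1.0
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def rate_drifted(k1, n1, k2, n2, alpha=0.01, min_effect=0.05):
    """Apply the spec's rule to one rate: significant AND large enough."""
    _, p_value = two_proportion_z_test(k1, n1, k2, n2)
    return p_value < alpha and abs(k2 / n2 - k1 / n1) >= min_effect
```

Under these assumptions, `rate_drifted(190, 200, 100, 200)` corresponds to test (b) and flags drift, while identical counts as in test (a) do not. A 5/5 to 2/5 shift is not flagged because its p-value stays above 0.01. The spec does not say how the real module handles more extreme low-n shifts, so this sketch makes no claim about that case.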

Approach

  • Migration first; backfill SHAs as empty so the column is non-null
    tolerant.
  • Update _log_invocation with a tiny in-process LRU
    (functools.lru_cache(maxsize=128)) keyed on skill_name and
    refreshed every 30 s.
  • The statistics live in scipy.stats (already a dependency via the
    pydeseq2 skill). Wilson CI is a hand-rolled helper to avoid pulling
    statsmodels solely for it.
  • Wire the alert + HTML; reuse templates/forge/base.html.
  • Tests use numpy.random.RandomState(42) for determinism.

Dependencies

  • q-skills-bundle-auto-discovery: surfaces the new SHA.
  • q-skills-usage-telemetry: share the latency/citation-rate query
    primitives.

Dependents

  • q-skills-cost-rationality: drift severity feeds the routing-cost
    model (do not route hot traffic to a drifted skill).

Work Log

  • 2026-04-27: Implemented full spec. Migration 20260428_skill_invocations_version.sql adds
    bundle_sha TEXT + bundle_version TEXT to agent_skill_invocations with
    partial index idx_asi_skill_sha WHERE bundle_sha != ''. Backfill SQL sets
    historical rows to ('', ''). _log_invocation updated with 30-s lru_cache on
    _get_bundle_info(skill_name). All 5 call sites pass bundle_sha/bundle_version.
    scidex/forge/skill_drift.py implements compute_drift() with Welch's t-test on
    latency, two-proportion z-test on success/cited rates, Wilson 95% CI.
    skill_drift_events extended with prior_sha, new_sha, stats_json,
    ack_at, ack_by. HTML /forge/skills/drift + POST ack endpoint added.
    senate_alerts table created (high-severity hook). 7 tests pass including
    smoke test that confirms severity=high drift event is written to DB.
    Commit d7e1671cb.
