[Senate] Hypothesis cohort tracker — survival analysis by birth-week

Effort: thorough

Goal

Hypotheses are minted continuously by Theorist agents, but the
platform never asks "of the hypotheses minted in week N, how many
survived to month 6?" — i.e. retained composite_score ≥ threshold,
weren't superseded, weren't quietly archived. This is the
fundamental epistemic-quality question: are we generating durable
ideas or burning compute on noise?

Build a Hypothesis Cohort Tracker: group hypotheses by
creation week (the cohort), compute survival/verification/Elo-
retention curves over time, surface which cohorts produced the
most durable ideas, and feed the curves into the q-epistemic-rigor quest as a quality KPI.
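
The cohort grouping described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes a cohort week starts on Monday (the spec does not say which weekday opens a week), and the helper names `cohort_week` and `group_by_cohort` are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def cohort_week(created: date) -> date:
    """Return the Monday that starts the week containing `created`.

    Assumption: weeks run Monday through Sunday.
    """
    return created - timedelta(days=created.weekday())


def group_by_cohort(created_dates: dict[str, date]) -> dict[date, list[str]]:
    """Map each cohort week to the sorted IDs of hypotheses created in it."""
    cohorts: dict[date, list[str]] = defaultdict(list)
    for hyp_id, created in created_dates.items():
        cohorts[cohort_week(created)].append(hyp_id)
    # Return a plain dict ordered by cohort week.
    return {week: sorted(ids) for week, ids in sorted(cohorts.items())}
```

A Monday-start week lines up with the Sunday 23:00 UTC timer, which would then fire just before each cohort week closes.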

Acceptance Criteria

☐ New module scidex/senate/hypothesis_cohorts.py:

- compute_cohort(creation_week: date) -> dict returns

{cohort_size,
 survival_at: {30d: n, 90d: n, 180d: n, 365d: n},
 verification_at: {...}, elo_p50_at: {...},
 promoted_to_canonical: int, superseded: int, archived: int}

where survival means composite_score ≥ 0.6 AND not superseded.
- recompute_all_cohorts() -> int walks every week from
the earliest hypothesis creation_at to the current week
and rebuilds the hypothesis_cohort_metrics table.

☐ New table hypothesis_cohort_metrics with columns

(cohort_week, cohort_size, snapshot_at, age_days, survivors,
 verified, mean_elo, median_elo, n_superseded, n_archived,
 n_promoted)

and one row per (cohort, snapshot) pair. Snapshots are taken weekly.

☐ Systemd timer scidex-hypothesis-cohorts-weekly.timer runs
every Sunday at 23:00 UTC and recomputes the latest snapshot
for every cohort.

☐ GET /cohorts/hypotheses dashboard:

- Header: total cohorts tracked, average 90-day survival
rate, trend arrow vs. trailing-quarter average.
- Heatmap: rows = cohort weeks, columns = age buckets
(30d / 90d / 180d / 365d), cell color = survival rate.
- "Best cohorts" sidebar: top-5 cohorts by 180-day
verification rate with link to each cohort's
hypothesis list.
- "Worst cohorts" sidebar: bottom-5 with prompt-evolution
diagnosis (which agent persona was generating that
week, were the source gaps stale, etc.) — sourced from
prompt_evolution.py history.

☐ GET /cohort/{week} per-cohort detail page:

- Survival curve (Kaplan-Meier-style step function).
- Full list of cohort hypotheses with current
composite_score and supersede status.
- Provenance breakdown: which agents/personas authored
them.

☐ Quality KPI integration: cohort survival rates are written into
senate_metrics(metric='hypothesis_cohort_survival_180d', value, week)
so the q-epistemic-rigor quest can alert on regressions.

☐ Pytest: seed 4 cohorts with varied sizes and survival
profiles, run the recompute, and assert that the metric rows
match the expected values; a KM-curve render test asserts that
survival values are monotone non-increasing.
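
The monotonicity assertion in the Pytest item can be illustrated with a small Kaplan-Meier estimator. This is a sketch under assumptions: the input shape, a list of (age_days, failed) pairs where failed=False marks a hypothesis still surviving at its last snapshot (censored), is hypothetical, as are the function names.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival estimate as a list of (age_days, survival) steps.

    observations: iterable of (age_days, failed) pairs. failed=True means the
    hypothesis stopped surviving at that age; failed=False means it was still
    surviving when last observed (right-censored).
    """
    obs = sorted(observations)
    event_times = sorted({age for age, failed in obs if failed})
    curve = []
    survival = 1.0
    for t in event_times:
        # At risk: everything observed up to at least age t.
        at_risk = sum(1 for age, _ in obs if age >= t)
        failures = sum(1 for age, failed in obs if age == t and failed)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


def is_monotone_non_increasing(curve):
    """True if survival never rises from one step to the next."""
    values = [s for _, s in curve]
    return all(a >= b for a, b in zip(values, values[1:]))
```

Each step multiplies by a factor in [0, 1], so the estimate is non-increasing by construction. A raw per-snapshot survivor fraction carries no such guarantee, because a composite_score can climb back above the threshold.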

Approach

"Superseded" detection uses
scidex.atlas.supersede_resolver, already in the codebase
(search supersede_resolver.py for the canonical helper).

Verification = composite_score ≥ 0.7 AND has at least one
evidence_assessment debate with verdict supports.
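
The survival and verification definitions in this spec reduce to two predicates. The sketch below assumes a hypothetical `HypothesisState` record; the real field names, and the way supporting debates are counted, may differ in the codebase.

```python
from dataclasses import dataclass

SURVIVAL_THRESHOLD = 0.6      # from the survival definition in this spec
VERIFICATION_THRESHOLD = 0.7  # from the verification definition in this spec


@dataclass(frozen=True)
class HypothesisState:
    """Hypothetical point-in-time view of one hypothesis."""
    composite_score: float
    superseded: bool
    # Number of evidence_assessment debates whose verdict is "supports".
    supporting_debates: int


def is_survivor(h: HypothesisState) -> bool:
    """Survival: composite_score >= 0.6 AND not superseded."""
    return h.composite_score >= SURVIVAL_THRESHOLD and not h.superseded


def is_verified(h: HypothesisState) -> bool:
    """Verification: composite_score >= 0.7 AND at least one supporting debate."""
    return h.composite_score >= VERIFICATION_THRESHOLD and h.supporting_debates >= 1
```

As written in the spec, verification does not itself require the hypothesis to be unsuperseded, so the sketch keeps the two predicates independent.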

Snapshots are additive: existing rows are never overwritten, so
we keep a true longitudinal record.
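
One way to enforce the additive rule is a composite primary key plus plain INSERTs, so a repeated write fails loudly instead of replacing history. The sketch below uses SQLite purely for illustration; the database engine, the column types, and the `append_snapshot` helper are assumptions, and only the column names come from this spec.

```python
import sqlite3

# Column names follow the spec; the types and the key are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hypothesis_cohort_metrics (
    cohort_week  TEXT    NOT NULL,
    cohort_size  INTEGER NOT NULL,
    snapshot_at  TEXT    NOT NULL,
    age_days     INTEGER NOT NULL,
    survivors    INTEGER NOT NULL,
    verified     INTEGER NOT NULL,
    mean_elo     REAL,
    median_elo   REAL,
    n_superseded INTEGER NOT NULL,
    n_archived   INTEGER NOT NULL,
    n_promoted   INTEGER NOT NULL,
    PRIMARY KEY (cohort_week, snapshot_at)
)
"""

INSERT_SQL = """
INSERT INTO hypothesis_cohort_metrics VALUES (
    :cohort_week, :cohort_size, :snapshot_at, :age_days, :survivors,
    :verified, :mean_elo, :median_elo, :n_superseded, :n_archived,
    :n_promoted
)
"""


def append_snapshot(conn: sqlite3.Connection, row: dict) -> None:
    """Append one (cohort, snapshot) row.

    A plain INSERT (no REPLACE or upsert) raises sqlite3.IntegrityError
    if the same (cohort_week, snapshot_at) pair is written twice.
    """
    conn.execute(INSERT_SQL, row)
    conn.commit()
```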

Heatmap rendering reuses the q-live-market-liquidity-heatmap
color-scale logic.
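
Before any color scale is applied, the metric rows have to be pivoted into a cohort-by-age grid. The sketch below shows one possible pivot. It assumes each bucket is filled from the earliest snapshot whose age_days has reached the bucket (weekly snapshots rarely land exactly on day 30 or 90), and `survival_heatmap` is a hypothetical name.

```python
AGE_BUCKETS = (30, 90, 180, 365)  # age columns from the dashboard spec


def survival_heatmap(rows):
    """Pivot metric rows into {cohort_week: [survival rate per age bucket]}.

    rows: iterable of dicts with cohort_week, cohort_size, age_days and
    survivors keys. A cell is None when the cohort is not yet old enough.
    """
    by_cohort = {}
    for row in rows:
        by_cohort.setdefault(row["cohort_week"], []).append(row)

    grid = {}
    for week in sorted(by_cohort):
        snapshots = sorted(by_cohort[week], key=lambda r: r["age_days"])
        cells = []
        for bucket in AGE_BUCKETS:
            # Earliest snapshot taken at or after the bucket age.
            snap = next(
                (s for s in snapshots
                 if s["age_days"] >= bucket and s["cohort_size"] > 0),
                None,
            )
            cells.append(None if snap is None
                         else snap["survivors"] / snap["cohort_size"])
        grid[week] = cells
    return grid
```

Leaving young cohorts as None rather than 0.0 keeps "not yet observable" visually distinct from "nothing survived".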

Worst-cohort diagnosis prompt: feed the prompt_evolution history
for that week plus the cohort's hypothesis statements to an LLM
and ask "what went wrong"; the short summary is stored in
hypothesis_cohort_metrics.diagnosis_md.

Dependencies

scidex.atlas.supersede_resolver: supersede detection.
scidex.senate.prompt_evolution: diagnosis source.
q-time-hypothesis-history-viewer: rich per-hypothesis history;
cohort drill-down links here.

Work Log


File: q-time-hypothesis-cohort-tracker_spec.md

Modified: 2026-05-01 20:13

Size: 4.0 KB