[Senate] CI: Ambitious quest task generator — xhigh-effort LLM-driven strategic task creation
open
analysis:9 coding:7 instruction_following:9 reasoning:10 safety:8

Run with xhigh effort every 30 min. Read the full spec: docs/planning/specs/codex_ambitious_quest_task_generator_spec.md. Gap predicate: maintain >=5 open non-CI tasks at priority >=90. Each cycle: (Phase 1) snapshot queue + recurring CI health + world-model state; (Phase 2) synthesize strategic gaps via deep LLM thinking; (Phase 3) create ambitious tasks (capability/scientific-output/value-prop, not row-count backfill); (Phase 4) audit priorities; (Phase 5) log. Refuse to duplicate active recurring drivers — see spec table. Replaces the prior quest_engine.py template generator (now deprecated) which abandoned 30+ cycles producing filler one-shots.

Completion Notes

Auto-release: recurring task had no work this cycle

Last Error

watchdog: worker lease expired; requeued

Git Commits (20)

Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (3 commits) (#1302) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (2 commits) (#1292) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (34 commits) (#1264) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (28 commits) (#1260) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (12 commits) (#1243) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (2 commits) (#1234) 2026-04-28
[Senate] Replace quest engine with ambitious LLM-driven generator (#1176) 2026-04-28
[Senate] Replace quest engine with ambitious LLM-driven generator spec 2026-04-28
[Senate] Create ambitious quest task generator spec + 6 strategic tasks [task:80ffb77b-8391-493c-8644-37086c8e2e3c] (#1211) 2026-04-28
[Senate] Quest engine: pass no_spec=true when gap has no spec_path; better 409 conflict logging 2026-04-28
[Senate] Quest engine CI cycle 42 work log: no new tasks, all blocked by dedup; fix commit already on main (#1185) 2026-04-28
Squash merge: orchestra/task/cfc43b30-link-50-evidence-entries-to-target-artif (2 commits) (#1173) 2026-04-28
[Senate] Quest engine: pass no_spec=true when gap has no spec_path; better 409 conflict logging 2026-04-28
Squash merge: orchestra/task/7a9c642b-strategic-engine-guardian-auto-reopen-bl (32 commits) (#1052) 2026-04-27
Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (4 commits) (#1035) 2026-04-27
Squash merge: orchestra/task/b4e04fba-triage-50-failed-tool-calls-by-skill-and (95 commits) (#1011) 2026-04-27
Squash merge: orchestra/task/9fc63687-add-pathway-diagrams-to-20-hypotheses-mi (17 commits) (#930) 2026-04-27
[Senate] Fix quest_engine auth: use ORCHESTRA_WEB_TOKEN, add required API fields, remove duplicate functions [task:80ffb77b-8391-493c-8644-37086c8e2e3c] (#923) 2026-04-27
Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (144 commits) (#479) 2026-04-26
Squash merge: orchestra/task/80ffb77b-quest-engine-generate-tasks-from-quests (102 commits) (#432) 2026-04-26

Spec File

Goal

Maintain a steady supply of ambitious, high-strategic-value tasks in the
SciDEX queue by running deep LLM analysis every 30 minutes. The work product
is new tasks (and priority adjustments on existing tasks) — never
boilerplate gap-counting. Output should advance core SciDEX capabilities
and accelerate cures and neuroscience, not just chip at row counts.

This spec deliberately replaces scripts/quest_engine.py and the prior quest-engine-ci.md template-driven approach, both of which produced
filler one-shots that mirror existing recurring CI work without adding
strategic leverage. The retired approach is documented for context only —
do not run it.

> ## Continuous-process anchor
>
> This spec instantiates theme S7 in docs/design/retired_scripts_patterns.md
> ("Quest task generation"). Every principle in the **"Design principles for
> continuous processes"** section is load-bearing. In particular:
>
> - LLMs for semantic judgment; rules for syntactic validation only
> - Gap-predicate driven, not calendar-driven
> - Idempotent + version-stamped + observable
> - No hardcoded entity lists, keyword lists, or canonical-name tables
> - Bounded batch (≤ 5 tasks per run unless deficit is larger)
> - Progressive improvement via outcome-feedback loop

The invariant this task enforces

At all times, SciDEX should have ≥ 5 open non-CI tasks at priority ≥ 90
that would, if completed, materially advance one of:

  • Core capability: scoring, debate, market mechanisms, KG quality,
    tool reliability, governance, agent orchestration
  • Scientific output: hypothesis quality, evidence chains, falsifiable
    predictions, experiment proposals, target validation for neurodegeneration
  • System value: SciDEX's role as a machine for prioritizing,
    organizing, synthesizing, inventing, funding, and rewarding science

    Gap predicate (run this first — if false, mostly-no-op):

    SELECT COUNT(*) FROM tasks
    WHERE project_id = (SELECT id FROM projects WHERE name='SciDEX')
      AND status IN ('open','available')
      AND priority >= 90
      AND task_type IN ('one_shot','iterative')
      AND title NOT LIKE '%CI:%'
      AND title NOT LIKE '%[Watchdog]%';

    If COUNT(*) >= 5: skip generation and run only the priority audit step
    (below). If COUNT(*) < 5: generate between (5 - COUNT(*)) and
    min(10, deficit*2) new tasks at priority ≥ 90, where deficit = 5 - COUNT(*).
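
The arithmetic of the gap predicate can be sketched in a few lines of Python. This is an illustrative helper, not part of the generator: the function name `creation_range` and its parameters are hypothetical, and it assumes the deficit is 5 minus the count returned by the query above.

```python
def creation_range(eligible_count: int, floor: int = 5, hard_cap: int = 10) -> tuple[int, int]:
    """Return (min_new, max_new) tasks to create in one cycle.

    eligible_count is the COUNT(*) from the gap-predicate query.
    """
    deficit = floor - eligible_count
    if deficit <= 0:
        # Invariant already holds: skip generation, run the priority audit only.
        return (0, 0)
    return (deficit, min(hard_cap, deficit * 2))


for count in (0, 3, 5):
    print(count, creation_range(count))
# 0 -> (5, 10), 3 -> (2, 4), 5 -> (0, 0)
```

With a count of 3 the range is 2 to 4 new tasks; with an empty queue it is 5 to 10.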

    What "ambitious" means here (the bar)

    A task qualifies as ambitious if at least two of:

  • It targets a capability SciDEX doesn't have yet, or one whose current
    implementation is a known-thin scaffold (not a row-count backfill).
  • It would, on completion, change the value of one of SciDEX's measurable
    outputs (debate quality, market signal, hypothesis novelty, KG density,
    reproducibility, time-to-first-experiment, agent throughput).
  • It frames a scientific question the system itself should answer
    (cross-disease mechanistic synthesis, novel target proposal with
    falsifiable prediction, paper-claim contradiction surface, etc.).
  • It builds a feedback loop / meta-mechanism (rubric versioning, tournament
    driver, reward-eligibility check, calibration meta-job).
  • It improves cross-layer integration (Atlas ↔ Exchange, Agora ↔ Senate,
    Forge ↔ Atlas) where today the layers are loosely coupled.
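
The "at least two of five" rule is a simple count once the five judgments exist. The sketch below assumes each criterion has already been judged (by the LLM, in keeping with the "LLMs for semantic judgment" principle); the class and field names are hypothetical and only the counting step is shown.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AmbitionJudgment:
    # One flag per criterion in the list above; names are illustrative.
    new_or_thin_capability: bool = False
    moves_measurable_output: bool = False
    frames_scientific_question: bool = False
    builds_feedback_loop: bool = False
    improves_cross_layer_integration: bool = False


def passes_ambitious_bar(judgment: AmbitionJudgment, required: int = 2) -> bool:
    """True when at least `required` of the five criteria are met."""
    met = sum(bool(getattr(judgment, f.name)) for f in fields(judgment))
    return met >= required


candidate = AmbitionJudgment(builds_feedback_loop=True, improves_cross_layer_integration=True)
print(passes_ambitious_bar(candidate))
```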

    A task is filler (do not create) if it is any of:

    • "Backfill column X for N rows" where a recurring CI task already covers
    this (see §"Recurring tasks already covering common gaps" below).
    • "Add references to N wiki pages" / "score N hypotheses" / "extract claims
    from N papers" — these mirror existing every-6h drivers; the right
    intervention is unsticking the driver, not creating a one-off chip.
    • Anything whose acceptance is "process N rows" without naming a capability
    improvement that wouldn't happen otherwise.
    • Anything that's a wrapper around a Senate dedup/cleanup activity that
    forces destructive action regardless of judgment.

    What the agent does each cycle (xhigh effort)

    You are running with --effort xhigh and should think hard before
    committing. A single well-framed ambitious task is worth more than
    five filler chips. Spend most of the cycle reading and synthesizing.

    Phase 1 — State snapshot (read-only, ~5 min)

    Gather and summarize:

  • Queue state via Orchestra MCP (list_tasks with project=SciDEX):
    - Total open one-shot/iterative count
    - Count at priority ≥ 90, ≥ 95, ≥ 99 (excluding CI: and Watchdog tasks)
    - Top 25 highest-priority open tasks: id, title, priority, age, quest_id
    - Top 10 oldest open tasks (for staleness review)
    - Last 30 completed tasks (last 24h): id, title, layer, completion_summary

  • Recent run outcomes (last 24-48h):
    - How many tasks completed vs abandoned vs still-running
    - Which quests/layers shipped real PRs (check pr_links_json,
    commit_links_json, merge_verified_at)
    - Which agents/models were most productive (group assigned_worker)

  • Active quests (Orchestra quests table):
    - For each active quest: name, layer, priority, current open-task count
    - Read the quest's spec file under docs/planning/specs/quest_<layer>_spec.md
    and any quest_<layer>_<topic>_spec.md companions
    - Read the five mission quest specs (Q-DSC, Q-OPENQ, Q-LIVE, Q-PROP, Q-PERC)
    when relevant to detect cross-quest leverage

  • Recurring CI health (critical — distinguishes filler from ambitious):

       SELECT id, title, frequency, last_completed_at,
              (julianday('now') - julianday(last_completed_at)) AS days_stale
       FROM tasks
       WHERE project_id=(SELECT id FROM projects WHERE name='SciDEX')
         AND task_type='recurring' AND status='open'
         AND priority >= 90
       ORDER BY days_stale DESC NULLS FIRST LIMIT 30;

    - If a recurring driver is stale > 24h, **a one-off chip at the same gap
    is filler.** Either propose unsticking the driver as the new task, or
    skip the gap entirely.

  • SciDEX world-model state via PostgreSQL (scidex.core.database).
    Polymorphic via information_schema — never hardcode column lists.
    Useful signals:
    - Hypothesis: count by status, score-completeness distribution, novelty
    histogram, evidence-link density
    - Papers: ingestion rate, abstract/fulltext/figures coverage, claims-extracted
    - Wiki: page count by content_md length percentile, refs_json density,
    KG-linkage rate, mermaid-diagram coverage
    - Markets: volume distribution, stale-resolution backlog, allocator activity
    - Debates: sessions per hypothesis, scored-vs-unscored, recency
    - KG: edges by type, low-confidence edge count, orphan node count
    - Forge tools: tool_call success rate by tool, last_used_at staleness
    - Senate: belief-snapshot recency, dividend distribution lag, contribution
    credit pipeline depth, quality-gate failure rate

  • Site state (optional, when LLM judges relevant):
    - curl -s http://127.0.0.1:8000/api/... for surface checks
    - Read data/scidex-artifacts/ ToC (don't load everything)
    - git log --since="24 hours ago" --format="%s%n%b" to see what landed
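
The recurring CI health check in Phase 1 can be tried against a toy table. The sketch below is only an illustration under stated assumptions: it uses Python's `sqlite3` with an in-memory database, a cut-down `tasks` schema, and invented rows. It passes "now" as a parameter so the result is deterministic, and orders by `last_completed_at IS NULL DESC` as a portable stand-in for `NULLS FIRST`.

```python
import sqlite3

# Same shape as the spec's health query, reduced to the columns the demo needs.
STALE_QUERY = """
SELECT id, title,
       julianday(:now) - julianday(last_completed_at) AS days_stale
FROM tasks
WHERE task_type = 'recurring' AND status = 'open' AND priority >= 90
ORDER BY last_completed_at IS NULL DESC, days_stale DESC
LIMIT 30
"""


def stale_drivers(conn, now_iso, max_days=1.0):
    """Return (id, title, days_stale) for drivers stale > max_days or never completed."""
    rows = conn.execute(STALE_QUERY, {"now": now_iso}).fetchall()
    return [row for row in rows if row[2] is None or row[2] > max_days]


# Toy data; ids, titles, and timestamps are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id TEXT, title TEXT, task_type TEXT, status TEXT,"
    " priority INTEGER, last_completed_at TEXT)"
)
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, 'recurring', 'open', 95, ?)",
    [
        ("a1", "[Demo] fresh driver", "2026-04-28 06:00:00"),
        ("b2", "[Demo] stale driver", "2026-04-25 12:00:00"),
        ("c3", "[Demo] never-run driver", None),
    ],
)
flagged = stale_drivers(conn, "2026-04-28 12:00:00")
for task_id, title, days in flagged:
    print(task_id, title, "never completed" if days is None else f"{days:.1f}d stale")
```

Rows with a NULL `last_completed_at` are reported separately from rows with a real age, which leaves room to treat missing timestamps as a tracking problem rather than genuine staleness.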

    Phase 2 — Synthesis (xhigh effort think)

    With state in hand, ask yourself the hard questions. Each answer is
    candidate task material:

    Strategic gaps (priority 95-99 candidates):

    • What capability would change SciDEX's value proposition if shipped this
    week? (Examples that would qualify: a working tournament driver that ranks
    hypotheses by predictive accuracy; a reward-eligibility audit that closes
    the contributor incentive loop; a cross-disease synthesis surface; a
    literature-citation verifier replacing the hallucinated [PMID:NNNN]
    markers; an agent-replay harness so failed tasks don't re-fail identically.)
    • What scientific question is the system uniquely positioned to answer
    this week? (e.g. "Generate 5 falsifiable mechanistic hypotheses connecting
    microglial senescence to TDP-43 proteinopathy, with PubMed-cited evidence
    for each, predictions, and proposed knockout experiments.")
    • Which layers have the least cross-talk and what bridge would help most?

    Capability gaps (priority 90-94 candidates):

    • Which thin-scaffold capability is closest to ready-for-real-use? (e.g. an
    experiment-proposal generator that has a working prompt but no UI surface.)
    • Which feedback loop is open-loop today and would close with one task?
    (e.g. debate quality → hypothesis prior; agent contribution → reward.)

    Health gaps (priority 90-92, only if no recurring CI covers it):

    • Which stuck recurring driver is the most leveraged to unstick? Frame as
    "diagnose + fix + verify resumed throughput" — not "do the work the driver
    was supposed to do."

    Phase 3 — Bounded creation

    For each new task:

  • Title: [Layer] <verb> <object> <specifics> — concrete, scannable.
  • Description: includes (a) why this matters (cite the strategic gap
    from Phase 2), (b) what success looks like in measurable terms,
    (c) what the agent should read first, (d) what not to do.
  • Priority: 90-99 by leverage. Only use ≥ 95 for genuine
    capability-shift work; reserve 99 for "this changes the whole system."
  • Layer / quest_id: route via existing layer quests when fit; create
    no new quests.
  • Spec_path: prefer existing quest layer specs (quest_<layer>_spec.md).
    For genuinely novel ambitious work, write a new spec file under
    docs/planning/specs/ describing the work in detail; commit it in the
    same worktree before creating the task; reference its path in spec_path.
  • Task type: default iterative with max_iterations=15 for deep
    work; use one_shot only for clearly-bounded single-PR work.
  • Tags: include quest-engine, layer name, and a topical tag.
  • Payload requirements: set realistic skill floors (reasoning,
    analysis, coding, creativity, safety).

    Cap creation at max(5, deficit + 2) per cycle. Never exceed 10.

    Dedup: use the MCP create_task server-side check; additionally do a
    fuzzy similarity ≥ 0.7 check against last 60 days of titles (open + done)
    before submitting.
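
The spec does not say which similarity measure backs the 0.7 threshold. As one possible reading, the sketch below uses `difflib.SequenceMatcher` from the Python standard library on normalized titles; the helper names are hypothetical, and a different metric (token overlap, embeddings) would need its own threshold calibration.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.7  # value taken from the dedup rule above


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())


def best_similarity(candidate: str, recent_titles) -> float:
    """Highest SequenceMatcher ratio between the candidate and any recent title."""
    cand = normalize(candidate)
    return max(
        (SequenceMatcher(None, cand, normalize(t)).ratio() for t in recent_titles),
        default=0.0,
    )


def is_probable_duplicate(candidate: str, recent_titles, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    return best_similarity(candidate, recent_titles) >= threshold


recent = ["[Agora] Gap resolution engine", "[Forge] Benchmark evaluation harness"]
print(is_probable_duplicate("[agora]  gap resolution ENGINE", recent))
```

A check like this would run in addition to the server-side check in create_task, not instead of it.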

    Phase 4 — Priority audit (every cycle)

    Independent of creation, audit existing open task priorities:

    • Overdue + low priority (e.g. 60d old at priority 50): demote to 30 or
    close as stale (with rationale in completion_summary).
    • Underprioritized strategic work: open task whose description references
    a Phase 2 strategic theme but priority < 85 → bump to 85-90.
    • Overprioritized filler: 95-priority task whose work is also covered
    by an active every-6h recurring driver → demote to 70.
    • Surface contention: if 3+ tasks are racing on the same files, lower
    the priority of all but the most strategic.

    Make ≤ 5 priority adjustments per cycle. Each adjustment: log the before/after
    and the rationale in this spec's Work Log section.
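
A minimal sketch of the per-cycle bound and the before/after log line follows. The `Adjustment` type and both helpers are hypothetical, and ranking proposals by size of change is an assumption of this sketch; the spec only fixes the cap of 5 and the requirement to log before, after, and rationale.

```python
from dataclasses import dataclass

MAX_ADJUSTMENTS_PER_CYCLE = 5  # from the audit rule above


@dataclass(frozen=True)
class Adjustment:
    task_id: str
    before: int
    after: int
    rationale: str


def select_adjustments(proposed):
    """Drop no-op proposals, then keep at most five, largest change first."""
    real = [a for a in proposed if a.before != a.after]
    real.sort(key=lambda a: abs(a.after - a.before), reverse=True)
    return real[:MAX_ADJUSTMENTS_PER_CYCLE]


def log_line(a: Adjustment) -> str:
    """One Work Log line per adjustment: id, before -> after, rationale."""
    return f"{a.task_id}: {a.before} -> {a.after} ({a.rationale})"


example = Adjustment("55e3ea08", 82, 50, "covered by recurring driver")
print(log_line(example))
```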

    Phase 5 — Cycle log

    Append a Work Log entry with:

    • UTC timestamp + worker id
    • Phase 1 snapshot summary (≤ 10 lines)
    • Phase 2 strategic gap synthesis (≤ 10 lines)
    • Tasks created (id, title, priority, rationale 1-line each)
    • Priority adjustments (id, before → after, rationale)
    • Stuck recurring drivers flagged (id, days_stale, suggested intervention)
    • Cycle duration

    Recurring tasks already covering common gaps

    Do not create one-shot duplicates of these (snapshot 2026-04-28; refresh
    each cycle by querying tasks WHERE task_type='recurring' AND status='open'):

    | Gap | Recurring driver | Frequency |
    |---|---|---|
    | Hypothesis score updates from new debates | [Exchange] CI: Update hypothesis scores from new debate rounds | every-2h |
    | Belief snapshots | [Economics] CI: Snapshot hypothesis prices for price history | every-2h |
    | PubMed evidence backfill | [Exchange] CI: Backfill evidence_for/evidence_against with PubMed citations | every-6h |
    | Discovery dividends | [Exchange] Discovery dividend backprop credit (driver #14) | every-6h |
    | Squad dividend multiplier | [Exchange] Squad-member dividend multiplier on backprop (driver #23) | every-6h |
    | Reward emission | [Exchange] Reward emission for contributions (driver #5) | every-6h |
    | Token bounty issuance | [Exchange] Token bounty issuance for open work (driver #4) | every-6h |
    | Funding allocator | [Exchange] Funding allocator activation (driver #10) | every-6h |
    | Quadratic funding | [Exchange] Quadratic funding allocator (driver #15) | every-6h |
    | Wiki citation enrichment | [Atlas] Wiki citation enrichment — add inline citations | every-6h |
    | Wiki ↔ KG cross-linking | [Atlas] CI: Cross-link new wiki pages to KG entities | every-6h |
    | Wiki mermaid regen | [Atlas] Wiki mermaid LLM regen | every-6h |
    | Paper figures | [Atlas] Extract and reference figures from scientific papers | every-2h |
    | Paper abstracts | [Forge] Reduce PubMed metadata backlog for papers missing abstracts | every-6h |
    | Notebook coverage | [Artifacts] CI: Verify notebook coverage | every-12h |
    | Stub notebook regen | [Artifacts] Audit all 67 stub notebooks | every-6h |
    | Debate sessions | [Agora] CI: Trigger debates for analyses with 0 debate sessions | every-24h |
    | Debate quality scoring | [Agora] CI: Run debate quality scoring on new/unscored sessions | every-6h |
    | Counter-argument bounties | [Agora] Counter-argument bounty market (driver #7) | every-6h |
    | Squad enrollment | [Agora] Squad open enrollment & recruitment (driver #21) | every-6h |
    | Tool call failure triage | [Forge] CI: Test all scientific tools for availability | daily |
    | World-model improvements | [Senate] World-model improvement detector (driver #13) | every-6h |
    | Agent contribution credit | [Senate] Agent contribution credit pipeline (driver #11) | every-6h |
    | Abandoned-run watchdog | [Senate] CI: Abandoned-run watchdog | every-1h |
    | Strategic engine guardian | [Senate] Strategic engine guardian | every-15-min |
    | Orchestra operator watchdog | [Senate] Orchestra operator watchdog and self-repair loop | every-2h |

    If a row-count gap matches one of these recurring drivers AND that driver
    has run in the last 24h, do not generate a one-off duplicate. If the driver
    is stale, prefer "unstick the driver" framing over "do the chipping."
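
The overlap rule above reduces to a three-way decision per candidate gap. The sketch below is one way to express it; the function name and return labels are hypothetical. It treats a missing `last_completed_at` as stale, which is an assumption: the work log later reports cycles where NULL timestamps were a tracking artifact, so a caller may want to handle that case separately.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESH_WINDOW = timedelta(hours=24)  # "has run in the last 24h"


def gap_action(has_recurring_driver: bool,
               last_completed_at: Optional[datetime],
               now: datetime) -> str:
    """Classify a row-count gap against the recurring-driver overlap rule."""
    if not has_recurring_driver:
        return "eligible"          # no driver owns this gap
    if last_completed_at is not None and now - last_completed_at <= FRESH_WINDOW:
        return "skip"              # driver is healthy: a one-off would be a duplicate
    return "unstick_driver"        # driver is stale: fix the driver, not the rows


now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
print(gap_action(True, now - timedelta(days=26), now))
```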

    Critical constraints

    • Pri 99 every-30-min: this spec must run regardless of queue depth
    (priority audit always happens; creation only when invariant violated).
    • xhigh effort: agents picking this up should run with maximum
    reasoning effort. The supervisor must route to a slot capable of
    reasoning >= 10. (Note: per auth._ensure_claude_launch_flags, the
    --effort xhigh flag is auto-repaired on acquire; non-interactive launch
    is safe.)
    • Provider any: claude or codex; both are acceptable. Codex is the
    default for cycles requiring deep code reading.
    • Idempotent: re-running this task immediately should be a no-op
    (queue invariant satisfied → no creation; priority audit converges).
    • No destructive action: this task only creates tasks and adjusts
    priorities. It never deletes or merges artifacts. Never close another
    task's lifecycle.
    • Bounded blast radius: ≤ 10 task creations per cycle, ≤ 5 priority
    adjustments per cycle. Anything larger needs operator gate.
    • Worktree discipline: any new spec files must be written in a
    worktree (per SciDEX guard-main-writes hook), committed, and PR'd.
    Do not write to the main checkout.
    • No new quests: route to existing quests; if no quest fits, route to
    the most relevant layer quest (qe-<layer> family).

    Acceptance criteria

    ☐ Each cycle reads queue + recurring health + world-model state in
    Phase 1 before any creation
    ☐ Each cycle produces a Work Log entry with the structure described in
    Phase 5
    ☐ Steady state: ≥ 5 open non-CI tasks at priority ≥ 90
    ☐ Created tasks pass the "ambitious" bar (at least 2 of the 5 criteria)
    ☐ Created tasks do not duplicate active recurring CI work
    ☐ Stale recurring drivers (>24h since last_completed_at) are flagged in
    every cycle's log, regardless of whether intervention happened

    Dependencies

    • /home/ubuntu/scidex/docs/planning/specs/ — quest specs to read
    • /home/ubuntu/scidex/docs/design/retired_scripts_patterns.md — design principles
    • /home/ubuntu/Orchestra/orchestra.db — task queue (read via MCP, write via MCP)
    • scidex.core.database.get_db_readonly — SciDEX PostgreSQL access
    • Orchestra MCP tools: list_tasks, create_task, update_task, list_quests

    Dependents

    • The agent fleet's productivity at any given moment is bounded by the
    ambition of the queue this task generates.
    • Strategic engine guardian (7a9c642b) reopens this if it gets blocked.
    • Watchdog (433698d3 family) creates [Watchdog] Fix: tasks when this
    abandons too often — those should be diagnosed, not silenced.

    Work Log

    2026-04-28 — Bootstrap (rewrite from quest-engine-ci.md)

    Rewrote spec to replace the queue-depth-only template generator with an
    xhigh-effort LLM-driven strategic generator. Prior approach (quest_engine.py
    + quest-engine-ci.md) produced filler one-shots that mirrored existing
    every-6h recurring drivers without adding strategic leverage. New spec:

    • Gap predicate is "≥5 non-CI priority-≥90 tasks", not "queue size < 50"
    • Every cycle reads recurring CI health and refuses to chip at gaps that
    active drivers already own
    • Phases mandate state snapshot → strategic synthesis → bounded creation
    → priority audit → log
    • "Ambitious" bar: capability work, scientific output, value-prop shift
    • Hard prohibition on row-count backfill duplicates of existing drivers

    Operators: see [Watchdog] Fix: tasks 433698d3, ff42baa0, 2fb403f8, 1fd42f17 for the abandonment history of the prior implementation.

    Cycle 1 — 2026-04-28T09:11Z (first agent run on the new spec)

    World model snapshot (Phase 1):

    | Metric | Value | Trend |
    |---|---|---|
    | Hypotheses | 1,873 | +1,117 in 7d |
    | All in proposed lifecycle | 1,873 / 1,873 | CRITICAL |
    | Debated hypotheses | 1,866 | |
    | Avg composite score | 0.561 | |
    | Top composite score | 0.96 | |
    | Debates (7d) | 501 | active |
    | Knowledge gaps | 3,545 | 3,153 open |
    | Gaps resolved | 12 / 3,545 | 0.34% |
    | Wiki pages | 17,662 | |
    | KG edges | 697,224 | |
    | Papers | 29,503 | |
    | Analyses | 470 | |
    | Benchmarks | 0 | CRITICAL |
    | ML models (artifacts) | 9 | low |
    | Debate sessions | 835 | total |

    Queue health: 56 open non-CI tasks at priority ≥ 90 (gap predicate met),
    97 recurring drivers active, only 3 one-shot tasks open (low for a
    platform with this activity level).

    Strategic gaps synthesized (Phase 2):

  • Hypothesis lifecycle frozen — every hypothesis stuck in proposed;
    no promotion engine exists. World model can never improve.
  • Gap resolution rate 0.34% — 3,545 gaps, 12 resolved; gap tracker is
    noise, not signal.
  • Zero benchmarks — Forge has no computational benchmarks; platform
    lacks predictive-validity demonstration.
  • Debate transcripts unused for causal KG — 835 debates contain
    mechanistic A→B→C reasoning never extracted as KG edges.
  • No cross-disease mechanism mining — debates siloed by disease;
    no AD/PD/ALS/FTD analogy miner.
  • Top hypotheses lack computational validation — top 25 at
    composite_score ≥ 0.88 debated but not computationally validated.

    Ambitious tasks created (Phase 3):

    | ID | Title | Priority |
    |---|---|---|
    | 7db6be96 | [Agora] Hypothesis lifecycle promotion engine — debate-to-validated pipeline | 94 |
    | c17abaf7 | [Forge] Computational validation of top 25 hypotheses — enrichment + expression analyses | 93 |
    | f4f7b129 | [Atlas] Gap closure pipeline — match 500 open gaps to evidence and resolve | 92 |
    | 72f50712 | [Agora] Debate transcript causal claim extractor — mine 835 debates for mechanistic KG edges | 91 |
    | 1186a9ab | [Forge] Seed neurodegeneration ML benchmark registry — 5 prediction tasks with answer keys | 91 |
    | 1b1ebf23 | [Agora] Cross-disease mechanism analogy miner — transfer AD/PD/ALS/FTD insights | 90 |

    All six pass the "ambitious" bar (capability work, scientific-output,
    or value-prop shift — not row-count backfill). None duplicate any of
    the 26 recurring drivers in the inline overlap table.

    Result observed at 10:16 UTC: Task 72f50712 (debate causal extractor)
    already at 2,020 KG edges from 80/200 sessions — exceeded its 2,000-edge
    target. The new generator is producing tasks that actually move the needle.

    Cycle 2 — 2026-04-28T18:35Z (worker claude-auto:41)

    Phase 1 — World model snapshot:

    | Metric | Value | Change from C1 |
    |---|---|---|
    | Hypotheses | 1,878 | +5 |
    | Proposed | 1,216 / 1,878 | DOWN from 1,873 (promotion engine working) |
    | Promoted | 216 | new status |
    | Active | 133 | new status |
    | Avg composite score | 0.561 | stable |
    | Hypotheses score > 0.8 | 77 | |
    | Knowledge gaps | 3,545 | same; 3,153 open; 12 resolved (0.34%) |
    | Wiki pages | 17,662 | stable |
    | KG edges (PostgreSQL) | 2,316 | up from ~800 (debate extractor impact visible) |
    | Papers | 29,503 | stable |
    | Analyses | 471 | +1 |
    | Debate sessions | 841 | +6 |
    | Artifacts | 51,832 | |

    Gap predicate result: 0 non-CI one_shot/iterative tasks at priority ≥90 (VIOLATED; need ≥5).
    Queue: 120 recurring drivers, 3 iterative tasks (all at priority 83–86), 0 one_shot.

    Stale recurring drivers flagged (>24h since last_completed_at):

    | Driver | Priority | Days Stale | Note |
    |---|---|---|---|
    | Database integrity check | 98 | 26d | CRITICAL systemic failure |
    | Hypothesis score update | 96 | 26d | Data quality frozen |
    | Trigger debates | 94 | 26d | Agora input frozen |
    | Evidence backfill | 94 | 26d | Evidence layer frozen |
    | Debate quality scoring | 93 | 26d | Quality signals frozen |
    | King of the Hill tournament | 97 | 22d | Core value prop broken |
    | Evolve economics | 97 | 24d | Markets frozen |

    Phase 2 — Strategic gaps synthesized:

  • Gap predicate violated: 0 eligible tasks (down from 56 in Cycle 1; that count mistakenly included recurring tasks). Immediate creation needed.
  • 8+ recurring drivers all stale since ~2026-04-02: shared stale date = systemic root cause, not individual driver bugs. Fixing it at the source is the highest-leverage intervention.
  • King of the Hill tournament 22d stale: Platform's core hypothesis-ranking mechanism is non-functional. Prediction markets are frozen.
  • KG still sparse: 2,316 edges despite ~760 unprocessed debate sessions that yield ~19 edges each. Task 72f50712 proved the extraction works (1,498 edges from 80 sessions on 2026-04-16); completing it = ~7x KG density.
  • Gap resolution rate 0.34% unchanged: No resolution engine exists. Building one converts 3,153 open gaps from noise to progress signal.
  • Cross-disease synthesis absent: 841 debates span AD/PD/ALS/FTD but no synthesis surface extracts shared pathways — unique scientific value only SciDEX can produce at scale.
  • Top hypotheses lack structured output: 77 hypotheses at composite_score > 0.8 exist as DB rows; no research brief format exists for researchers to act on.

    Phase 3 — Tasks created:

    | ID | Title | Priority | Rationale |
    |---|---|---|---|
    | 565ac01b | [Senate] Diagnose and fix 26d+ stale recurring CI drivers | 96 | 8+ drivers shared stale date = systemic root cause; fix the engine not the symptoms |
    | a3018033 | [Arenas] Fix King of the Hill tournament driver | 96 | Core hypothesis-ranking mechanism broken 22d; prediction markets non-functional |
    | 31eeae8d | [Agora] Gap resolution engine — auto-close gaps | 95 | 3,153 open gaps, 12 resolved; build the resolution pipeline, not one-off chips |
    | 8ebfce57 | [Atlas] Complete KG extraction from ~760 debate sessions | 94 | 2,316 edges; ~7x density possible; 72f50712 proved approach works |
    | ffd81f3a | [Agora] Cross-disease mechanism synthesizer | 93 | 841 siloed debates; shared AD/PD/ALS/FTD mechanisms = unique scientific value |
    | 33dca458 | [Agora] Research briefs for top 25 hypotheses | 92 | 77 high-score hypotheses as bare DB rows; structured reports = platform's core deliverable |

    All 6 pass the ambitious bar (≥2 of: capability, measurable output change, scientific question, feedback loop, cross-layer integration). None duplicate active recurring CI work.

    Phase 4 — Priority adjustments:

    | Task | Before | After | Rationale |
    |---|---|---|---|
    | 66c83cdc [Forge] Benchmark answer-key migration to dataset registry | 89 | 93 | Strategic infra: benchmarks still zero, this is the pipeline to fix it |
    | 55e3ea08 [Atlas] Reduce wiki-to-KG linking backlog | 82 | 50 | Covered by recurring [Atlas] CI: Cross-link new wiki pages to KG entities; demoted until driver is unstuck |
    | 2f7e1600 [Agora] Add counter-evidence reviews to 10 hypotheses | 86 | 60 | Mirrors stale [Exchange] CI: Backfill evidence_for/evidence_against driver; new stale-driver task (565ac01b) will unstick the driver |

    Cycle outcome: Gap predicate was 0 → will be ≥6 after these tasks are claimed. Six spec files committed to worktree at SHA 90904c4f4. Stale driver diagnosis (565ac01b) is the highest-leverage task: fixing it would unblock 8+ drivers and restore data quality across all layers.

    Cycle 3 — 2026-04-28T22:25Z (worker claude-auto, task 80ffb77b)

    Phase 1 — World model snapshot:

    | Metric | Value | Change from C2 |
    |---|---|---|
    | Hypotheses | 1,967 | +89 |
    | Proposed | 1,277 / 1,967 | stable |
    | Promoted | 221 | +5 from C2 |
    | Active | 156 | +23 |
    | Validated | 0 | CRITICAL — unchanged |
    | Avg composite score | 0.564 | +0.003 |
    | Hypotheses score ≥ 0.8 | 88 | +11 |
    | Knowledge gaps | 3,545 | stable |
    | Open gaps | 2,635 | down from 3,153 (gap resolution engine working) |
    | Resolved gaps | 529 | up from 12 — massive improvement |
    | KG edges | 2,316 | same total; 518 added last 24h (extraction ongoing) |
    | Debate sessions | 849 | +8 from C2 |
    | Papers | 29,560 | +57 |
    | Wiki pages | 17,667 | +5 |
    | Experiments | 724 | new (2bad68f6 task shipped) |
    | Hypothesis predictions | 3,729 | new (bulk predictions exist) |
    | Benchmarks | 6 | new (seeded by 1186a9ab) |
    | Benchmark submissions | 1 | only 1 submission for 6 benchmarks |
    | Prediction markets | 84 | stable |
    | Cross-disease analogies | 18 | new (ffd81f3a shipped) |

    Gap predicate result: 3 (< 5, gap predicate VIOLATED; deficit = 2, creating 3 tasks)
    Queue: 120 recurring drivers, 3 iterative tasks (d3d8cace @95, 8ebfce57 @94, 03094ddf @92).

    Stale recurring drivers note: All 120 recurring tasks show last_completed_at = NULL
    (999d stale), which is a known tracking artifact — the field isn't being populated, not a
    real staleness signal. No action taken on this since a one-shot duplicate would be filler.

    Recent merged work (last 48h, notable):

    • 8180d43b2 — KG edge extraction (8ebfce57) merged; still iterating
    • b3668c46b — KG causal extraction (d3d8cace) merged; still iterating
    • a840eae9e — Cross-disease mechanism synthesizer (ffd81f3a) — 18 analogies
    • 0bb597081, bf472125e — Research briefs (33dca458) — merged
    • 8ea8cc96b — Experiment proposal generator (2bad68f6) — 724 experiments created
    • df1ad15de — Gap resolution engine (31eeae8d) — 529 gaps resolved
    • 473783fee — Gap closure pipeline (f4f7b129) — batch gap closure

    Phase 2 — Strategic gaps synthesized:

  • 0 validated hypotheses (CRITICAL): 221 promoted, 88 composite_score≥0.8, 3,729
    predictions, 724 experiments — yet no hypothesis is "validated". The lifecycle gate is absent.
    This is the most important scientific output gap on the platform.
  • 6 benchmarks with 1 submission: Seeding task (1186a9ab) built the registry but not
    the evaluator. Hypotheses have no benchmark scores — they can't be compared quantitatively.
  • Forge↔Exchange bridge missing: 724 experiments and 84 prediction markets are disconnected.
    No pathway from experiment→market exists. Researchers can't trade on experimental predictions.
  • KG edge growth active but slow: 518 edges added in 24h but total still 2,316 with 849
    debates to mine. Tasks d3d8cace and 8ebfce57 are still running iteratively; no new task needed.
  • All recurring drivers show NULL last_completed_at: Tracking bug, not real staleness.
    Not actionable via new one-shots.

    Phase 3 — Tasks created:

    | ID | Title | Priority | Rationale |
    |---|---|---|---|
    | (pending) | [Agora] Hypothesis validation gate | 95 | 0 validated out of 88 eligible; lifecycle gate is absent; closes the core science loop |
    | (pending) | [Forge] Benchmark evaluation harness | 92 | 6 benchmarks, 1 submission; evaluation harness missing; turns debates into testable predictions |
    | (pending) | [Exchange] Experiment-to-prediction-market bridge | 91 | 724 experiments + 84 markets disconnected; Forge↔Exchange cross-layer bridge |

    All 3 pass the ambitious bar (≥2 of: capability, measurable output change, scientific question,
    feedback loop, cross-layer integration). None duplicate active recurring CI work.

    Phase 4 — Priority adjustments:

    No priority adjustments this cycle. The 3 open non-recurring tasks (d3d8cace @95, 8ebfce57 @94,
    03094ddf @92) are appropriately prioritized — all are active iterative tasks producing real work.
    Recurring task staleness values (all 999d) are a tracking artifact, not real stale dates.

    Cycle outcome: Gap predicate 3 → will be ≥6 after these 3 tasks are claimed. Three new
    spec files committed. Top priority: validation gate (0 validated hypotheses is the starkest gap
    visible in the world model).

    Cycle 4 — 2026-04-29T03:05Z (worker claude-auto, task 80ffb77b)

    Phase 1 — World model snapshot:

    | Metric | Value | Change from C3 |
    |---|---|---|
    | Hypotheses | 1,974 | +7 |
    | Proposed | 1,284 | +7 |
    | Promoted | 221 | stable |
    | Active | 156 | stable |
    | Validated status | does not exist in schema | CRITICAL |
    | High-score (≥0.8) | 88 | stable |
    | Avg composite score | 0.564 | stable |
    | KG gaps | 3,545 | stable |
    | Open gaps | 2,635 | stable |
    | Resolved gaps | 529 | stable |
    | Papers | 29,566 | +6 |
    | Wiki pages | 17,682 | +15 |
    | KG edges (kg_edges) | 2,366 | +50 |
    | Causal edges (causal_edges) | 19,753 | HUGE (wiki-extracted) |
    | Debate sessions | 850 | +1 |
    | Experiments | 724 | stable |
    | Benchmarks | 6 | stable |
    | Benchmark submissions | 227 | was 1 in C3; harness working! |
    | Predictions pending | 3,719 | +12 |
    | Predictions confirmed | 9 | NEW |
    | Predictions falsified | 5 | NEW |
    | Prediction markets | 139 | +120 in 24h (experiment bridge active) |
    | Resolved markets | 23 | +23 in 24h (resolution engine active) |
    | Convergence reports | 61 | NEW (previously unseen) |
    | Cross-disease analogies | 18 | stable |

    Gap predicate result: 1 (< 5, VIOLATED; only 8ebfce57 qualifies; deficit = 4)
    Queue: 118 recurring, 1 iterative (8ebfce57 @94), 0 one-shot non-CI.

    Stale recurring drivers: All 118 recurring tasks show last_completed_at = NULL
    (999d stale) — confirmed tracking artifact, not real staleness. No action.
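
One way the "999d stale" reading could arise is a staleness function that falls back to a sentinel when last_completed_at is NULL. The sketch below assumes that mechanism; the sentinel constant and both function names are hypothetical and are not taken from the tracker's code.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

# Assumed sentinel, chosen to match the "999d" shown in the log.
NEVER_COMPLETED_SENTINEL_DAYS = 999


def staleness_days(last_completed_at: Optional[datetime], now: datetime) -> int:
    """Days since the last completion, or the sentinel when no completion is recorded."""
    if last_completed_at is None:
        return NEVER_COMPLETED_SENTINEL_DAYS
    return max(0, (now - last_completed_at).days)


def looks_like_tracking_artifact(timestamps: Iterable[Optional[datetime]]) -> bool:
    """If every driver has a NULL timestamp, the column is probably never written."""
    values = list(timestamps)
    return bool(values) and all(ts is None for ts in values)


now = datetime(2026, 4, 29, 3, 5, tzinfo=timezone.utc)
print(staleness_days(None, now))                                             # 999
print(staleness_days(datetime(2026, 4, 28, 3, 5, tzinfo=timezone.utc), now))  # 1
print(looks_like_tracking_artifact([None] * 118))                            # True
```

Under this assumption, a uniform 999 across all 118 drivers says nothing about when they last ran, which is consistent with the log treating it as a tracking artifact.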

    Notable recent merges:

    • 2a40d719a — Atlas mechanism consensus map (1,974 hypotheses classified by pathway)
    • 136cc0f22 — Forge benchmark evaluation harness (shipped 226 new submissions)
    • 56726030f — Exchange experiment-to-prediction-market bridge (120 markets in 24h)
    • ca6dfe65a — Hypothesis validation gate (merged but 'validated' status doesn't exist in schema)
    • 535106909 — Prediction market resolution engine (23 resolved in 24h)
    • edb667451 — Hypothesis prediction contradiction detector
    Phase 2 — Strategic gaps synthesized:

  • 'validated' status missing from schema (CRITICAL): validation gate (ca573a56) shipped
    but hypotheses.status has no 'validated' option; status enum: proposed/promoted/active/
    debated/archived/open/superseded. 88 hypotheses at composite_score ≥ 0.8 have no
    scientific endpoint. Platform cannot deliver validated science.
  • 19K causal edges siloed: causal_edges table (19,753, wiki-extracted, free-text entities,
    mechanism_description, evidence_pmids) is disconnected from kg_edges (2,366, typed entity refs).
    10x KG density increase possible via entity resolution bridge.
  • 61 convergence reports unsurfaced: convergence_reports table has 61 rows but they're not
    quality-scored, not UI-surfaced, not linked to hypotheses; potentially the platform's first
    scientific synthesis output, invisible to users.
  • 3,741 predictions 99.6% unevaluated: infrastructure exists (9 confirmed, 5 falsified found
    manually), but systematic evaluation pipeline is absent. Evaluating predictions demonstrates
    platform's predictive validity.
  • 120 prediction markets in 24h without quality gate: experiment bridge is working but
    creating noise. Markets need resolution criteria, non-50/50 priors, and liquidity > 0.
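
The three market requirements in the last item (resolution criteria, non-50/50 priors, liquidity > 0) could be checked roughly as follows. This is a sketch under assumptions: the `Market` class and its field names are hypothetical and do not reflect the Exchange schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Market:
    """Hypothetical view of a prediction market row; field names are assumptions."""
    market_id: str
    resolution_criteria: Optional[str]
    prior_probability: float
    liquidity: float


def quality_gate_failures(market: Market, prior_tolerance: float = 1e-9) -> List[str]:
    """Return the list of failed checks; an empty list means the market passes."""
    failures = []
    if not (market.resolution_criteria or "").strip():
        failures.append("missing resolution criteria")
    if abs(market.prior_probability - 0.5) <= prior_tolerance:
        failures.append("uninformative 50/50 prior")
    if market.liquidity <= 0:
        failures.append("no liquidity")
    return failures


noisy = Market("m1", "", 0.5, 0.0)
sound = Market("m2", "Resolves YES if the experiment reports the predicted effect", 0.7, 100.0)

print(quality_gate_failures(noisy))  # all three checks fail
print(quality_gate_failures(sound))  # []
```

Returning the failed checks rather than a bare boolean keeps the reason for a rejection available for logging.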

    Phase 3 — Tasks created:

    | ID | Title | Priority | Rationale |
    |---|---|---|---|
    | 0b9657cd | [Agora] Add 'validated' hypothesis lifecycle status | 96 | 'validated' missing from schema; 88 eligible; closes scientific closure loop |
    | 12c461ae | [Atlas] Causal KG entity resolution | 95 | 19K free-text edges → KG bridge; 10x density; Atlas↔Agora integration |
    | 712ca5de | [Atlas] Surface and score 61 convergence reports | 93 | First scientific synthesis output; 61 reports exist but invisible |
    | 2c4b95b0 | [Agora] Falsifiable prediction evaluation pipeline | 92 | 3,741 pending predictions; 0.37% evaluated; closes prediction feedback loop |
    | 41ee05e3 | [Exchange] Prediction market quality gate | 91 | 120 markets in 24h; quality gate missing; protects market signal integrity |

    All 5 pass the ambitious bar (≥2 of: capability, measurable output, scientific question,
    feedback loop, cross-layer integration). None duplicate any of the 26 recurring drivers.
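
The "0.37% evaluated" and "99.6% unevaluated" figures for this cycle follow from the counts quoted in the log (9 confirmed, 5 falsified, 3,741 predictions). A short check of that arithmetic, with variable names chosen only for illustration:

```python
# Counts as quoted in the Cycle 4 log.
confirmed = 9
falsified = 5
total_predictions = 3_741

evaluated = confirmed + falsified
evaluated_pct = 100 * evaluated / total_predictions
unevaluated_pct = 100 - evaluated_pct

print(evaluated)                  # 14
print(round(evaluated_pct, 2))    # 0.37
print(round(unevaluated_pct, 1))  # 99.6
```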

    Phase 4 — Priority adjustments:

    No adjustments this cycle. The 1 open non-recurring task (8ebfce57 [Atlas] KG edge
    extraction @94) is appropriately prioritized and actively iterating. Recurring driver
    staleness values are tracking artifacts. New tasks span P91-P96 with appropriate gradation.

    Cycle outcome: Gap predicate was 1 → will be 6 (5 new + 8ebfce57) after tasks are claimed.
    5 spec files committed (SHA 28ad1fe9d). Top priority: 0b9657cd — 'validated' status missing
    from schema is the most critical scientific lifecycle gap; without it, the platform cannot
    deliver its core promise (validated science). Secondary priority: 12c461ae causal KG bridge
    (10x KG density increase with existing data).

    Cycle 5 — 2026-04-29T03:55Z (worker claude-auto, task 80ffb77b)

    Phase 1 — World model snapshot:

    | Metric | Value | Change from C4 |
    |---|---|---|
    | Hypotheses | 1,974 | stable |
    | Proposed | 1,284 | stable |
    | Promoted | 221 | stable |
    | Active | 156 | stable |
    | Validated | 0 | CRITICAL: unchanged (validation task in flight) |
    | High-score (≥0.8) | 88 | stable |
    | Avg composite score | 0.564 | stable |
    | Knowledge gaps | 3,545 | stable |
    | Open gaps | 2,635 | stable |
    | Resolved gaps | 529 | stable |
    | Wiki pages | 17,682 | stable |
    | KG edges | 2,366 | stable |
    | Causal edges | 19,753 | stable |
    | Papers | 29,566 | stable |
    | Debate sessions | 864 | +14 |
    | Experiments | 724 | stable |
    | Benchmarks | 6 | stable |
    | Benchmark submissions | 227 | stable |
    | Convergence reports | 61 | stable |
    | Prediction markets | 139 (80 active, 23 resolved, 36 cancelled) | stable |
    | Tournaments | 112 (95 complete, 16 open, 1 in_progress) | stable |
    | Tournament matches | 2,207 | stable |
    | Targets | 185 | stable |
    | Target dossier rows | 0 | CRITICAL: new gap identified |
    | Wiki quality scores | 5,563 | |
    | World model improvements | 1,309 | |

    Gap predicate result: 1 (only 8ebfce57 open; Cycle 4 tasks 0b9657cd, 12c461ae, 712ca5de,
    2c4b95b0 all moved to running status within 30min of creation — gap predicate counts open/available only, so running tasks don't qualify). Deficit = 4.

    Running non-CI tasks at p≥90 (not counted by predicate, all in flight): 0b9657cd p=96 (validated lifecycle), 12c461ae p=95 (causal KG), d924270b p=94
    (hypothesis advancement optimizer — new, created outside this generator), 712ca5de p=93
    (convergence reports), 03094ddf p=92 (falsifiable predictions), 2c4b95b0 p=92
    (prediction evaluation), 17f3b8e0 p=91 (mechanistic claim verifier).
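
The gap predicate as described here (count only open, non-CI tasks at priority ≥90, target of 5) can be sketched as below. The `Task` class and its `status` and `is_ci` fields are assumptions for illustration, and the last queue entry is a made-up recurring driver, not a real task ID.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Task:
    """Hypothetical queue row; field names are assumptions."""
    task_id: str
    status: str    # e.g. "open" or "running"
    priority: int
    is_ci: bool    # True for recurring CI drivers


def gap_predicate(tasks: Iterable[Task], min_priority: int = 90, target: int = 5) -> Tuple[int, int]:
    """Return (qualifying_count, deficit). Running tasks and CI drivers do not count."""
    count = sum(
        1 for t in tasks
        if t.status == "open" and not t.is_ci and t.priority >= min_priority
    )
    return count, max(0, target - count)


queue = [
    Task("8ebfce57", "open", 94, False),
    Task("0b9657cd", "running", 96, False),   # in flight, so not counted
    Task("12c461ae", "running", 95, False),   # in flight, so not counted
    Task("example-recurring-driver", "open", 95, True),  # hypothetical CI task
]

print(gap_predicate(queue))  # (1, 4)
```

With only 8ebfce57 open, the sketch reproduces the logged result of 1 and a deficit of 4, which shows why tasks that are claimed quickly leave the predicate violated.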

    Stale recurring drivers: All recurring tasks show last_completed_at = NULL (999d stale)
    — confirmed tracking artifact from prior cycles, not real staleness. No action.

    Phase 2 — Strategic gaps synthesized:

  • 185 targets, 0 dossier rows (CRITICAL): The target_dossier table has zero rows
    despite 185 neurodegeneration drug targets existing. No running task addresses this.
    Targets are the translational endpoint: the entities drug discoverers act on.

  • Debate sessions as unused hypothesis source: 864 debate sessions contain structured
    multi-agent reasoning (claims with agent confidence, skeptic concessions, consensus
    points) that has never been mined for NEW hypothesis candidates. This is richer than
    single-LLM generation; cross-agent consensus is a quality signal.

  • No hypothesis-target linkage: 1,974 hypotheses and 185 targets exist in parallel
    with no systematic relevance scoring between them. unit_hypothesis_links exists but
    population is unknown. Linking hypotheses to targets creates the drug discovery ranking.

  • Agent contribution loop unverified: agent_contributions, agent_reputation,
    token_reward_events tables exist with recurring drivers, but the full pipeline from
    task completion → reputation update → token emission has never been audited end-to-end.
    A broken link means the incentive system is non-functional.

  • 9 ML models vs 6 benchmarks, no leaderboard: The Forge↔Atlas measurement loop
    (artifacts evaluated on benchmarks) has no summary leaderboard despite 227 submissions.
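
An end-to-end audit of the contribution loop (task completion → reputation update → token emission) could start with a set comparison like the one below. It is a sketch that assumes each stage can be keyed by a task ID; the function name and the example IDs are hypothetical, and the real tables may link stages differently.

```python
from typing import Dict, Iterable, List


def audit_contribution_loop(
    completed_task_ids: Iterable[str],
    reputation_task_ids: Iterable[str],
    reward_task_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Report where the completion → reputation → token chain breaks, per task ID."""
    completed = set(completed_task_ids)
    with_reputation = set(reputation_task_ids)
    with_reward = set(reward_task_ids)
    return {
        # Completed work that never produced a reputation update.
        "missing_reputation_update": sorted(completed - with_reputation),
        # Reputation was updated, but no token reward event followed.
        "missing_token_emission": sorted((completed & with_reputation) - with_reward),
        # Rewards that cannot be traced back to a completed task.
        "orphan_rewards": sorted(with_reward - completed),
    }


report = audit_contribution_loop(
    completed_task_ids=["t1", "t2", "t3"],
    reputation_task_ids=["t1", "t2"],
    reward_task_ids=["t1", "t9"],
)
print(report["missing_reputation_update"])  # ['t3']
print(report["missing_token_emission"])     # ['t2']
print(report["orphan_rewards"])             # ['t9']
```

Any non-empty bucket marks a broken link of the kind the item above warns about.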

    Phase 3 — Tasks created:

    | ID | Title | Priority | Rationale |
    |---|---|---|---|
    | fc9309fd | [Atlas] Neurodegeneration target dossier pipeline | 96 | 185 targets, 0 dossiers; direct translational output; closes Atlas-to-drug-discovery loop |
    | 4ba968e0 | [Agora] Debate-to-hypothesis synthesis engine | 94 | 864 debate sessions unused as hypothesis source; multi-agent consensus = quality signal |
    | 14954567 | [Atlas] Hypothesis-target relevance scoring | 93 | No hypothesis-target linkage; creates drug discovery priority ranking |
    | 3ff23dae | [Senate] Agent contribution feedback loop audit | 91 | Full pipeline from task→reputation→token never verified end-to-end |
    | fb4e9333 | [Forge] ML model vs benchmark head-to-head evaluation | 90 | 9 models + 6 benchmarks + 227 submissions but no leaderboard |

    All 5 pass the ambitious bar (≥2 of: capability, measurable output change, scientific
    question, feedback loop, cross-layer integration). None duplicate any of the 26 recurring
    drivers or the 7 tasks currently running.

    Phase 4 — Priority adjustments:

    No adjustments this cycle. The 7 running non-CI tasks (p=91–96) are appropriately
    prioritized for their strategic value. Recurring task staleness values are tracking
    artifacts. No priority misalignment detected.

    Cycle outcome: Gap predicate was 1 → will be ≥6 after new tasks are claimed.
    5 spec files committed SHA 5b69529dd. Top priority: fc9309fd (target dossier pipeline
    — 185 targets with 0 dossiers is the starkest new gap). Secondary: 4ba968e0
    (debate-to-hypothesis synthesis — unused high-quality source for novel hypotheses).

    Payload JSON
    {
      "completion_shas": [
        "b6ef46508"
      ],
      "completion_shas_checked_at": "2026-04-21T04:33:08.236839+00:00",
      "_watchdog_repair_task_id": "1fd42f17-02b4-4cc6-a247-cf87150eabd4",
      "_watchdog_repair_created_at": "2026-04-22T20:06:43.678495+00:00",
      "requirements": {
        "reasoning": 10,
        "analysis": 9,
        "coding": 7,
        "safety": 8,
        "instruction_following": 9
      },
      "_stall_skip_providers": [
        "glm"
      ]
    }
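
The payload's "requirements" and "_stall_skip_providers" fields suggest a worker-selection rule, although the page does not document how the scheduler uses them. The sketch below assumes that requirements are minimum capability scores and that skipped providers are excluded outright; the provider names other than "glm" and all capability scores are hypothetical.

```python
import json

# Subset of the payload shown above.
PAYLOAD_JSON = """
{
  "requirements": {
    "reasoning": 10,
    "analysis": 9,
    "coding": 7,
    "safety": 8,
    "instruction_following": 9
  },
  "_stall_skip_providers": ["glm"]
}
"""


def is_eligible(provider: str, capabilities: dict, payload: dict) -> bool:
    """Assumed rule: not on the skip list and meets every minimum score."""
    if provider in payload.get("_stall_skip_providers", []):
        return False
    required = payload.get("requirements", {})
    return all(capabilities.get(dim, 0) >= minimum for dim, minimum in required.items())


payload = json.loads(PAYLOAD_JSON)
strong = {dim: 10 for dim in payload["requirements"]}  # hypothetical scores
weak_coder = dict(strong, coding=6)                     # below the coding minimum of 7

print(is_eligible("provider-a", strong, payload))      # True
print(is_eligible("provider-b", weak_coder, payload))  # False
print(is_eligible("glm", strong, payload))             # False
```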

    Task Dependencies

    ↓ Referenced by (downstream)