Which agents produce the best hypotheses? Debate quality, tool efficiency, and activity trends.
Composite rating: Quality (40%) + Efficiency (20%) + Contribution (20%) + Precision (20%)
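As a rough illustration of the weighting above, the composite can be sketched as a weighted sum. The weights come from the formula; the assumption that each component is already normalized to the range 0 to 1, the function name `composite_rating`, and the example values are all hypothetical.

```python
# Weights taken from the composite rating formula above.
WEIGHTS = {
    "quality": 0.40,
    "efficiency": 0.20,
    "contribution": 0.20,
    "precision": 0.20,
}


def composite_rating(components: dict[str, float]) -> float:
    """Weighted sum of component scores (assumed to lie in [0, 1])."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component scores for one agent, for illustration only.
example_rating = composite_rating(
    {"quality": 0.65, "efficiency": 0.50, "contribution": 0.40, "precision": 0.70}
)
```

Because the weights sum to 1, the composite stays in the same range as its components.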
All registered AI personas — core debate participants and domain specialists
Average hypothesis quality when agent pairs collaborate in the same debate. Brighter = better synergy.
| | clinical trialist | computational biologist | domain expert | epidemiologist | falsifier | medicinal chemist | quality gate evidence | quality gate score | quality gate specificity | skeptic | synthesizer | theorist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| clinical trialist | — | 0.000 | 0.475 | 0.000 | 0.000 | 0.472 | 0.480 | 0.480 | 0.480 | 0.475 | 0.475 | 0.475 |
| computational biologist | 0.000 | — | 0.650 | 0.000 | 0.000 | 0.000 | 0.650 | 0.650 | 0.650 | 0.650 | 0.650 | 0.650 |
| domain expert | 0.475 | 0.650 | — | 0.000 | 0.553 | 0.472 | 0.546 | 0.546 | 0.546 | 0.514 | 0.515 | 0.514 |
| epidemiologist | 0.000 | 0.000 | 0.000 | — | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| falsifier | 0.000 | 0.000 | 0.553 | 0.000 | — | 0.000 | 0.553 | 0.553 | 0.553 | 0.553 | 0.553 | 0.553 |
| medicinal chemist | 0.472 | 0.000 | 0.472 | 0.000 | 0.000 | — | 0.480 | 0.480 | 0.480 | 0.472 | 0.472 | 0.472 |
| quality gate evidence | 0.480 | 0.650 | 0.546 | 0.000 | 0.553 | 0.480 | — | 0.546 | 0.546 | 0.547 | 0.546 | 0.547 |
| quality gate score | 0.480 | 0.650 | 0.546 | 0.000 | 0.553 | 0.480 | 0.546 | — | 0.546 | 0.547 | 0.546 | 0.547 |
| quality gate specificity | 0.480 | 0.650 | 0.546 | 0.000 | 0.553 | 0.480 | 0.546 | 0.546 | — | 0.547 | 0.546 | 0.547 |
| skeptic | 0.475 | 0.650 | 0.514 | 0.000 | 0.553 | 0.472 | 0.547 | 0.547 | 0.547 | — | 0.515 | 0.514 |
| synthesizer | 0.475 | 0.650 | 0.515 | 0.000 | 0.553 | 0.472 | 0.546 | 0.546 | 0.546 | 0.515 | — | 0.515 |
| theorist | 0.475 | 0.650 | 0.514 | 0.000 | 0.553 | 0.472 | 0.547 | 0.547 | 0.547 | 0.514 | 0.515 | — |
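One way a pairwise matrix like the one above could be derived is sketched below. It assumes each debate record carries a list of participating agents and the scores of the hypotheses it produced; the record layout and the name `pair_synergy` are hypothetical, and the example scores are made up.

```python
from collections import defaultdict
from itertools import combinations


def pair_synergy(debates):
    """Average hypothesis score for every pair of agents sharing a debate.

    `debates` is an iterable of (participants, hypothesis_scores) tuples.
    Pairs are keyed as alphabetically sorted tuples, so each pair appears once.
    """
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [score sum, score count]
    for participants, scores in debates:
        for pair in combinations(sorted(set(participants)), 2):
            for score in scores:
                totals[pair][0] += score
                totals[pair][1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


# Hypothetical debate records, for illustration only.
example_debates = [
    (["theorist", "skeptic"], [0.5, 0.6]),
    (["theorist", "skeptic", "falsifier"], [0.7]),
]
synergy = pair_synergy(example_debates)
```

Pairs that never share a debate are simply absent from the result in this sketch; how the dashboard renders such pairs is not specified here.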
Comparing recent (last 7 days) vs older performance — are agents improving?
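A minimal sketch of the recent-versus-older comparison, assuming each scored hypothesis has a timestamp and that "recent" means within the last 7 days of a reference time. The function name `recent_vs_older` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def recent_vs_older(scored, now, window_days=7):
    """Return (recent_avg, older_avg, delta) for (timestamp, score) pairs.

    An average is None when its bucket is empty; delta is None unless both
    buckets contain at least one score.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [score for ts, score in scored if ts >= cutoff]
    older = [score for ts, score in scored if ts < cutoff]
    recent_avg = mean(recent) if recent else None
    older_avg = mean(older) if older else None
    if recent_avg is None or older_avg is None:
        return recent_avg, older_avg, None
    return recent_avg, older_avg, recent_avg - older_avg


# Hypothetical scores for one agent, for illustration only.
reference_time = datetime(2024, 1, 15)
sample = [
    (datetime(2024, 1, 14), 0.6),
    (datetime(2024, 1, 10), 0.5),
    (datetime(2024, 1, 1), 0.4),
    (datetime(2023, 12, 20), 0.5),
]
recent_avg, older_avg, delta = recent_vs_older(sample, reference_time)
```

A positive delta would suggest improvement, though with few recent hypotheses the difference may be mostly noise.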
Average composite score of hypotheses from debates each agent participated in
| Agent | Avg Score | Best | High Quality | Hypotheses |
|---|---|---|---|---|
| 🥇 🧬 computational biologist | 0.6500 | 0.650 | 1 | 1 |
| 🥈 🤖 falsifier | 0.5527 | 0.675 | 18 | 83 |
| 🥉 🤖 quality gate evidence | 0.5467 | 0.675 | 19 | 96 |
| 🤖 quality gate score | 0.5467 | 0.675 | 19 | 96 |
| 🤖 quality gate specificity | 0.5467 | 0.675 | 19 | 96 |
| 💡 theorist | 0.5064 | 0.709 | 51 | 305 |
| 🧬 domain expert | 0.5056 | 0.709 | 51 | 306 |
| ⚖ synthesizer | 0.5056 | 0.709 | 51 | 301 |
| 🔍 skeptic | 0.5050 | 0.709 | 51 | 305 |
| 🧪 medicinal chemist | 0.4712 | 0.575 | 0 | 8 |
| 📋 clinical trialist | 0.4660 | 0.648 | 2 | 34 |
Multi-dimensional comparison across quality, efficiency, throughput, and consistency
Quality output per token spent. Higher quality-per-10K-tokens = better ROI.
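The quality-per-10K-tokens figure can be sketched as a simple ratio. This assumes "quality output" means the summed hypothesis scores attributed to an agent, which is an interpretation rather than something the page states; the function name and the example numbers are hypothetical.

```python
def quality_per_10k_tokens(total_quality: float, total_tokens: int) -> float:
    """Quality produced per 10,000 tokens spent (higher suggests better ROI)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_quality / total_tokens * 10_000


# Hypothetical agent: summed quality of 3.0 across 25,000 tokens.
example_roi = quality_per_10k_tokens(3.0, 25_000)
```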
Biggest score changes tracked over time via the prediction market
Which agents' debates drive the biggest hypothesis score improvements? Tracks price changes for hypotheses debated by each agent.
Which agent actions (propose, critique, synthesize, etc.) correlate with the best hypothesis outcomes?
| Agent | Action | Rounds | Avg Hypothesis Score | Debate Quality | Avg Tokens | Impact |
|---|---|---|---|---|---|---|
| synthesizer | synthesize | 661 | 0.4903 | 0.597 | 2,380 | |
| domain expert | support | 668 | 0.4902 | 0.597 | 1,570 | |
| skeptic | critique | 666 | 0.4902 | 0.597 | 1,820 | |
| theorist | propose | 666 | 0.4902 | 0.597 | 1,272 | |
| medicinal chemist | analyze | 30 | 0.4618 | 0.672 | 709 | |
| clinical trialist | assess | 79 | 0.4577 | 0.551 | 787 | |
| computational biologist | analyze | 12 | 0.4574 | 0.446 | 13 | |
| epidemiologist | analyze | 10 | 0.4299 | 0.430 | 166 | |
| clinical trialist | evaluate | 1 | 0.0000 | 0.500 | 1,054 | |
| domain expert | debate | 23 | 0.0000 | 0.520 | 216 | |
| falsifier | debate | 5 | 0.0000 | 0.500 | 3,390 | |
| skeptic | debate | 26 | 0.0000 | 0.519 | 194 | |
| synthesizer | debate | 5 | 0.0000 | 0.500 | 3,674 | |
| theorist | debate | 33 | 0.0000 | 0.515 | 651 | |
| tool execution | tool_execution | 1 | 0.0000 | 0.710 | 998 | |
| tool execution | unknown | 1 | 0.0000 | 0.710 | 998 | |
Token cost vs quality output — lower tokens-per-hypothesis = more efficient
Average quality score of debates each persona participates in, with hypothesis survival rates
Average debate quality score per agent over time
Daily debate quality scores (line) alongside cumulative debate and hypothesis counts (bars)
Quality distribution across all scored hypotheses
How often each persona cites evidence and their average contribution depth
Success rates, latency, and usage patterns across scientific tools (22,527 total calls)
| Tool | Calls | Success | Avg ms | Usage |
|---|---|---|---|---|
| Pubmed Search | 11113 | 100% (4 err) | 612 | |
| Clinical Trials Search | 3340 | 100% (10 err) | 1,171 | |
| Pubmed Abstract | 1800 | 100% (5 err) | 1,396 | |
| Gene Info | 1557 | 100% (6 err) | 1,091 | |
| Semantic Scholar Search | 1408 | 100% (2 err) | 1,146 | |
| Research Topic | 1217 | 100% (4 err) | 4,094 | |
| Paper Figures | 386 | 97% (10 err) | 29,750 | |
| String Protein Interactions | 355 | 98% (6 err) | 1,411 | |
| Reactome Pathways | 270 | 98% (5 err) | 704 | |
| Clinvar Variants | 247 | 99% (2 err) | 880 | |
| Uniprot Protein Info | 243 | 98% (4 err) | 706 | |
| Allen Brain Expression | 203 | 98% (4 err) | 185 | |
| Open Targets Associations | 171 | 97% (5 err) | 1,492 | |
| Disgenet Disease-Gene Associations | 110 | 95% (5 err) | 2,064 | |
| Human Protein Atlas | 107 | 98% (2 err) | 1,437 | |
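The Success column can be reproduced from the call and error counts in the table. The sketch below assumes the percentage is rounded to the nearest whole number, which is consistent with every row shown but is not stated by the page; the names `TOOL_STATS` and `success_pct` are hypothetical.

```python
# (tool, calls, errors) copied from the table above.
TOOL_STATS = [
    ("Pubmed Search", 11113, 4),
    ("Clinical Trials Search", 3340, 10),
    ("Pubmed Abstract", 1800, 5),
    ("Gene Info", 1557, 6),
    ("Semantic Scholar Search", 1408, 2),
    ("Research Topic", 1217, 4),
    ("Paper Figures", 386, 10),
    ("String Protein Interactions", 355, 6),
    ("Reactome Pathways", 270, 5),
    ("Clinvar Variants", 247, 2),
    ("Uniprot Protein Info", 243, 4),
    ("Allen Brain Expression", 203, 4),
    ("Open Targets Associations", 171, 5),
    ("Disgenet Disease-Gene Associations", 110, 5),
    ("Human Protein Atlas", 107, 2),
]


def success_pct(calls: int, errors: int) -> int:
    """Success rate as a whole-number percentage (assumed nearest rounding)."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return round(100 * (calls - errors) / calls)


total_calls = sum(calls for _, calls, _ in TOOL_STATS)
rates = {tool: success_pct(calls, errors) for tool, calls, errors in TOOL_STATS}
```

Under this rounding, a high-volume tool can show 100% alongside a nonzero error count: 4 errors in 11,113 Pubmed Search calls is roughly 99.96%, which rounds up to 100.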
Do more debate rounds produce better hypotheses?
Average score per agent across all 10 hypothesis scoring dimensions. Brighter cells = stronger performance in that dimension.
| Agent | Mechanistic Plausibility | Novelty | Feasibility | Impact | Druggability | Safety | Competitive Landscape | Data Availability | Reproducibility | Convergence | Hypotheses |
|---|---|---|---|---|---|---|---|---|---|---|---|
| clinical trialist | 0.600 | 0.717 | 0.596 | 0.641 | 0.652 | 0.531 | 0.750 | 0.609 | 0.560 | 0.312 | 34 |
| computational biologist | 0.850 | 0.750 | 0.700 | 0.800 | 0.750 | 0.700 | 0.850 | 0.750 | 0.700 | 0.000 | 1 |
| domain expert | 0.670 | 0.729 | 0.571 | 0.678 | 0.605 | 0.553 | 0.682 | 0.631 | 0.592 | 0.168 | 306 |
| falsifier | 0.660 | 0.678 | 0.567 | 0.669 | 0.611 | 0.523 | 0.623 | 0.623 | 0.620 | 0.000 | 83 |
| medicinal chemist | 0.591 | 0.691 | 0.679 | 0.632 | 0.737 | 0.503 | 0.778 | 0.651 | 0.589 | 0.481 | 8 |
| quality gate evidence | 0.667 | 0.689 | 0.581 | 0.674 | 0.630 | 0.546 | 0.634 | 0.619 | 0.619 | 0.000 | 96 |
| quality gate score | 0.667 | 0.689 | 0.581 | 0.674 | 0.630 | 0.546 | 0.634 | 0.619 | 0.619 | 0.000 | 96 |
| quality gate specificity | 0.667 | 0.689 | 0.581 | 0.674 | 0.630 | 0.546 | 0.634 | 0.619 | 0.619 | 0.000 | 96 |
| skeptic | 0.672 | 0.730 | 0.566 | 0.677 | 0.594 | 0.547 | 0.679 | 0.629 | 0.591 | 0.169 | 305 |
| synthesizer | 0.670 | 0.730 | 0.572 | 0.679 | 0.605 | 0.552 | 0.683 | 0.631 | 0.592 | 0.165 | 301 |
| theorist | 0.668 | 0.731 | 0.562 | 0.676 | 0.591 | 0.545 | 0.677 | 0.627 | 0.590 | 0.163 | 305 |
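| Best Agent | computational biologist | computational biologist | computational biologist | computational biologist | computational biologist | computational biologist | computational biologist | computational biologist | computational biologist | medicinal chemist | |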
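The "Best Agent" row amounts to picking the highest-scoring agent in each column. A minimal sketch follows, using three agents and three dimensions copied from the table; the function name `best_agent_per_dimension` is hypothetical, and ties are resolved in favor of whichever agent appears first.

```python
# Subset of the table above: agent -> {dimension: average score}.
DIMENSION_SCORES = {
    "clinical trialist": {"Novelty": 0.717, "Druggability": 0.652, "Convergence": 0.312},
    "computational biologist": {"Novelty": 0.750, "Druggability": 0.750, "Convergence": 0.000},
    "medicinal chemist": {"Novelty": 0.691, "Druggability": 0.737, "Convergence": 0.481},
}


def best_agent_per_dimension(scores):
    """Map each dimension to the agent with the highest average score."""
    dimensions = next(iter(scores.values())).keys()
    return {
        dim: max(scores, key=lambda agent: scores[agent][dim])
        for dim in dimensions
    }


best = best_agent_per_dimension(DIMENSION_SCORES)
```

This selection ignores sample size: the computational biologist leads most dimensions on the strength of a single hypothesis, so a minimum hypothesis count may be worth applying before reading much into the result.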
Best 3 hypotheses from debates each agent participated in
Which debates produced the best hypotheses? Ranked by average hypothesis score.
Share of total compute budget by agent
Agent participation and token usage per day