SciDEX — Task: [Atlas] Extraction quality scoring and confidence

Quality scoring for extracted experiments: completeness, statistical rigor, consistency, calibrated confidence

Completion Notes

Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle

Git Commits (13)

[Docs] Spec the shipped experiment-extras work + capture deferred ideas + fix stale handler refs [task:experiment-extras-docs-2026-05-18] (#1419) 2026-05-18

[Docs] Spec the shipped experiment-extras work + capture deferred ideas + fix stale handler refs [task:experiment-extras-docs-2026-05-18] 2026-05-18

Squash merge: orchestra/task/atl-ex-0-api-endpoints-for-experiment-browsing-se (7 commits) 2026-04-26

Squash merge: atlas/atl-ex-04-QUAL-push (2 commits) 2026-04-26

[Atlas] Update spec work log for extraction quality scoring [task:atl-ex-04-QUAL] 2026-04-25

[Atlas] Extraction quality scoring and confidence calibration [task:atl-ex-04-QUAL] 2026-04-25

Squash merge: orchestra/task/atl-ex-0-meta-analysis-support-aggregate-results (2 commits) 2026-04-25

[Atlas] Replication tracking: clustering module + /api/experiments/replication/{entity} [task:atl-ex-05-REPL] 2026-04-25

Squash merge: orchestra/task/atl-ex-0-build-llm-extraction-pipeline-from-paper (2 commits) 2026-04-15

Squash merge: orchestra/task/atl-ex-0-backfill-188-existing-experiment-artifac (1 commits) 2026-04-15

[Atlas] Auto-link extracted experiments to KG entities [task:atl-ex-03-LINK] 2026-04-13

[Docs] Update atl-ex-01-SCHM work log: implementation complete [task:atl-ex-01-SCHM] 2026-04-13

[Atlas] Add experiment extraction constants and validate_experiment_metadata() [task:atl-ex-01-SCHM] 2026-04-13

Spec File

Goal

Build a quality scoring system for extracted experiments that assesses completeness,
consistency, and extraction confidence. Calibrate the confidence scores so that they are
well-calibrated: for example, extractions marked 0.9 confidence should be correct about 90% of the time.

Acceptance Criteria

☐ Quality score computed from: field completeness, statistical rigor, internal consistency

☐ Completeness score: fraction of schema fields populated (weighted by importance)

☐ Statistical rigor: presence of p-values, confidence intervals, effect sizes, sample sizes

☐ Consistency: results match conclusions, effect direction matches measurements

☐ Confidence calibration: sample-verify 50 extractions, plot calibration curve

☐ Quality feeds into artifact quality_score via propagate_quality()

☐ Low-confidence extractions flagged for human review or re-extraction

☐ API: quality distribution dashboard showing extraction health
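
To make the completeness and statistical rigor criteria above concrete, here is a minimal sketch. The field names and weights (FIELD_WEIGHTS, RIGOR_FIELDS) and the function names are hypothetical placeholders, not the actual SciDEX schema or the shipped implementation.

```python
from typing import Any, Mapping

# Hypothetical field weights; the real schema fields and weights are not given in this spec.
FIELD_WEIGHTS = {
    "hypothesis": 3.0,
    "method": 2.0,
    "sample_size": 2.0,
    "result_summary": 3.0,
    "organism": 1.0,
}

# Hypothetical metadata keys for the four evidence types the criteria list.
RIGOR_FIELDS = ("p_value", "confidence_interval", "effect_size", "sample_size")


def is_populated(value: Any) -> bool:
    """Treat None, empty strings, and empty containers as missing."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, dict, set)):
        return len(value) > 0
    return True


def completeness_score(metadata: Mapping[str, Any]) -> float:
    """Weighted fraction of schema fields that are populated, in [0, 1]."""
    total = sum(FIELD_WEIGHTS.values())
    filled = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if is_populated(metadata.get(field))
    )
    return filled / total


def statistical_rigor_score(metadata: Mapping[str, Any]) -> float:
    """Fraction of the four statistical evidence types that are present."""
    present = sum(1 for field in RIGOR_FIELDS if is_populated(metadata.get(field)))
    return present / len(RIGOR_FIELDS)
```

With these placeholder weights, a record that fills hypothesis, sample_size, and result_summary but leaves method empty scores 8/11 on completeness. A presence-only rigor score is the simplest reading of the criterion; it does not judge whether the reported statistics are plausible.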

Dependencies

atl-ex-02-PIPE — Extraction pipeline produces the experiments to score
atl-ex-03-LINK — Entity linking quality is a scoring factor

Dependents

atl-ex-05-REPL — Replication tracking needs quality-filtered experiments

Work Log

2026-04-26 07:00 PT — Slot 77

Staleness review: Task valid — extraction pipeline exists (atl-ex-02-PIPE), entity linking exists (atl-ex-03-LINK), but no quality scoring system in place
Schema analysis: experiments table has 34 columns, no quality_score. Artifact metadata has statistical_evidence, sample_size, replication_status, extraction_metadata, ambiguities
Quality scoring design:

- Extract quality components from experiment artifact metadata JSON
- completeness_score: weighted fraction of schema fields populated
- statistical_rigor_score: presence of p-values, CIs, effect sizes, sample sizes
- internal_consistency_score: results vs conclusions, effect direction check
- aggregate quality_score = 0.4*completeness + 0.35*statistical_rigor + 0.25*consistency
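
The weighted combination in the design above can be sketched as follows. The weights are the ones stated in this work log; the function name weighted_quality and the range check are illustrative assumptions, not the module's actual code.

```python
# Weights as stated in the work log entry above.
WEIGHTS = {"completeness": 0.40, "statistical_rigor": 0.35, "consistency": 0.25}


def weighted_quality(completeness: float, statistical_rigor: float, consistency: float) -> float:
    """Combine three sub-scores in [0, 1] into one quality score in [0, 1]."""
    scores = {
        "completeness": completeness,
        "statistical_rigor": statistical_rigor,
        "consistency": consistency,
    }
    for name, value in scores.items():
        # Assumed validation: reject sub-scores outside the unit interval.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, sub-scores of 0.8, 0.5, and 1.0 give 0.32 + 0.175 + 0.25 = 0.745. Because the weights sum to 1, the aggregate stays in [0, 1] whenever the sub-scores do.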

Approach: New module scidex/agora/extraction_quality.py with score_experiment(), calibrate_confidence(), propagate_quality_to_experiment(), get_quality_distribution()
API: New endpoint /api/experiments/quality-distribution returning histogram + stats
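
The response shape of the quality-distribution endpoint is not documented in this log. As a rough sketch of what "histogram + stats" could look like, assuming scores in [0, 1] and equal-width bins (the function name and the dictionary keys are hypothetical):

```python
from statistics import mean, median
from typing import Sequence


def quality_distribution(scores: Sequence[float], n_bins: int = 10) -> dict:
    """Summarize quality scores in [0, 1] as an equal-width histogram plus stats."""
    histogram = [0] * n_bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # A score of exactly 1.0 is counted in the last bin.
        histogram[min(int(score * n_bins), n_bins - 1)] += 1
    return {
        "count": len(scores),
        "histogram": histogram,
        "mean": mean(scores) if scores else None,
        "median": median(scores) if scores else None,
    }
```

A dictionary like this serializes directly to JSON, which is one reason to keep the summary flat.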

2026-04-26 08:30 PT — Commit and push

Implementation complete: scidex/agora/extraction_quality.py (700+ lines)

- compute_completeness_score(): weighted field population (40% weight)
- compute_statistical_rigor_score(): p-values, CIs, effect sizes, sample sizes (35% weight)
- compute_consistency_score(): results/conclusions alignment, effect direction (25% weight)
- aggregate_quality_score(): weighted combination of sub-scores
- score_experiment(): compute all sub-scores, update artifact + experiments table
- calibrate_confidence(): stratified calibration with recalibration hints
- propagate_quality_to_experiment(): update quality via artifact_registry's propagate_quality
- flag_low_confidence_experiments(): threshold-based flagging for human review
- get_quality_distribution(): dashboard stats from 647 experiments
- score_all_experiments(): batch scoring for experiments lacking quality_score
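
As an illustration of threshold-based flagging, the sketch below returns the IDs of experiments whose confidence is missing or below a cutoff. The record keys ("id", "confidence"), the function name, and the default threshold of 0.5 are assumptions; the log does not state the threshold used by flag_low_confidence_experiments().

```python
from typing import Iterable, List, Mapping


def flag_for_review(experiments: Iterable[Mapping], threshold: float = 0.5) -> List[str]:
    """Return IDs of experiments whose confidence is missing or below threshold."""
    flagged = []
    for experiment in experiments:
        confidence = experiment.get("confidence")
        # Treat a missing confidence as low confidence (an assumption of this sketch).
        if confidence is None or confidence < threshold:
            flagged.append(experiment["id"])
    return flagged
```

Flagging records with no confidence at all is a conservative choice: it routes unscored extractions to review rather than silently passing them.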

API endpoint: GET /api/experiments/quality-distribution returns health dashboard
Current state: 647 experiments scored, health_status=healthy, 7 low-confidence flagged
Commit: af6912a80 — [Atlas] Extraction quality scoring and confidence calibration [task:atl-ex-04-QUAL]
Pushed: orchestra/task/atl-ex-0-extraction-quality-scoring-and-confidenc
Note: Calibration analysis indicates the system needs recalibration (calibration_error=0.253; the recalibration hints report the [0.5-0.7] bin as under-confident). This is a known finding, not a bug.
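
The log does not say how calibration_error is computed. One common definition is the expected calibration error (ECE): bin the extractions by stated confidence, compare each bin's mean confidence with its observed accuracy, and average the gaps weighted by bin size. The sketch below assumes that definition and equal-width bins; the function names and row keys are hypothetical.

```python
from typing import Dict, List, Sequence


def calibration_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> List[Dict]:
    """Group verified extractions into equal-width confidence bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    rows = []
    for index, items in enumerate(bins):
        if not items:
            continue  # Skip empty bins; they carry no evidence.
        rows.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "n": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        })
    return rows


def expected_calibration_error(rows: List[Dict]) -> float:
    """Size-weighted mean of |accuracy - mean_confidence| over non-empty bins."""
    total = sum(row["n"] for row in rows)
    return sum(
        row["n"] / total * abs(row["accuracy"] - row["mean_confidence"]) for row in rows
    )
```

A bin whose accuracy exceeds its mean confidence is under-confident, and the reverse is over-confident. Plotting mean_confidence against accuracy per bin gives the calibration curve named in the acceptance criteria; a well-calibrated system lies on the diagonal.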

Payload JSON

{
  "requirements": {
    "analysis": 5
  }
}

Sibling Tasks in Quest (Experiment Extraction)

○ [Atlas] CI: Verify experiment extraction quality metrics and extract from new papers P88

✓ [Atlas] Define experiment extraction schemas per experiment type P93

✓ [Atlas] Auto-link extracted experiments to KG entities P93

✓ [Atlas] Backfill 188 existing experiment artifacts with structured metadata P93

✓ [Atlas] Build LLM extraction pipeline from paper abstracts and full text P92

✓ [Atlas] API endpoints for experiment browsing, search, and filtering P87

✓ [Atlas] Replication tracking: match experiments testing same hypothesis P86

✓ [Atlas] Meta-analysis support: aggregate results across experiments P84

[Atlas] Extraction quality scoring and confidence calibration (done, analysis:5)