[Atlas] Score 8 registered datasets for quality and provenance
done · analysis:6 · reasoning:5

← Multi-Source Literature Search
8 registered datasets lack quality_score values. Dataset quality scoring supports citation rewards, reuse, and governance.

Verification:

  • 8 datasets have quality_score between 0 and 1
  • Scores consider schema completeness, provenance, citations, license, and reuse readiness
  • Remaining unscored dataset count is <= 0

Start by reading this task's spec and checking for duplicate recent work.
Spec File

Goal

Populate quality scores for registered datasets so dataset reuse, citation rewards, and governance can prioritize well-documented scientific data. Scores should consider schema completeness, provenance, license clarity, citation coverage, and reuse readiness.

Acceptance Criteria

☑ The selected datasets have quality_score values between 0 and 1
☑ Each score is justified by schema, provenance, citation, license, and reuse checks
☑ No dataset receives a high score without real provenance or schema evidence
☑ The before/after unscored-dataset count is recorded

Approach

  • Inspect registered datasets and their schema_json, canonical_path, license, and citation metadata.
  • Evaluate each dataset against a consistent quality rubric.
  • Persist the score and concise rationale using existing database write patterns.
  • Verify score ranges and count reduction.
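The final verification step in the approach above can be sketched as a pair of queries. This is a minimal illustration, not the project's actual check: it uses an in-memory SQLite database as a stand-in for the live DB, and the `name` column, the `verify_batch` helper, and the `example-unscored` row are assumptions made for the example. Only the `datasets` table and its `quality_score` column come from the spec.

```python
import sqlite3


def verify_batch(conn, names):
    """Return (names with a missing or out-of-range score, total unscored count).

    Assumes a hypothetical `name` column identifies each dataset.
    """
    placeholders = ",".join("?" for _ in names)
    bad = [
        row[0]
        for row in conn.execute(
            f"SELECT name FROM datasets WHERE name IN ({placeholders}) "
            "AND (quality_score IS NULL OR quality_score < 0 OR quality_score > 1) "
            "ORDER BY name",
            list(names),
        )
    ]
    unscored = conn.execute(
        "SELECT COUNT(*) FROM datasets WHERE quality_score IS NULL"
    ).fetchone()[0]
    return bad, unscored


# Stand-in database with two scored rows and one hypothetical unscored row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY, quality_score REAL)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?)",
    [("amp-ad-portal", 0.66), ("ukb-ad-gwas", 0.36), ("example-unscored", None)],
)

out_of_range, unscored = verify_batch(conn, ["amp-ad-portal", "ukb-ad-gwas"])
print(out_of_range, unscored)
```

Running the same count before and after a scoring batch gives the before/after unscored figure that the acceptance criteria ask to be recorded.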
Dependencies

  • quest-engine-ci - Generates this task when queue depth is low and unscored datasets exist.

Dependents

  • Dataset citation rewards, quality markets, and Atlas governance depend on dataset quality scores.

Work Log

    2026-04-27 — Slot codex:52 [task:9df5913c-a054-45b9-a29c-653dd58fe7b1]

    • Staleness review: current DB has 53 registered datasets, 20 with quality_score IS NULL; the task's original batch size of 8 is still actionable as a bounded scoring batch, but no longer exhausts all unscored datasets.
    • Schema check: live datasets table has quality_score but no quality_notes column; this batch will add a nullable quality_notes text column so the requested rationale can live with the score.
    • Planned batch: score the 8 oldest currently unscored datasets (wrap-biomarker, ad-trial-tracker, ukb-ad-gwas, allen-aging-mouse-brain, allen-neural-dynamics, amp-ad-portal, bbb-transcytosis-proteomics, braak-staging-neuropath) using a 4-part rubric: provenance completeness, schema conformance, spot-check accuracy, and domain completeness.
    • Verification plan: record before/after unscored counts, insert one dataset_versions audit row per scored dataset, and file specific improvement tasks for any dataset scored below 0.5.
    • Implemented with scripts/score_dataset_quality_batch_9df5913c.py; added nullable live DB column datasets.quality_notes because the task required notes but the table only had quality_score.
    • Before/after: 20 → 12 datasets with quality_score IS NULL; 8 scored datasets now have non-null quality_score and quality_notes.
    • Scores:
    - ukb-ad-gwas: 0.36 — generic UKB URL only, no accession/phenotype/schema/row provenance.
    - ad-trial-tracker: 0.40 — broad Alzheimer’s Association pages, no structured extract/schema/trial IDs.
    - allen-neural-dynamics: 0.42 — broad institutional reference, no pinned release/schema/direct ND scope.
    - allen-aging-mouse-brain: 0.46 — no schema or local rows; naming and source normalization still needed.
    - wrap-biomarker: 0.46 — real WRAP cohort, but no schema/data dictionary/row provenance.
    - bbb-transcytosis-proteomics: 0.52 — plausible cited target set, but no row-level curation/schema.
    - braak-staging-neuropath: 0.57 — accurate staging references, but no tabular schema/row evidence tiers.
    - amp-ad-portal: 0.66 — specific Synapse portal and strong source provenance, but controlled access/no local schema.
    • Created 8 dataset_versions audit rows with rubric components, source URLs, notes, and task ID in diff_stat.
    • Filed remediation tasks for all datasets with score <0.5: 790cfae9-e501-4fb5-be66-34052fe06760 (WRAP), 0e79463f-4092-406c-b902-002ba3b1ae6b (AD trial tracker), c43a0413-2405-47e6-a25c-6b8c7a95d3b4 (UKB AD GWAS), 1d173c30-6e6f-47d0-bb95-84f29b3a4e8d (Allen aging mouse brain), 64f00534-b410-440b-945f-3dcd9b0fc813 (Allen neural dynamics).
    • Verification: python3 -m py_compile scripts/score_dataset_quality_batch_9df5913c.py; SQL check confirmed 53 total datasets, 12 unscored, 8 scored-with-notes for this batch, and 8 task-specific dataset_versions rows.
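The persistence steps in this entry (add a nullable notes column, write score and rationale, record one audit row per dataset, and flag scores below 0.5 for remediation) might look roughly like the sketch below. It is not the contents of scripts/score_dataset_quality_batch_9df5913c.py. It uses in-memory SQLite as a stand-in, and the `name` and `dataset_name` columns and both helper functions are assumptions. The table names, `quality_score`, `quality_notes`, `diff_stat`, the task ID, and the sample scores are taken from the log.

```python
import json
import sqlite3

TASK_ID = "9df5913c-a054-45b9-a29c-653dd58fe7b1"


def ensure_notes_column(conn):
    """Add the nullable quality_notes column only if it is missing."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(datasets)")}
    if "quality_notes" not in cols:
        conn.execute("ALTER TABLE datasets ADD COLUMN quality_notes TEXT")


def persist_batch(conn, scores, threshold=0.5):
    """Write scores, notes, and audit rows; return names scoring below threshold."""
    ensure_notes_column(conn)
    needs_remediation = []
    with conn:  # commit on success, roll back if any write fails
        for name, (score, notes) in scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{name}: score {score} is outside [0, 1]")
            conn.execute(
                "UPDATE datasets SET quality_score = ?, quality_notes = ? WHERE name = ?",
                (score, notes, name),
            )
            # Hypothetical audit-row layout: the task ID travels in diff_stat.
            diff_stat = json.dumps({"task_id": TASK_ID, "quality_score": score, "notes": notes})
            conn.execute(
                "INSERT INTO dataset_versions (dataset_name, diff_stat) VALUES (?, ?)",
                (name, diff_stat),
            )
            if score < threshold:
                needs_remediation.append(name)
    return needs_remediation


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY, quality_score REAL)")
conn.execute("CREATE TABLE dataset_versions (dataset_name TEXT, diff_stat TEXT)")
conn.executemany(
    "INSERT INTO datasets (name) VALUES (?)",
    [("ukb-ad-gwas",), ("bbb-transcytosis-proteomics",), ("amp-ad-portal",)],
)

flagged = persist_batch(
    conn,
    {
        "ukb-ad-gwas": (0.36, "generic UKB URL only"),
        "bbb-transcytosis-proteomics": (0.52, "no row-level curation/schema"),
        "amp-ad-portal": (0.66, "controlled access/no local schema"),
    },
)
print(flagged)
```

Wrapping the writes in a single transaction keeps a score from landing without its audit row, which matters if the audit trail is what later justifies the score.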

    2026-04-26 — Slot claude-auto:41 [task:af13bd51-396c-4f04-a980-c14b14acc9cc]

    • Before: 28 datasets with NULL quality_score (36 total, 8 already scored)
    • Scored 25 datasets using a 4-dimension rubric (maximum 10 points, stored as points/10, i.e. a float between 0 and 1):
    - data_completeness (0-3): all Biomni parity datasets = 1 (URL+metadata, no schema/local data)
    - documentation_quality (0-3): 2 for major open databases (GWAS Catalog, Ensembl, AlphaFold etc.), 1 for minimal
    - license_openness (0-2): 2=CC-BY/CC0/public domain, 1=registration free, 0=DUA/restricted application
    - reproducibility (0-2): 2=versioned public URL, 1=registration required
    • Score distribution: 12 at 0.70 (open + well-documented), 1 at 0.60, 1 at 0.50, 1 at 0.40, 10 at 0.30 (restricted Synapse/ADNI/NIAGADS)
    • After: 3 datasets remain unscored (ukb-ad-gwas, wrap-biomarker, ad-trial-tracker — deferred)
    • Script: scripts/score_datasets_quality.py
    • Score range across all 33 scored datasets: min=0.30, max=0.85, avg=0.57
    • Acceptance criteria satisfied: 25 non-null quality_score values, integers 3–7 out of 10 (0.3–0.7 range)
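The arithmetic of the 4-dimension rubric in this entry can be illustrated as follows. The dimension names and their maxima come from the log. The `RubricScore` class is a hypothetical wrapper, not the code in scripts/score_datasets_quality.py, and the component split shown for the restricted example is an assumed combination that happens to total 3 points, not the recorded breakdown.

```python
from dataclasses import dataclass

# Maximum points per dimension, as listed in the work log (total 10).
MAXIMA = {
    "data_completeness": 3,
    "documentation_quality": 3,
    "license_openness": 2,
    "reproducibility": 2,
}


@dataclass(frozen=True)
class RubricScore:
    data_completeness: int
    documentation_quality: int
    license_openness: int
    reproducibility: int

    def quality_score(self) -> float:
        """Sum the four components and scale the 0-10 total to 0-1."""
        total = 0
        for field, maximum in MAXIMA.items():
            value = getattr(self, field)
            if not 0 <= value <= maximum:
                raise ValueError(f"{field}={value} is outside 0..{maximum}")
            total += value
        return round(total / 10, 2)


# Open, well-documented database: URL+metadata only (1), major open
# database docs (2), open license (2), versioned public URL (2).
open_db = RubricScore(1, 2, 2, 2)

# Restricted dataset: assumed split for illustration only.
restricted = RubricScore(1, 1, 0, 1)

print(open_db.quality_score(), restricted.quality_score())
```

Because every component is an integer, stored scores under this rubric fall on multiples of 0.1, which is consistent with the 0.30 to 0.70 distribution reported for the batch.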

    Payload JSON
    {
      "requirements": {
        "analysis": 6,
        "reasoning": 5
      }
    }
