[Atlas] Canonical disease tagging — close the entity_disease_canonical loop

Effort: medium

> SUPERSEDED 2026-05-20. This spec targeted api.py:73829
> (disease_landing_page) and the v1 entity_disease_canonical
> table. v1 was frozen 2026-05-13 (see AGENTS.md §"v1 FROZEN");
> the api.py JOIN refactor described here cannot be implemented.
> See q-disease-canonical-tagging-v2_spec.md for the v2
> substrate equivalent and 2026-05-10-notebook-disease-recap.md
> for the session context.
>
> v1's Phase 1 broadening (PR #1386) remains live in v1's serving
> window and is the known-good interim behavior. The Phase 3
> backfill script at scripts/backfill_entity_disease_canonical.py
> is forensic / reference material only — do not run against v1.

Background

q-synth-disease-landing_spec.md (Apr 2026) called for entity-link joins
to drive the per-disease landing page. The cherry-pick that shipped
(d734ea71d / f2103dbc4) skipped the entity-link audit and used naive
text matching against hypotheses.disease, analyses.domain, and
challenges.domain. Result: /disease/<slug> is empty for any disease
whose content was tagged with a broad-area label
("neurodegeneration", "neuroinflammation") rather than the specific
disease name — i.e. ~80% of the catalog including all 250 non-ND disease
landing pages from the 9ca528d3a fan-out.

Phase 1 (PR #1386) broadened the live queries with ontology-synonym
expansion as an interim fix. Phase 3 (separate PR) added
scripts/backfill_entity_disease_canonical.py to populate
entity_disease_canonical from existing text fields. This task closes
the remaining gaps so we never silently lose disease tagging again.

Goal

Make disease tagging on new content automatic and machine-checkable, so
the disease landing page works without future backfills.

Acceptance Criteria

☐ entity_disease_canonical is non-empty (≥1000 rows after the
Phase 3 backfill) and growing. Confirmed via a daily SQL count
tracked on /atlas/quality.
☐ Agent prompts that produce hypotheses, analyses, challenges,
and knowledge_gaps rows are updated to request a MONDO ID (or
exact catalog label) in the disease / domain field. The prompt
must mention disease_ontology_catalog and link agents to a
lookup endpoint (/api/diseases/resolve?q=<text> — to build).
☐ New /api/diseases/resolve?q=<text> endpoint returns up to 5
(mondo_id, label, confidence) candidates by reusing the
synonym-index built in scripts/backfill_entity_disease_canonical.py.
Used by both agent prompts and the wiki authoring UI.
☐ On INSERT or UPDATE of hypotheses / analyses / challenges /
knowledge_gaps, a trigger calls the resolver and writes the
best-match (entity_id, mondo_id, confidence) into
entity_disease_canonical — single-disease tagging done at the
database layer, no agent cooperation required.
☐ disease_landing_page (api.py:73829) switches its WHERE
clauses from synonym-expanded ILIKE OR-trees to JOIN through
entity_disease_canonical. Synonym-expansion stays as a fallback
for entities the resolver couldn't match (confidence < 0.4).
☐ tests/test_disease_landing.py extends coverage with one
non-ND disease (e.g. breast-cancer) and asserts ≥1 row in every
panel.
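
A minimal sketch of the resolver criterion above (up to 5 candidates, each a (mondo_id, label, confidence) triple). It assumes a small in-memory synonym index and uses difflib string similarity as the confidence score. The Candidate type, the _RAW_CATALOG entries, and the scoring method are illustrative assumptions, not the index built in scripts/backfill_entity_disease_canonical.py. The two MONDO IDs are sample entries and should be checked against disease_ontology_catalog.

```python
import re
from difflib import SequenceMatcher
from typing import NamedTuple


class Candidate(NamedTuple):
    mondo_id: str
    label: str
    confidence: float


def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Hypothetical catalog sample: (mondo_id, canonical label, synonyms).
_RAW_CATALOG = [
    ("MONDO:0004975", "Alzheimer disease",
     ["Alzheimer disease", "Alzheimer's disease"]),
    ("MONDO:0007254", "breast cancer",
     ["breast cancer", "cancer of the breast"]),
]

# Synonym index: normalized synonym -> (mondo_id, canonical label).
SYNONYM_INDEX = {
    _normalize(syn): (mondo_id, label)
    for mondo_id, label, synonyms in _RAW_CATALOG
    for syn in synonyms
}


def resolve(text: str, top_k: int = 5) -> list[Candidate]:
    """Return up to top_k candidates, best match first."""
    query = _normalize(text)
    if not query:
        return []
    best: dict[str, Candidate] = {}
    for synonym, (mondo_id, label) in SYNONYM_INDEX.items():
        score = SequenceMatcher(None, query, synonym).ratio()
        previous = best.get(mondo_id)
        # Keep only the best-scoring synonym per disease.
        if previous is None or score > previous.confidence:
            best[mondo_id] = Candidate(mondo_id, label, round(score, 3))
    ranked = sorted(best.values(), key=lambda c: c.confidence, reverse=True)
    return ranked[:top_k]


top = resolve("Breast Cancer")[0]
print(top.mondo_id, top.label, top.confidence)
```

Deduplicating by mondo_id before ranking keeps one row per disease, so a disease with many synonyms cannot crowd out the other candidates in the top 5.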

Approach

  • Resolver endpoint. Lift the synonym-index code out of
    scripts/backfill_entity_disease_canonical.py into
    scidex/atlas/disease_resolver.py. Expose
    resolve(text: str, top_k: int = 5) -> list[(mondo_id, label, conf)].
    Wire /api/diseases/resolve in api.py.

  • Write-time trigger. PG function f_canonicalize_disease() that
    fires AFTER INSERT/UPDATE on the four tables and:
    - Extracts disease/domain + title (+ description if present)
    - Calls the resolver via PL/Python OR pushes a job onto a queue table
      that a background worker drains every 60s.
    - The queue approach is preferred: it keeps the write path fast and
      resilient.

  • Agent prompts. Edit .orchestra/prompt.md and the
    debate/analysis/hypothesis generation prompts under scidex/agora/
    to add a "Disease tagging" section: "Set disease to the exact
    label from disease_ontology_catalog matching the entity. If
    unsure, call /api/diseases/resolve?q=<your-text> and use the
    highest-confidence match. Use the canonical English label, not the
    MONDO ID."

  • Query refactor. In disease_landing_page, replace each
    WHERE col ILIKE ANY(...) with
    WHERE id IN (SELECT entity_id FROM entity_disease_canonical WHERE mondo_id = $1).
    Pre-resolve the slug → mondo_id at the top of the function.
    Keep the synonym-expanded version as a fallback so the page
    doesn't go dark while the canonical table is filling in.

  • Quality dashboard. Add entity_disease_canonical_size and
    entity_disease_canonical_freshness (median age) metrics to
    build_quality_dashboard() and surface on /atlas/quality.
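
The write-time trigger and query refactor steps above can be sketched end to end as follows. This is a rough illustration only: it uses an in-memory SQLite database as a stand-in for Postgres, covers one of the four tables and only the INSERT case, and the queue table name disease_resolve_queue, the toy_resolver stub, and all column types are hypothetical. The 0.4 threshold is the fallback cutoff from the acceptance criteria.

```python
import sqlite3

MIN_CONFIDENCE = 0.4  # below this, rows are left to the synonym fallback


def toy_resolver(text):
    """Hypothetical resolver stub: (mondo_id, confidence) or None."""
    if "breast cancer" in text.lower():
        return ("MONDO:0007254", 0.95)
    return None


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hypotheses (id INTEGER PRIMARY KEY, title TEXT, disease TEXT);
CREATE TABLE disease_resolve_queue (entity_id INTEGER, source_text TEXT);
CREATE TABLE entity_disease_canonical (
    entity_id INTEGER PRIMARY KEY, mondo_id TEXT, confidence REAL);
-- The trigger only enqueues, so the write path never waits on the resolver.
CREATE TRIGGER hypotheses_enqueue AFTER INSERT ON hypotheses
BEGIN
    INSERT INTO disease_resolve_queue (entity_id, source_text)
    VALUES (NEW.id, COALESCE(NEW.disease, '') || ' ' || COALESCE(NEW.title, ''));
END;
""")


def drain_queue(conn, resolver):
    """Background-worker step: resolve queued rows, then dequeue them."""
    jobs = conn.execute(
        "SELECT rowid, entity_id, source_text FROM disease_resolve_queue"
    ).fetchall()
    written = 0
    for rowid, entity_id, text in jobs:
        match = resolver(text)
        if match is not None and match[1] >= MIN_CONFIDENCE:
            conn.execute(
                "INSERT OR REPLACE INTO entity_disease_canonical VALUES (?, ?, ?)",
                (entity_id, *match),
            )
            written += 1
        conn.execute("DELETE FROM disease_resolve_queue WHERE rowid = ?", (rowid,))
    conn.commit()
    return written


def hypotheses_for_disease(conn, mondo_id, synonyms):
    """Canonical rows via JOIN, plus a synonym fallback for unmatched rows."""
    canonical = conn.execute(
        "SELECT h.id, h.title FROM hypotheses h "
        "JOIN entity_disease_canonical c ON c.entity_id = h.id "
        "WHERE c.mondo_id = ? ORDER BY h.id",
        (mondo_id,),
    ).fetchall()
    if not synonyms:
        return canonical
    like = " OR ".join(["LOWER(h.disease) LIKE ?"] * len(synonyms))
    fallback = conn.execute(
        "SELECT h.id, h.title FROM hypotheses h "
        "WHERE h.id NOT IN (SELECT entity_id FROM entity_disease_canonical) "
        f"AND ({like}) ORDER BY h.id",
        [f"%{s.lower()}%" for s in synonyms],
    ).fetchall()
    return canonical + fallback


insert = "INSERT INTO hypotheses (title, disease) VALUES (?, ?)"
conn.execute(insert, ("HER2 signalling in breast cancer", "oncology"))
conn.execute(insert, ("BRCA1 loss", "mammary carcinoma"))
written = drain_queue(conn, toy_resolver)
results = hypotheses_for_disease(
    conn, "MONDO:0007254", ["breast cancer", "mammary carcinoma"])
print(written, results)
```

In this sketch the first row is tagged through the canonical table even though its disease field holds a broad-area label, while the second row, which the stub cannot resolve, still appears through the synonym fallback.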

Dependencies

  • PR #1386 (Phase 1 query broadening): landed
  • Phase 3 backfill PR (in flight in the same task chain): populates
    the initial table
  • Plural-routes PR (separate, in flight): once landed, route paths
    may flip from /disease/<slug> to /diseases/<slug>; update
    references in the resolver + dashboard tiles.

Work Log

(to be filled in by the implementer)

File: q-disease-canonical-tagging_spec.md
Modified: 2026-05-20 18:04
Size: 5.5 KB