[Atlas] Canonical disease tagging — close the entity_disease_canonical loop

Effort: medium

> SUPERSEDED 2026-05-20. This spec targeted api.py:73829
> (disease_landing_page) and the v1 entity_disease_canonical
> table. v1 was frozen 2026-05-13 (see AGENTS.md §"v1 FROZEN");
> the api.py JOIN refactor described here cannot be implemented.
> See q-disease-canonical-tagging-v2_spec.md for the v2
> substrate equivalent and 2026-05-10-notebook-disease-recap.md
> for the session context.
>
> v1's Phase 1 broadening (PR #1386) remains live in v1's serving
> window and is the known-good interim behavior. The Phase 3
> backfill script at scripts/backfill_entity_disease_canonical.py
> is forensic / reference material only — do not run against v1.

Background

q-synth-disease-landing_spec.md (Apr 2026) called for entity-link joins
to drive the per-disease landing page. The cherry-pick that shipped
(d734ea71d / f2103dbc4) skipped the entity-link audit and used naive
text matching against hypotheses.disease, analyses.domain, and
challenges.domain. As a result, /disease/<slug> is empty for any disease
whose content was tagged with a broad-area label
("neurodegeneration", "neuroinflammation") rather than the specific
disease name. That covers ~80% of the catalog, including all 250 non-ND
disease landing pages from the 9ca528d3a fan-out.

Phase 1 (PR #1386) broadened the live queries with ontology-synonym
expansion as an interim fix. Phase 3 (separate PR) added
scripts/backfill_entity_disease_canonical.py to populate
entity_disease_canonical from existing text fields. This task closes
the remaining gaps so we never silently lose disease tagging again.

Goal

Make disease tagging on new content automatic and machine-checkable, so
the disease landing page works without future backfills.

Acceptance Criteria

☐ entity_disease_canonical is non-empty (≥1000 rows after the
Phase 3 backfill) and growing. Confirmed via a daily SQL count
tracked on /atlas/quality.

☐ Agent prompts that produce hypotheses, analyses, challenges,
and knowledge_gaps rows are updated to request a MONDO ID (or
exact catalog label) in the disease / domain field. The prompt
must mention disease_ontology_catalog and link agents to a
lookup endpoint (/api/diseases/resolve?q=<text>, to be built).

☐ New /api/diseases/resolve?q=<text> endpoint returns up to 5
(mondo_id, label, confidence) candidates by reusing the
synonym index built in scripts/backfill_entity_disease_canonical.py.
Used by both agent prompts and the wiki authoring UI.

☐ On INSERT or UPDATE of hypotheses / analyses / challenges /
knowledge_gaps, a trigger calls the resolver and writes the
best-match (entity_id, mondo_id, confidence) into
entity_disease_canonical, so that single-disease tagging happens at
the database layer with no agent cooperation required.

☐ disease_landing_page (api.py:73829) switches its WHERE
clauses from synonym-expanded ILIKE OR-trees to a JOIN through
entity_disease_canonical. Synonym expansion stays as a fallback
for entities the resolver couldn't match (confidence < 0.4).

☐ tests/test_disease_landing.py extends coverage with one
non-ND disease (e.g. breast-cancer) and asserts ≥1 row in every
panel.

Approach

Resolver endpoint. Lift the synonym-index code out of
scripts/backfill_entity_disease_canonical.py into
scidex/atlas/disease_resolver.py. Expose
resolve(text: str, top_k: int = 5) -> list[(mondo_id, label, conf)].
Wire /api/diseases/resolve in api.py.
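
As a rough illustration of the resolve() contract above, the sketch below ranks candidates from a tiny in-memory synonym index. It is not the code from the backfill script: the index contents, the placeholder MONDO IDs, the _normalize helper, and the use of difflib similarity as the confidence score are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical miniature synonym index: normalized synonym -> (mondo_id, label).
# The real index would come from disease_ontology_catalog; these IDs are placeholders.
SYNONYM_INDEX = {
    "alzheimer disease": ("MONDO:PLACEHOLDER-AD", "Alzheimer disease"),
    "alzheimer's disease": ("MONDO:PLACEHOLDER-AD", "Alzheimer disease"),
    "breast cancer": ("MONDO:PLACEHOLDER-BC", "breast cancer"),
    "breast carcinoma": ("MONDO:PLACEHOLDER-BC", "breast cancer"),
}


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so lookups are case- and spacing-insensitive."""
    return " ".join(text.lower().split())


def resolve(text: str, top_k: int = 5) -> list[tuple[str, str, float]]:
    """Return up to top_k (mondo_id, label, confidence) candidates, best first."""
    query = _normalize(text)
    if not query:
        return []
    best: dict[str, tuple[str, float]] = {}
    for synonym, (mondo_id, label) in SYNONYM_INDEX.items():
        # Exact synonym hits score 1.0; otherwise fall back to string similarity.
        conf = 1.0 if synonym == query else SequenceMatcher(None, query, synonym).ratio()
        # Keep only the best-scoring synonym per disease.
        if mondo_id not in best or conf > best[mondo_id][1]:
            best[mondo_id] = (label, conf)
    ranked = sorted(
        ((mondo_id, label, conf) for mondo_id, (label, conf) in best.items()),
        key=lambda row: row[2],
        reverse=True,
    )
    return ranked[:top_k]
```

Deduplicating by mondo_id before truncating keeps several synonyms of one disease from crowding other candidates out of the top_k slots.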

Write-time trigger. Add a PG function f_canonicalize_disease() that
fires AFTER INSERT OR UPDATE on the four tables and:
- Extracts disease/domain + title (+ description if present).
- Either calls the resolver via PL/Python or pushes a job onto a queue
  table that a background worker drains every 60s.

The queue approach is preferred because it keeps the write path fast
and resilient.
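
The worker side of the queue approach could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for Postgres so that it runs on its own. The queue table name disease_tag_queue, both table schemas, and the make_demo_db and drain_queue helpers are hypothetical; the spec does not define them. In this design the trigger would only insert (entity_id, text) into the queue, and a scheduler would call drain_queue about every 60 seconds.

```python
import sqlite3
from typing import Callable

# A resolver takes free text and returns (mondo_id, label, confidence) candidates.
Resolver = Callable[[str], list[tuple[str, str, float]]]


def make_demo_db() -> sqlite3.Connection:
    """Create hypothetical queue and canonical tables in memory (stand-in for PG)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE disease_tag_queue (
            job_id INTEGER PRIMARY KEY AUTOINCREMENT,
            entity_id TEXT NOT NULL,
            text TEXT NOT NULL
        );
        CREATE TABLE entity_disease_canonical (
            entity_id TEXT PRIMARY KEY,
            mondo_id TEXT NOT NULL,
            confidence REAL NOT NULL
        );
        """
    )
    return conn


def drain_queue(conn: sqlite3.Connection, resolver: Resolver, batch_size: int = 100) -> int:
    """Resolve up to batch_size queued jobs, oldest first; return how many were processed."""
    jobs = conn.execute(
        "SELECT job_id, entity_id, text FROM disease_tag_queue ORDER BY job_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for job_id, entity_id, text in jobs:
        candidates = resolver(text)
        if candidates:
            mondo_id, _label, confidence = candidates[0]  # best match comes first
            conn.execute(
                "INSERT INTO entity_disease_canonical (entity_id, mondo_id, confidence) "
                "VALUES (?, ?, ?) "
                "ON CONFLICT(entity_id) DO UPDATE SET "
                "mondo_id = excluded.mondo_id, confidence = excluded.confidence",
                (entity_id, mondo_id, confidence),
            )
        # Remove the job whether or not a match was found, so it is not retried forever.
        conn.execute("DELETE FROM disease_tag_queue WHERE job_id = ?", (job_id,))
    conn.commit()
    return len(jobs)
```

Because the upsert is keyed on entity_id, a later UPDATE of the same row overwrites its earlier tag, which matches the single-disease tagging described in the acceptance criteria.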

Agent prompts. Edit .orchestra/prompt.md and the
debate/analysis/hypothesis generation prompts under scidex/agora/
to add a "Disease tagging" section: "Set disease to the exact
label from disease_ontology_catalog matching the entity. If
unsure, call /api/diseases/resolve?q=<your-text> and use the
highest-confidence match. Use the canonical English label, not the
MONDO ID."

Query refactor. In disease_landing_page, replace each
WHERE col ILIKE ANY(...) with

    WHERE id IN (
      SELECT entity_id FROM entity_disease_canonical WHERE mondo_id = $1
    )

Pre-resolve the slug → mondo_id at the top of the function.
Keep the synonym-expanded version as a fallback so the page
doesn't go dark while the canonical table is filling in.
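
One possible reading of the canonical-plus-fallback filter is sketched below as a small clause builder. The build_disease_filter helper, the column allowlist, and the parameter order ($1 = mondo_id, $2 = confidence threshold, $3 = array of synonym patterns) are assumptions for illustration; the actual query shape in api.py is not shown in this spec.

```python
# Columns the text fallback is allowed to search; an allowlist avoids
# interpolating arbitrary identifiers into the SQL string.
ALLOWED_TEXT_COLUMNS = {"disease", "domain"}

# Entities resolved below this confidence are treated as unmatched (per the spec).
FALLBACK_CONFIDENCE = 0.4


def build_disease_filter(text_col: str) -> str:
    """Build a WHERE clause: canonical match first, synonym ILIKE for unmatched rows.

    Expected bind parameters (assumed): $1 mondo_id, $2 confidence threshold,
    $3 array of ILIKE patterns from synonym expansion.
    """
    if text_col not in ALLOWED_TEXT_COLUMNS:
        raise ValueError(f"unexpected text column: {text_col!r}")
    return (
        "WHERE id IN ("
        "SELECT entity_id FROM entity_disease_canonical "
        "WHERE mondo_id = $1 AND confidence >= $2"
        ") OR ("
        "id NOT IN ("
        "SELECT entity_id FROM entity_disease_canonical WHERE confidence >= $2"
        f") AND {text_col} ILIKE ANY($3)"
        ")"
    )
```

Restricting the ILIKE branch to rows without a confident canonical tag means a row tagged to one disease cannot also surface on another disease's page through a loose text match.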

Quality dashboard. Add entity_disease_canonical_size and
entity_disease_canonical_freshness (median age) metrics to
build_quality_dashboard() and surface on /atlas/quality.
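
The two metrics could be computed along the lines of the sketch below. The canonical_table_metrics helper is hypothetical, and it assumes each row carries a write timestamp, which this spec does not confirm for entity_disease_canonical. Freshness is reported as the median row age in seconds.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def canonical_table_metrics(row_timestamps: list[datetime], now: datetime) -> dict:
    """Return table size and median row age in seconds (None when the table is empty)."""
    ages = [(now - ts).total_seconds() for ts in row_timestamps]
    return {
        "entity_disease_canonical_size": len(ages),
        "entity_disease_canonical_freshness": median(ages) if ages else None,
    }


# Usage with made-up timestamps: rows written 1, 3, and 5 hours before `now`.
now = datetime(2026, 5, 1, tzinfo=timezone.utc)
sample = [now - timedelta(hours=h) for h in (1, 3, 5)]
metrics = canonical_table_metrics(sample, now)
```

A median is less sensitive than a mean to the large batch of old rows a one-off backfill leaves behind, so it gives a steadier signal of whether new writes are still arriving.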

Dependencies

- PR #1386 (Phase 1 query broadening): landed.
- Phase 3 backfill PR (in flight in the same task chain): populates
  the initial table.
- Plural-routes PR (separate, in flight): once landed, route paths
  may flip from /disease/<slug> to /diseases/<slug>; update
  references in the resolver + dashboard tiles.

Work Log

(to be filled in by the implementer)

File: q-disease-canonical-tagging_spec.md

Modified: 2026-05-20 18:04

Size: 5.5 KB