[v2 Substrate] Canonical disease tagging — entity ↔ MONDO at write-time

Effort: deep

Background

v1 spec q-synth-disease-landing (Apr 2026) called for entity-link
joins to drive the per-disease landing page. The implementation
instead shipped naive text matching against hypotheses.disease /
analyses.domain / challenges.domain. Result: /disease/<slug> was
empty for any disease whose content was tagged with a broad-area label
("neurodegeneration", "neuroinflammation") rather than the
specific disease name, i.e. ~80% of the catalog, including all 250
non-ND disease landing pages from the 9ca528d3a fan-out.

v1 PR #1386 broadened the live queries with ontology-synonym expansion
as an interim fix. PR #1393 added a one-shot Python backfill script,
which never ran before the 2026-05-13 v1 freeze. The original
Phase 4 spec (q-disease-canonical-tagging_spec.md) targeted
api.py:73829 for the JOIN refactor; with v1 frozen, that path is now
closed.

This spec retargets the durable fix at v2. See docs/planning/2026-05-10-notebook-disease-recap.md for the full
forensic context (two-design-generation history, the empty join table,
the 250 orphaned dashboards).

Goal

In v2 substrate (SciDEX-AI/SciDEX-Substrate), make disease tagging
on new content automatic and machine-checkable so the per-disease
landing page is built from indexed entity↔MONDO joins, never
free-text fragments.

Acceptance Criteria

☐ Substrate database has a canonical-disease join table (analogue of
v1 entity_disease_canonical) with columns (entity_id, entity_type,
mondo_id, confidence, resolved_at) and indexes on mondo_id and
entity_id.

☐ Substrate database has a disease_ontology_catalog analogue seeded
from MONDO + synonyms (port the existing v1 seed query or re-pull
from the source). At minimum: mondo_id, label, synonyms (jsonb),
vertical, plus the existing v1 cross-walk columns.

☐ A substrate FastAPI route GET /api/diseases/resolve?q=<text> returns
up to 5 (mondo_id, label, confidence) candidates by synonym match.
Ranking: exact label > exact synonym > word-boundary substring match.

☐ Substrate's hypothesis / analysis / challenge / gap insert hooks call
the resolver on the way in and write the best match to the join
table. No PG triggers: a substrate service hook is the cleaner
pattern, so the resolver lives in code, not PL/pgSQL.

☐ Substrate's /disease/<slug> (or whatever the canonical v2 path is)
builds every section from a JOIN through the canonical table on
mondo_id. No ILIKE fallbacks; tagging is enforced at write time so
the join is exact.

☐ Substrate agent prompts (whichever path produces hypotheses /
analyses) require a MONDO ID or exact catalog label in the disease
field. The prompt links agents to the resolver endpoint.

☐ An integration test seeds a non-ND disease (e.g. breast-cancer)
end-to-end and asserts the disease page renders ≥1 row in every
panel from the join, not from text fallback.

☐ A quality-dashboard tile shows entity_disease_canonical_size and
median resolution age.
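
The join table in the first criterion and the two numbers behind the dashboard tile can be sketched as follows. This is a minimal illustration, not the substrate schema: it uses Python's sqlite3 as a stand-in for the substrate database, and the primary key, index names, ISO-8601 text timestamps, and the tile_metrics helper are all assumptions. The MONDO IDs are sample values only.

```python
import sqlite3
import statistics
from datetime import datetime, timezone

# Columns come from the acceptance criteria; everything else is assumed.
DDL = """
CREATE TABLE entity_disease_canonical (
    entity_id   TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    mondo_id    TEXT NOT NULL,
    confidence  REAL NOT NULL,
    resolved_at TEXT NOT NULL,            -- ISO-8601 UTC timestamp (assumption)
    PRIMARY KEY (entity_id, entity_type)  -- assumption: one best match per entity
);
CREATE INDEX idx_edc_mondo_id  ON entity_disease_canonical (mondo_id);
CREATE INDEX idx_edc_entity_id ON entity_disease_canonical (entity_id);
"""


def tile_metrics(conn, now):
    """Return (row count, median resolution age in seconds) for the tile."""
    rows = conn.execute(
        "SELECT resolved_at FROM entity_disease_canonical"
    ).fetchall()
    ages = [(now - datetime.fromisoformat(r[0])).total_seconds() for r in rows]
    return len(rows), (statistics.median(ages) if ages else None)


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.executemany(
    "INSERT INTO entity_disease_canonical VALUES (?, ?, ?, ?, ?)",
    [
        ("h-1", "hypothesis", "MONDO:0004975", 1.0, "2026-05-20T00:00:00+00:00"),
        ("a-1", "analysis", "MONDO:0007254", 0.9, "2026-05-20T12:00:00+00:00"),
    ],
)

now = datetime(2026, 5, 21, tzinfo=timezone.utc)
size, median_age = tile_metrics(conn, now)
```

Passing `now` in explicitly keeps the age calculation deterministic, which makes the tile easy to cover in a unit test.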

Approach

Resolver service. Implement
scidex_substrate/atlas/disease_resolver.py with
resolve(text: str, top_k: int = 5) -> list[ResolverHit]. Build the
synonym index lazily on first call and cache it in-process. Reference
implementation: the index-build code in v1's
scripts/backfill_entity_disease_canonical.py.
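
A rough sketch of the resolver shape described above, with a lazily built, in-process cached index and the three ranking tiers from the acceptance criteria. Only the resolve signature and the ResolverHit name come from this spec. The fields of ResolverHit, the in-memory _CATALOG rows, the confidence values (1.0 / 0.9 / 0.6), and the choice to match catalog terms inside the input text are illustrative assumptions, not the v1 algorithm.

```python
import re
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ResolverHit:
    mondo_id: str
    label: str
    confidence: float


# Hypothetical stand-in for rows read from the disease ontology catalog.
_CATALOG = [
    {"mondo_id": "MONDO:0004975", "label": "Alzheimer disease",
     "synonyms": ["Alzheimer's disease"]},
    {"mondo_id": "MONDO:0007254", "label": "breast cancer",
     "synonyms": ["breast carcinoma", "mammary cancer"]},
]


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so lookups are case-insensitive."""
    return " ".join(text.lower().split())


@lru_cache(maxsize=1)
def _index():
    """Build the label and synonym maps once, on first use."""
    labels, synonyms = {}, {}
    for row in _CATALOG:
        labels[_norm(row["label"])] = row
        for syn in row["synonyms"]:
            synonyms.setdefault(_norm(syn), row)
    return labels, synonyms


def resolve(text: str, top_k: int = 5) -> list[ResolverHit]:
    labels, synonyms = _index()
    query = _norm(text)
    if not query:
        return []
    best = {}  # mondo_id -> highest-confidence hit seen so far

    def offer(row, confidence):
        current = best.get(row["mondo_id"])
        if current is None or confidence > current.confidence:
            best[row["mondo_id"]] = ResolverHit(
                row["mondo_id"], row["label"], confidence
            )

    if query in labels:        # tier 1: exact label
        offer(labels[query], 1.0)
    if query in synonyms:      # tier 2: exact synonym
        offer(synonyms[query], 0.9)
    # Tier 3: a catalog term appears in the text on word boundaries.
    for term, row in {**synonyms, **labels}.items():
        if re.search(rf"\b{re.escape(term)}\b", query):
            offer(row, 0.6)
    return sorted(best.values(), key=lambda h: -h.confidence)[:top_k]
```

Because the index is cached per process, a catalog reseed would need a restart or an explicit `_index.cache_clear()` call; how substrate should handle that is not decided here.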

Insert hooks. Wherever substrate writes a hypothesis / analysis /
challenge / gap, call the resolver on the
disease/domain/title/description payload and write the top match
to the canonical join table in the same transaction. Skip
substrate triggers / PL/Python; keep the logic in app code.
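
One way the "same transaction" requirement could look in app code is sketched below. It assumes sqlite3 as a stand-in database, a hypothetical hypotheses table, and a resolver passed in as a callable that returns (mondo_id, confidence) pairs; none of these names are taken from substrate. What to do when nothing resolves is not specified above, so this sketch simply writes no tag in that case.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE hypotheses (id TEXT PRIMARY KEY, title TEXT, disease TEXT);
CREATE TABLE entity_disease_canonical (
    entity_id TEXT, entity_type TEXT, mondo_id TEXT,
    confidence REAL, resolved_at TEXT
);
"""


def insert_hypothesis(conn, hyp_id, title, disease, resolver):
    """Write the hypothesis and its canonical disease tag atomically."""
    hits = resolver(f"{disease} {title}")
    with conn:  # commits on success, rolls back if anything below raises
        conn.execute(
            "INSERT INTO hypotheses (id, title, disease) VALUES (?, ?, ?)",
            (hyp_id, title, disease),
        )
        if hits:
            mondo_id, confidence = hits[0]
            conn.execute(
                "INSERT INTO entity_disease_canonical "
                "(entity_id, entity_type, mondo_id, confidence, resolved_at) "
                "VALUES (?, 'hypothesis', ?, ?, ?)",
                (hyp_id, mondo_id, confidence,
                 datetime.now(timezone.utc).isoformat()),
            )
    return hits[0] if hits else None


def stub_resolver(text):
    """Placeholder resolver; a real one would consult the ontology catalog."""
    return [("MONDO:0007254", 1.0)] if "breast cancer" in text.lower() else []


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tag = insert_hypothesis(
    conn, "h-1", "HER2 signalling in metastasis", "breast cancer", stub_resolver
)
```

Injecting the resolver keeps the hook testable without building the real synonym index.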

Page refactor. Substrate's disease landing handler joins through the
canonical table from line 1. No synonym-expansion fallback:
write-time tagging is enforced.
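
To illustrate the difference from v1's text matching, here is a small sketch of a single panel query that goes only through the canonical table. The table layout, the hypotheses_panel helper, and the sample rows are assumptions for the example, with sqlite3 standing in for the substrate database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hypotheses (id TEXT PRIMARY KEY, title TEXT, disease TEXT);
CREATE TABLE entity_disease_canonical (
    entity_id TEXT, entity_type TEXT, mondo_id TEXT,
    confidence REAL, resolved_at TEXT
);
-- h-1 carries a broad-area label but has a canonical tag.
-- h-2 mentions the disease in free text but was never tagged.
INSERT INTO hypotheses VALUES
    ('h-1', 'HER2 signalling in metastasis', 'oncology'),
    ('h-2', 'Untagged note mentioning breast cancer', 'breast cancer');
INSERT INTO entity_disease_canonical VALUES
    ('h-1', 'hypothesis', 'MONDO:0007254', 1.0, '2026-05-20T00:00:00+00:00');
""")

PANEL_SQL = """
SELECT h.id, h.title
FROM entity_disease_canonical AS edc
JOIN hypotheses AS h ON h.id = edc.entity_id
WHERE edc.entity_type = 'hypothesis' AND edc.mondo_id = ?
ORDER BY h.id
"""


def hypotheses_panel(conn, mondo_id):
    """Rows for one panel, selected purely by the indexed mondo_id join."""
    return conn.execute(PANEL_SQL, (mondo_id,)).fetchall()


panel = hypotheses_panel(conn, "MONDO:0007254")
```

The tagged row appears even though its free-text field says "oncology", and the untagged row is excluded even though its text names the disease, which is the behaviour the integration test in the acceptance criteria is meant to pin down.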

Agent prompts. Add a "Disease tagging" paragraph to substrate's
debate / analysis / hypothesis prompts naming the resolver
endpoint and the MONDO-label requirement.

One-time backfill of inherited v1 rows. When substrate imports v1
content (if/when that happens), run the resolver on each imported
row and populate the join table as part of the import pipeline.
The v1 script scripts/backfill_entity_disease_canonical.py is the
reference algorithm: port it to substrate Python; do not run the
v1 version against the v1 DB.

Dependencies

v1 PR #1386: interim Phase 1 broadening (kept in v1's serving
window, no v2 dependency)

v1 PR #1393: script + this spec's predecessor; reference only

v2 substrate's hypothesis / analysis / challenge insert paths
(whatever modules own those writes today)

MONDO ontology data source (whatever substrate uses for ontology
catalogs)

Non-Goals

Reviving the 250 v1 disease-landing-<slug> dashboard artifacts.
They're keyed to v1 entities and an obsolete view_spec_json
format. Substrate seeds its own dashboards (or omits the duplicate
layer entirely, since the broadened page already does the synthesis).

Backfilling v1 entity_disease_canonical. v1 is frozen; the
Phase 1 broadened queries work well enough for v1's remaining
serving window.

Migrating the 30+ legacy v1 scripts/backfill_*.py files to
scripts/oneoff/. Out of scope; v2 starts with the new convention.

Work Log

(empty — to be filled in by the substrate implementer)

File: q-disease-canonical-tagging-v2_spec.md

Modified: 2026-05-20 18:04

Size: 5.6 KB