[v2 Substrate] Canonical disease tagging — entity ↔ MONDO at write-time

Effort: deep

Background

v1 spec q-synth-disease-landing (Apr 2026) called for entity-link
joins to drive the per-disease landing page. The implementation
instead shipped naive text matching against hypotheses.disease /
analyses.domain / challenges.domain. As a result, /disease/<slug> was
empty for any disease whose content was tagged with a broad-area label
("neurodegeneration", "neuroinflammation") rather than the
specific disease name: roughly 80% of the catalog, including all 250
non-ND disease landing pages from the 9ca528d3a fan-out.

v1 PR #1386 broadened the live queries with ontology-synonym expansion
as an interim fix. PR #1393 added a one-shot Python backfill script,
which never ran before the 2026-05-13 v1 freeze. The original
Phase 4 spec (q-disease-canonical-tagging_spec.md) targeted
api.py:73829 for the JOIN refactor; with v1 frozen, that path is now
closed.

This spec retargets the durable fix at v2. See
docs/planning/2026-05-10-notebook-disease-recap.md for the full
forensic context (two-design-generation history, the empty join table,
the 250 orphaned dashboards).

Goal

In v2 substrate (SciDEX-AI/SciDEX-Substrate), make disease tagging
on new content automatic and machine-checkable so the per-disease
landing page is built from indexed entity↔MONDO joins, never
free-text fragments.

Acceptance Criteria

☐ Substrate database has a canonical-disease join table
(analogue of v1 entity_disease_canonical) with columns (entity_id,
entity_type, mondo_id, confidence, resolved_at) and indexes on
mondo_id and entity_id.
☐ Substrate database has a disease_ontology_catalog analogue
seeded from MONDO + synonyms (port the existing v1 seed query
or re-pull from the source). At minimum: mondo_id, label,
synonyms (jsonb), vertical, plus the existing v1 cross-walk
columns.
☐ A substrate FastAPI route GET /api/diseases/resolve?q=<text>
returns up to 5 (mondo_id, label, confidence) candidates by
synonym match. Exact label > exact synonym > word-boundary
substring match.
☐ Substrate's hypothesis / analysis / challenge / gap insert hooks
call the resolver on the way in and write the best match to the
join table. No PG triggers: the cleaner pattern is a substrate
service hook, so the resolver lives in application code, not PL/pgSQL.
☐ Substrate's /disease/<slug> (or whatever the canonical v2 path
is) builds every section from a JOIN through the canonical
table on mondo_id. No ILIKE fallbacks; tagging is enforced at
write time so the join is exact.
☐ Substrate agent prompts (whichever path produces hypotheses /
analyses) require a MONDO ID or exact catalog label in the
disease field. The prompt links agents to the resolver
endpoint.
☐ An integration test seeds a non-ND disease (e.g. breast-cancer)
end-to-end and asserts the disease page renders ≥1 row in every
panel from the join, not from text fallback.
☐ A quality-dashboard tile shows
entity_disease_canonical_size and median resolution age.
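
The resolver ranking named in the criteria above (exact label, then exact synonym, then word-boundary substring, capped at five candidates) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the substrate implementation: the in-memory CATALOG, its placeholder MONDO IDs, the numeric confidence tiers (1.0 / 0.9 / 0.6), and the choice to search for catalog terms inside the query text are all hypothetical. Only the resolve signature and the ResolverHit name come from this spec.

```python
import re
from dataclasses import dataclass
from functools import lru_cache

# Hypothetical stand-in for disease_ontology_catalog rows. The IDs are
# placeholders, not real MONDO identifiers.
CATALOG = [
    {"mondo_id": "MONDO:PLACEHOLDER-1", "label": "breast cancer",
     "synonyms": ["breast carcinoma", "mammary cancer"]},
    {"mondo_id": "MONDO:PLACEHOLDER-2", "label": "Alzheimer disease",
     "synonyms": ["Alzheimer's disease", "Alzheimer dementia"]},
]

# Assumed confidence tiers; the spec fixes only their ordering.
EXACT_LABEL, EXACT_SYNONYM, SUBSTRING = 1.0, 0.9, 0.6


@dataclass(frozen=True)
class ResolverHit:
    mondo_id: str
    label: str
    confidence: float


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so matching is case-insensitive."""
    return " ".join(text.lower().split())


@lru_cache(maxsize=1)
def _index() -> tuple:
    """Build (term, mondo_id, label, exact_confidence) rows on first use,
    then serve them from the in-process cache."""
    rows = []
    for entry in CATALOG:
        rows.append((_norm(entry["label"]), entry["mondo_id"],
                     entry["label"], EXACT_LABEL))
        for synonym in entry["synonyms"]:
            rows.append((_norm(synonym), entry["mondo_id"],
                         entry["label"], EXACT_SYNONYM))
    return tuple(rows)


def resolve(text: str, top_k: int = 5) -> list[ResolverHit]:
    query = _norm(text)
    if not query:
        return []
    best: dict[str, ResolverHit] = {}
    for term, mondo_id, label, exact_conf in _index():
        if term == query:
            conf = exact_conf
        # Word-boundary match: the term appears in the query and is not
        # glued to neighbouring word characters.
        elif re.search(rf"(?<!\w){re.escape(term)}(?!\w)", query):
            conf = SUBSTRING
        else:
            continue
        previous = best.get(mondo_id)
        if previous is None or conf > previous.confidence:
            best[mondo_id] = ResolverHit(mondo_id, label, conf)
    ranked = sorted(best.values(), key=lambda h: (-h.confidence, h.label))
    return ranked[:top_k]
```

Keeping one best hit per mondo_id means a disease matched by both its label and a synonym appears once, at its strongest tier. A broad-area label such as "neurodegeneration" returns no candidates against this toy catalog, which is the case the write-time check is meant to surface.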

Approach

  • Resolver service. Implement
    scidex_substrate/atlas/disease_resolver.py with
    resolve(text: str, top_k: int = 5) -> list[ResolverHit]. Build the
    synonym index lazily on first call and cache it in-process.
    Reference implementation: the index-build code in v1's
    scripts/backfill_entity_disease_canonical.py.

  • Insert hooks. Wherever substrate writes a hypothesis /
    analysis / challenge / gap, call the resolver on the
    disease/domain/title/description payload and write the top match
    to the canonical join table in the same transaction. Skip
    substrate triggers / PL/Python; keep the logic in app code.

  • Page refactor. Substrate's disease landing handler joins
    through the canonical table from the start. No synonym-expansion
    fallback; write-time tagging is enforced.

  • Agent prompts. Add a "Disease tagging" paragraph to substrate's
    debate / analysis / hypothesis prompts naming the resolver
    endpoint and the MONDO-label requirement.

  • One-time backfill of inherited v1 rows. When substrate imports
    v1 content (if/when that happens), run the resolver on each
    imported row and populate the join table as part of the import
    pipeline. The v1 script
    scripts/backfill_entity_disease_canonical.py is the
    reference algorithm: port it to substrate Python; do not run
    the v1 version against the v1 DB.
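
The insert-hook and page-refactor steps above can be sketched end to end. This is a hedged illustration, not substrate's actual code: it uses the standard-library sqlite3 module as a stand-in for the substrate database, assumes the join table keeps the v1 name entity_disease_canonical, and uses a hypothetical fake_resolve helper with a placeholder MONDO ID in place of the real resolver. The table layout follows the column list in the acceptance criteria; the hypotheses table and all function names are invented for the example.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: join-table columns come from the acceptance criteria;
# the hypotheses table and index names are illustrative only.
SCHEMA = """
CREATE TABLE hypotheses (id TEXT PRIMARY KEY, title TEXT, disease TEXT);
CREATE TABLE entity_disease_canonical (
    entity_id   TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    mondo_id    TEXT NOT NULL,
    confidence  REAL NOT NULL,
    resolved_at TEXT NOT NULL,
    PRIMARY KEY (entity_id, entity_type)
);
CREATE INDEX idx_edc_mondo_id ON entity_disease_canonical (mondo_id);
CREATE INDEX idx_edc_entity_id ON entity_disease_canonical (entity_id);
"""


def fake_resolve(text):
    """Hypothetical resolver stand-in: (mondo_id, confidence) or None."""
    known = {"breast cancer": ("MONDO:PLACEHOLDER-1", 1.0)}
    return known.get(text.strip().lower())


def insert_hypothesis(conn, hyp_id, title, disease):
    """Write the hypothesis and its disease tag in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO hypotheses (id, title, disease) VALUES (?, ?, ?)",
            (hyp_id, title, disease),
        )
        hit = fake_resolve(disease)
        if hit is not None:
            mondo_id, confidence = hit
            conn.execute(
                "INSERT INTO entity_disease_canonical "
                "(entity_id, entity_type, mondo_id, confidence, resolved_at) "
                "VALUES (?, 'hypothesis', ?, ?, ?)",
                (hyp_id, mondo_id, confidence,
                 datetime.now(timezone.utc).isoformat()),
            )


def hypotheses_for(conn, mondo_id):
    """Landing-page query: an exact join on mondo_id, no ILIKE fallback."""
    return conn.execute(
        "SELECT h.id, h.title "
        "FROM entity_disease_canonical AS c "
        "JOIN hypotheses AS h ON h.id = c.entity_id "
        "WHERE c.entity_type = 'hypothesis' AND c.mondo_id = ? "
        "ORDER BY h.id",
        (mondo_id,),
    ).fetchall()
```

In this sketch an unresolved disease string leaves the entity without a join row, so it never reaches a landing page. Whether substrate should reject such writes outright instead is a policy choice the sketch does not make.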

Dependencies

  • v1 PR #1386: interim Phase 1 broadening (kept in v1's serving
    window, no v2 dependency)
  • v1 PR #1393: script + this spec's predecessor; reference only
  • v2 substrate's hypothesis / analysis / challenge insert paths
    (whatever modules own those writes today)
  • MONDO ontology data source (whatever substrate uses for ontology
    catalogs)

Non-Goals

  • Reviving the 250 v1 disease-landing-<slug> dashboard artifacts.
    They're keyed to v1 entities and an obsolete view_spec_json
    format. Substrate seeds its own dashboards (or omits the duplicate
    layer entirely, since the broadened page already does the
    synthesis).
  • Backfilling v1 entity_disease_canonical. v1 is frozen; the
    Phase 1 broadened queries work well enough for v1's remaining
    serving window.
  • Migrating the 30+ legacy v1 scripts/backfill_*.py files to
    scripts/oneoff/. Out of scope; v2 starts with the new convention.

Work Log

(empty; to be filled in by the substrate implementer)

File: q-disease-canonical-tagging-v2_spec.md
Modified: 2026-05-20 18:04
Size: 5.6 KB