Effort: deep
Background
v1 spec q-synth-disease-landing (Apr 2026) called for entity-link
joins to drive the per-disease landing page. The implementation
shipped naive text matching against hypotheses.disease /
analyses.domain / challenges.domain. Result: /disease/<slug> was
empty for any disease whose content was tagged with a broad-area label
("neurodegeneration", "neuroinflammation") rather than the
specific disease name — i.e. ~80% of the catalog including all 250
non-ND disease landing pages from the 9ca528d3a fan-out.
v1 PR #1386 broadened the live queries with ontology-synonym expansion
as an interim fix. PR #1393 added a one-shot Python backfill script,
which never ran before the 2026-05-13 v1 freeze. The original
Phase 4 spec (q-disease-canonical-tagging_spec.md) targeted
api.py:73829 for the JOIN refactor — that path is now closed.
This spec retargets the durable fix at v2. See
docs/planning/2026-05-10-notebook-disease-recap.md for the full
forensic context (two-design-generation history, the empty join table,
the 250 orphaned dashboards).
Goal
In v2 substrate (SciDEX-AI/SciDEX-Substrate), make disease tagging
on new content automatic and machine-checkable so the per-disease
landing page is built from indexed entity↔MONDO joins, never
free-text fragments.
Acceptance Criteria
☐ Substrate database has a canonical-disease join table (analogue of
v1 entity_disease_canonical) with (entity_id, entity_type, mondo_id,
confidence, resolved_at) and indexes on mondo_id and entity_id.
☐ Substrate database has a disease_ontology_catalog analogue
seeded from MONDO + synonyms (port the existing v1 seed query
or re-pull from the source). At minimum: mondo_id, label,
synonyms (jsonb), vertical, plus the existing v1 cross-walk columns.
☐ A substrate FastAPI route GET /api/diseases/resolve?q=<text>
returns up to 5 (mondo_id, label, confidence) candidates by
synonym match, ranked in this order: exact label > exact synonym >
word-boundary substring match.
☐ Substrate's hypothesis / analysis / challenge / gap insert hooks
call the resolver on the way in and write the best match to the
join table. No PG triggers: the cleaner pattern is a substrate
service hook, so the resolver lives in application code, not PL/pgSQL.
☐ Substrate's /disease/<slug> (or whatever the canonical v2 path
is) builds every section from a JOIN through the canonical
table on mondo_id. No ILIKE fallbacks; tagging is enforced at
write time so the join is exact.
☐ Substrate agent prompts (whichever path produces hypotheses /
analyses) require a MONDO ID or exact catalog label in the
disease field. The prompt links agents to the resolver
endpoint.
☐ An integration test seeds a non-ND disease (e.g. breast-cancer)
end-to-end and asserts the disease page renders ≥1 row in every
panel from the join, not from text fallback.
☐ A quality-dashboard tile shows entity_disease_canonical_size and
median resolution age.
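
As a rough illustration of the resolver ranking in the criteria above
(exact label, then exact synonym, then word-boundary substring), here
is a minimal Python sketch. The catalog rows, the placeholder MONDO
IDs, and the numeric confidence tiers are assumptions made for the
example; the spec fixes only the ordering and the top-5 cap. The
substring tier is read here as "a catalog name appears as whole words
inside the query text", which is one plausible interpretation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolverHit:
    mondo_id: str
    label: str
    confidence: float


# Hypothetical catalog rows: (mondo_id, label, synonyms).
# The IDs are placeholders for illustration, not verified MONDO identifiers.
CATALOG = [
    ("MONDO:EXAMPLE-1", "breast cancer", ["breast carcinoma", "mammary cancer"]),
    ("MONDO:EXAMPLE-2", "Alzheimer disease", ["Alzheimer's disease"]),
]

# Assumed confidence per tier; only the relative order comes from the spec.
EXACT_LABEL, EXACT_SYNONYM, SUBSTRING = 1.0, 0.9, 0.6


def _norm(s):
    """Lowercase and collapse whitespace so comparisons are case-insensitive."""
    return " ".join(s.lower().split())


def _contains_word(needle, haystack):
    """True if needle occurs in haystack delimited by non-word characters."""
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return re.search(pattern, haystack) is not None


def resolve(text, top_k=5, catalog=CATALOG):
    """Return up to top_k hits, best tier first."""
    q = _norm(text)
    if not q:
        return []
    hits = []
    for mondo_id, label, synonyms in catalog:
        names = [_norm(label)] + [_norm(s) for s in synonyms]
        if q == names[0]:
            score = EXACT_LABEL
        elif q in names[1:]:
            score = EXACT_SYNONYM
        elif any(_contains_word(n, q) for n in names):
            score = SUBSTRING
        else:
            continue
        hits.append(ResolverHit(mondo_id, label, score))
    # Highest confidence first; label breaks ties deterministically.
    hits.sort(key=lambda h: (-h.confidence, h.label))
    return hits[:top_k]
```

Under these assumptions a broad-area label such as "neurodegeneration"
resolves to nothing, which is the case write-time enforcement is meant
to surface rather than paper over with text matching.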
Approach
Resolver service. Implement
scidex_substrate/atlas/disease_resolver.py with
resolve(text: str, top_k: int = 5) -> list[ResolverHit]. Build the
synonym index lazily on first call, cache in-process. Reference
implementation: the index-build code in v1's
scripts/backfill_entity_disease_canonical.py.
Insert hooks. Wherever substrate writes a hypothesis /
analysis / challenge / gap, call the resolver on the
disease/domain/title/description payload and write the top match
to the canonical join table in the same transaction. Skip
substrate triggers / PL/Python — keep the logic in app code.
Page refactor. Substrate's disease landing handler joins
through the canonical table from the outset. No synonym-expansion
fallback; write-time tagging is enforced.
Agent prompts. Add a "Disease tagging" paragraph to substrate's
debate / analysis / hypothesis prompts naming the resolver
endpoint and the MONDO-label requirement.
One-time backfill of inherited v1 rows. When substrate imports
v1 content (if/when that happens), run the resolver on each
imported row and populate the join table as part of the import
pipeline. The v1 script scripts/backfill_entity_disease_canonical.py
is the reference algorithm; port it to substrate Python rather than
running the v1 version against the v1 DB.
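
The insert-hook and page-refactor steps above can be sketched together
as follows. This is a sketch under stated assumptions, not the
substrate implementation: it uses an in-memory sqlite3 database as a
stand-in for the substrate database, the hypotheses table layout and
stub_resolver are hypothetical, the MONDO ID is a placeholder, and
rejecting an unresolvable disease field is only one possible reading
of "tagging is enforced at write time".

```python
import sqlite3
from datetime import datetime, timezone

# Join-table columns follow the acceptance criteria; the hypotheses
# table layout is a hypothetical stand-in.
SCHEMA = """
CREATE TABLE hypotheses (id INTEGER PRIMARY KEY, title TEXT NOT NULL, disease TEXT);
CREATE TABLE entity_disease_canonical (
    entity_id   INTEGER NOT NULL,
    entity_type TEXT    NOT NULL,
    mondo_id    TEXT    NOT NULL,
    confidence  REAL    NOT NULL,
    resolved_at TEXT    NOT NULL
);
CREATE INDEX idx_edc_mondo  ON entity_disease_canonical (mondo_id);
CREATE INDEX idx_edc_entity ON entity_disease_canonical (entity_id);
"""


def stub_resolver(text):
    """Stand-in for the real resolver: returns (mondo_id, confidence) or None."""
    known = {"breast cancer": ("MONDO:EXAMPLE-1", 1.0)}  # placeholder ID
    return known.get(text.strip().lower())


def insert_hypothesis(conn, title, disease, resolver=stub_resolver):
    """Insert a hypothesis and its canonical disease tag in one transaction."""
    hit = resolver(disease)
    if hit is None:
        # Assumed policy: refuse the write instead of storing an untagged row.
        raise ValueError(f"unresolvable disease field: {disease!r}")
    with conn:  # commits on success, rolls back if either INSERT raises
        cur = conn.execute(
            "INSERT INTO hypotheses (title, disease) VALUES (?, ?)",
            (title, disease),
        )
        conn.execute(
            "INSERT INTO entity_disease_canonical VALUES (?, 'hypothesis', ?, ?, ?)",
            (cur.lastrowid, hit[0], hit[1], datetime.now(timezone.utc).isoformat()),
        )
    return cur.lastrowid


def hypotheses_for_disease(conn, mondo_id):
    """Landing-page panel query: an exact JOIN on mondo_id, no text matching."""
    return conn.execute(
        """SELECT h.id, h.title
           FROM hypotheses h
           JOIN entity_disease_canonical c
             ON c.entity_id = h.id AND c.entity_type = 'hypothesis'
           WHERE c.mondo_id = ?
           ORDER BY h.id""",
        (mondo_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Because the tag row is written in the same transaction as the entity,
the page query never sees a hypothesis without its join row, which is
what lets the handler drop the ILIKE and synonym-expansion fallbacks.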
Dependencies
- v1 PR #1386 — interim Phase 1 broadening (kept in v1's serving
window, no v2 dependency)
- v1 PR #1393 — script + this spec's predecessor; reference only
- v2 substrate's hypothesis / analysis / challenge insert paths
(whatever modules own those writes today)
- MONDO ontology data source (whatever substrate uses for ontology
catalogs)
Non-Goals
- Reviving the 250 v1 disease-landing-<slug> dashboard artifacts.
They're keyed to v1 entities and an obsolete view_spec_json format.
Substrate seeds its own dashboards (or omits the duplicate layer
entirely, since the broadened page already does the synthesis).
- Backfilling v1 entity_disease_canonical. v1 is frozen; the
Phase 1 broadened queries work well enough for v1's remaining
serving window.
- Migrating the 30+ legacy v1 scripts/backfill_*.py files to
scripts/oneoff/. Out of scope; v2 starts with the new convention.
Work Log
(empty — to be filled in by the substrate implementer)