Effort: thorough
Build an immune-receptor repertoire pipeline for the immunology
vertical: ingest TCR/BCR sequencing FASTQ (or a precomputed AIRR
TSV), call clonotypes with MiXCR, compute repertoire diversity
(Shannon, Gini, Hill), match clonotypes against IEDB epitopes via the
new iedb_epitopes tool, and emit a clonotype-to-epitope linkage
artifact a debate can cite. This closes the immunology vertical's biggest
data gap: today SciDEX has no way to argue from actual repertoire data.

Immunology hypotheses ("expanded autoreactive TCRs drive RA flares",
"hospital-acquired SARS-CoV-2 strains evade convalescent BCR
responses") need real repertoire evidence to be debate-grade. Without
a pipeline, the immunology Theorist has nothing to ground claims on
beyond text reviews. This pipeline ingests publicly available AIRR
datasets (10x Genomics, ImmuneSpace, AIRR-DB) and turns them into
SciDEX artifacts the persona pack can argue from.
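
As a rough illustration of the precomputed AIRR TSV path, the sketch below collapses AIRR rearrangement rows into clonotype records using only the standard library. The function name `load_airr_clonotypes`, the output keys, and the choice to filter by `v_call` prefix are assumptions made for illustration, not the module's actual code; `junction_aa`, `v_call`, and `duplicate_count` are standard AIRR Rearrangement fields.

```python
import csv
import io
from collections import Counter


def load_airr_clonotypes(handle, chain="TRB"):
    """Collapse AIRR rearrangement rows into clonotypes keyed by junction_aa.

    `handle` is any text file-like object containing a tab-separated AIRR table.
    Rows whose v_call does not start with `chain` are skipped (an assumption;
    a real loader might use the AIRR `locus` column instead).
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        cdr3 = (row.get("junction_aa") or "").strip()
        v_call = row.get("v_call") or ""
        if not cdr3 or not v_call.startswith(chain):
            continue
        # duplicate_count is optional in AIRR; treat a missing value as one read.
        raw = (row.get("duplicate_count") or "").strip()
        counts[cdr3] += int(raw) if raw else 1

    total = sum(counts.values())
    # Most abundant clonotype first, which is the order a rank plot needs.
    return [
        {"cdr3_aa": cdr3, "count": n, "frequency": n / total}
        for cdr3, n in counts.most_common()
    ]


if __name__ == "__main__":
    demo = (
        "sequence_id\tv_call\tjunction_aa\tduplicate_count\n"
        "s1\tTRBV7-9*01\tCASSLGQAYEQYF\t3\n"
        "s2\tTRBV5-1*01\tCASSPGTGELFF\t1\n"
    )
    for clone in load_airr_clonotypes(io.StringIO(demo)):
        print(clone)
```

Returning plain dicts keeps the result compatible with the "list of dicts" shape the work log describes for parsed MiXCR output, so downstream steps would not need to care which ingest path produced the clones.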
- scidex/forge/immune_repertoire.py (≤700 LoC):
  - ingest(source): accepts FASTQ paths, an AIRR-format TSV,
  - call_clonotypes_mixcr(fastqs): invokes MiXCR via subprocess;
  - diversity_metrics(clones): computes Shannon entropy, Gini,
  - link_to_epitopes(clones): calls tools.iedb_epitopes per clonotype CDR3; returns matches
  - pipeline(source, chain='TRB'): composes; commits artifact
- data/scidex-artifacts/immune_repertoire/<run_id>/
- repertoire_run(run_id PRIMARY KEY, source_kind,
- tools.py registers immune_repertoire_pipeline(source,
  @log_tool_call.
- /artifacts/<id> renders a clonotype-frequency rank plot
- q-vert-vertical-personas-pack) consumes a repertoire_block when a debate's hypothesis names a disease
- tests/test_immune_repertoire.py: synthetic AIRR table

- docs/setup/mixcr.md. Subprocess wrapper handles installed +
- python-Levenshtein (lightweight).
- iedb_epitopes from q-vert-vertical-evidence-providers.
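
The diversity metrics named in the goal (Shannon entropy, Gini coefficient, Hill numbers) can be sketched from raw clone counts as follows. This is a minimal standard-library version written for illustration; the work log describes the real implementation as NumPy-based, and the function name `diversity_from_counts` and the returned keys are hypothetical. The Hill number of order 1 is the exponential of Shannon entropy, and the Hill number of order 2 is the inverse Simpson index.

```python
import math


def diversity_from_counts(counts):
    """Compute repertoire diversity metrics from a list of clone counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        # Empty repertoire: report zeros rather than raising.
        return {"shannon": 0.0, "gini": 0.0, "hill_q1": 0.0, "hill_q2": 0.0}

    freqs = [c / total for c in counts]

    # Shannon entropy in nats: H = -sum(p * ln p).
    shannon = -sum(p * math.log(p) for p in freqs)

    # Gini coefficient via the sorted-rank formula:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with x sorted ascending.
    ordered = sorted(counts)
    n = len(ordered)
    gini = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1)) / (n * total)

    return {
        "shannon": shannon,
        "gini": gini,
        "hill_q1": math.exp(shannon),                  # effective number of clones, order 1
        "hill_q2": 1.0 / sum(p * p for p in freqs),    # inverse Simpson, order 2
    }


if __name__ == "__main__":
    print(diversity_from_counts([97, 1, 1, 1]))   # one dominant clone
    print(diversity_from_counts([25, 25, 25, 25]))  # perfectly even repertoire
```

An even repertoire gives a Gini of zero and Hill numbers equal to the clone count, while a repertoire dominated by one expanded clone pushes Gini toward one and the Hill numbers toward one, which is the contrast a claim about clonal expansion would rest on.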
- q-vert-vertical-personas-pack: immunology-expert consumer.
- data/scidex-artifacts/ submodule.

- scidex/forge/immune_repertoire.py (~480 LoC) with:
  - ingest(source): accepts FASTQ paths, AIRR TSV, or 10X cellranger dir
  - call_clonotypes_mixcr(fastqs, chain): subprocess MiXCR wrapper; AIRR output parsed to list of dicts
  - load_airr_tsv(path): parses AIRR-format TSV clonotypes file
  - diversity_metrics(clones): Shannon entropy, Gini coefficient, Hill numbers (q=1, q=2), Chao1 estimator in pure NumPy
  - link_to_epitopes(clones, chain): Levenshtein matching (distance ≤2 = match, ≤4 = candidate); short-circuits on empty clones to avoid circular import
  - pipeline(source, chain): runs the full pipeline (ingest → MiXCR → diversity → epitope linkage → artifact commit → DB persist); writes clones.tsv, diversity.json, epitope_linkage.csv, metadata.json under data/scidex-artifacts/immune_repertoire/<run_id>/
  - _levenshtein_distance_py(): pure-Python fallback (no C extension required)
  - _persist_run(): writes repertoire_run row with ON CONFLICT upsert
- migrations/021_add_repertoire_run_table.py: repertoire_run table with PK, chain CHECK, indexes
- Added immune_repertoire_pipeline() to scidex/forge/tools.py with @log_tool_call decorator and TOOL_NAME_MAPPING entry
- tests/test_immune_repertoire.py: 23 passing tests covering ingest, diversity metrics (6 cases), AIRR TSV loading, Levenshtein matching logic (5 cases), fallback Levenshtein (6 cases)

iedb_epitopes is imported locally inside link_to_epitopes(), after the empty-clones short-circuit, to avoid triggering forge_tools.py's instrumentation block at import time. This is a pre-existing latent issue in tools.py (the pathway_flux_pipeline instrumentation references a function defined after it); the short-circuit fix is the right solution.
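
The Levenshtein matching described in the work log (distance ≤2 is a match, ≤4 is a candidate, with a pure-Python fallback and a short-circuit on empty input) might look roughly like the sketch below. The names `levenshtein_py`, `classify_match`, and `link_clones` are hypothetical, and `reference_cdr3s` is a stand-in list of strings: the actual return shape of iedb_epitopes is not given in this document, so nothing here should be read as that tool's API.

```python
def levenshtein_py(a, b):
    """Edit distance via two-row dynamic programming (no C extension needed)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


try:
    # Use the C implementation from python-Levenshtein when it is installed.
    from Levenshtein import distance as edit_distance
except ImportError:
    edit_distance = levenshtein_py


def classify_match(cdr3, reference, match_max=2, candidate_max=4):
    """Label a CDR3 pair using the thresholds stated in the work log."""
    dist = edit_distance(cdr3, reference)
    if dist <= match_max:
        return "match", dist
    if dist <= candidate_max:
        return "candidate", dist
    return None, dist


def link_clones(clones, reference_cdr3s):
    """Return linkage rows for every clone/reference pair within threshold."""
    if not clones:
        return []  # short-circuit before any reference lookup would be needed
    links = []
    for clone in clones:
        for reference in reference_cdr3s:
            label, dist = classify_match(clone["cdr3_aa"], reference)
            if label is not None:
                links.append({
                    "cdr3_aa": clone["cdr3_aa"],
                    "reference": reference,
                    "label": label,
                    "distance": dist,
                })
    return links
```

In the real module, the empty-input return is also where the local import of iedb_epitopes would sit just below, so that an empty repertoire never triggers the import at all.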