Machine learning-based identification of C1Q hub genes

active

experiment
Created: 2026-04-06T12:27:51
By: etl-v1-backfill
Quality: 50% ✓
SciDEX ID: exp-9bf7bb8f-2c69-4a0c-880d-2f9603069dd3

🧫 Experiment Protocol

Exploratory · Atherosclerosis · C1QA, C1QC · Human bulk RNA sequencing datasets · proposed

This experiment applied three machine learning algorithms, Gradient Boosting Machine (GBM), LASSO regression, and XGBoost, to identify key C1Q-related hub genes from bulk RNA sequencing data. Seven C1Q-associated differentially expressed genes were first identified from both single-cell and bulk RNA datasets. Across these three complementary approaches, C1QA and C1QC were selected as the most significant hub genes. The researchers then built diagnostic models using generalized linear models and assessed, by receiver operating characteristic (ROC) curve analysis, how well the models distinguish between different types of atherosclerosis.

PRIMARY OUTCOME

Identification of C1QA and C1QC as key hub genes

EXPECTED OUTCOMES

1. Primary: C1QA and C1QC identified as top 2 hub genes with combined importance scores >0.8 across all three ML algorithms
2. Training performance: Combined C1QA+C1QC model achieves AUC >0.85 (95% CI: 0.80-0.90) in training cohort cross-validation
3. Validation performance: Model maintains AUC >0.75 (95% CI: 0.70-0.85) across ≥2 independent validation cohorts
4. Algorithm consistency: C1QA and C1QC rank within top 3 features for ≥2 of 3 machine learning algorithms
5. Clinical utility: Optimal probability threshold achieves sensitivity >80% and specificity >70% for atherosclerosis detection
6. Model stability: <10% variation in AUC across 10-fold cross-validation iterations, indicating robust performance
7. Feature importance: C1QA and C1QC show complementary predictive patterns with correlation coefficient <0.7, supporting dual selection

SUCCESS CRITERIA

- Statistical significance: Model AUC significantly greater than 0.5 (p < 0.001) in both training and validation cohorts
- Clinical threshold: Combined model AUC >0.75 with lower 95% confidence interval >0.65 in primary validation cohort
- Cross-algorithm consistency: Selected hub genes rank in top 50% of importance for ≥2 of 3 machine learning methods
- Model calibration: Hosmer-Lemeshow test p-value >0.05 indicating good calibration between predicted and observed outcomes
- External validation: Model performance maintained across ≥2 independent cohorts with AUC difference <0.1 from training
- Data completeness: ≥80% of samples successfully processed through all analysis phases with <5% missing key variables
- Reproducibility: Results stable across different random seeds and data partitioning strategies with <5% AUC variation

PROTOCOL

**Phase 1: Dataset Preparation and Feature Selection** — Days 1-4
Acquire bulk RNA sequencing datasets for atherosclerosis from the GEO database, comprising training cohorts (GSE100927, GSE28829) and validation cohorts (GSE57691, GSE120521). Download normalized expression matrices and clinical metadata. Merge datasets using ComBat batch effect correction (sva package in R). Remove genes with low variance (coefficient of variation <0.1) or low expression (mean TPM <1). From the single-cell analysis results, extract the 7 C1Q-associated differentially expressed genes identified in previous experiments. Verify gene expression distributions and check for missing values across all datasets. Apply log2 transformation and z-score normalization for machine learning compatibility.
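
The filtering and scaling steps above can be sketched in a few lines. This is a minimal standard-library Python illustration, not the protocol's R pipeline: the function names, the toy expression values, the pseudocount of 1, and the use of the population standard deviation are all assumptions made for the example. If the downloaded matrices are already on a log scale, the log2 step would be skipped.

```python
import math
import statistics

def keep_gene(tpm_values, min_cv=0.1, min_mean_tpm=1.0):
    """Return True if a gene passes both Phase 1 filters (mean TPM and CV)."""
    mean = statistics.fmean(tpm_values)
    if mean < min_mean_tpm:
        return False
    # Coefficient of variation = standard deviation / mean (population SD assumed).
    cv = statistics.pstdev(tpm_values) / mean
    return cv >= min_cv

def log2_zscore(tpm_values, pseudocount=1.0):
    """log2(TPM + pseudocount), then z-score the gene across samples."""
    logged = [math.log2(v + pseudocount) for v in tpm_values]
    mu = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        return [0.0 for _ in logged]
    return [(v - mu) / sd for v in logged]

# Hypothetical toy matrix: gene -> TPM per sample (values are illustrative only).
expression = {
    "C1QA": [12.0, 30.0, 8.0, 45.0],
    "FLAT_GENE": [10.0, 10.1, 9.9, 10.0],  # fails the CV filter
    "LOW_GENE": [0.1, 0.4, 0.2, 0.3],      # fails the mean TPM filter
}
filtered = {g: log2_zscore(v) for g, v in expression.items() if keep_gene(v)}
```

Only the gene passing both thresholds remains in `filtered`, with values centered on zero.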

**Phase 2: Machine Learning Model Development** — Days 5-10
Split the training data into 80% training and 20% internal validation sets using stratified sampling to maintain class balance. Implement three complementary machine learning algorithms, starting from the following settings: 1) Gradient Boosting Machine (GBM) using the gbm package with n.trees=1000, interaction.depth=3, shrinkage=0.01, and cv.folds=10; 2) LASSO regression using the glmnet package with alpha=1 and lambda chosen by 10-fold cross-validation; 3) XGBoost using the xgboost package with nrounds=100, max_depth=6, eta=0.3, and subsample=0.8. For each algorithm, refine these hyperparameters by grid search with 5-fold cross-validation. Extract feature importance scores from each model and rank the 7 C1Q-related genes.
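
The stratified 80/20 split is the step most easily gotten wrong, so a small sketch may help. The following standard-library Python function is an illustration under assumptions (the function name, the seed, and the toy labels are hypothetical); it shuffles within each class and holds out the same fraction of each, which is what "stratified" means here.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Hold out the same fraction of every class.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 30 cases (1) and 20 controls (0).
labels = [1] * 30 + [0] * 20
train_idx, test_idx = stratified_split(labels)
```

Fixing the seed makes the partition reproducible, which matters for the reproducibility criterion that compares AUC across seeds and partitioning strategies.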

**Phase 3: Hub Gene Selection and Model Integration** — Days 11-13
Combine feature importance rankings from all three algorithms using rank aggregation methods (RankAggreg package). Calculate ensemble importance scores as the weighted average of the normalized importance scores from each algorithm (equal weights initially). Select top-ranking genes based on consistency across algorithms: a gene must rank in the top 50% for at least 2 of 3 algorithms. Validate the selection using recursive feature elimination (RFE) with 10-fold cross-validation to identify the optimal gene subset. Focus analysis on C1QA and C1QC as primary hub genes based on combined ranking scores and biological relevance.
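
To make the selection rule concrete, here is a minimal Python sketch of the equal-weight ensemble score and the "top 50% in at least 2 of 3 algorithms" filter. It does not reproduce the RankAggreg package; the importance values are invented for illustration, and `GENE_3` to `GENE_7` are placeholders because the record names only C1QA and C1QC among the seven candidates.

```python
def normalize(scores):
    """Scale one algorithm's importance scores so they sum to 1."""
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

def ranks(scores):
    """Rank genes by descending importance (1 = most important)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {g: r for r, g in enumerate(ordered, start=1)}

def consistent_genes(importances, min_algorithms=2, top_fraction=0.5):
    """Genes ranked in the top fraction by at least min_algorithms models."""
    genes = list(next(iter(importances.values())))
    cutoff = len(genes) * top_fraction
    hits = {g: 0 for g in genes}
    for scores in importances.values():
        for g, r in ranks(scores).items():
            if r <= cutoff:
                hits[g] += 1
    return [g for g in genes if hits[g] >= min_algorithms]

def ensemble_scores(importances):
    """Equal-weight average of normalized importances across algorithms."""
    normed = [normalize(s) for s in importances.values()]
    return {g: sum(n[g] for n in normed) / len(normed) for g in normed[0]}

# Hypothetical importances (absolute LASSO coefficients; zeros = dropped genes).
importances = {
    "gbm": {"C1QA": 0.30, "C1QC": 0.25, "GENE_3": 0.15, "GENE_4": 0.10,
            "GENE_5": 0.08, "GENE_6": 0.07, "GENE_7": 0.05},
    "lasso": {"C1QA": 0.40, "C1QC": 0.35, "GENE_3": 0.0, "GENE_4": 0.15,
              "GENE_5": 0.0, "GENE_6": 0.10, "GENE_7": 0.0},
    "xgboost": {"C1QA": 0.28, "C1QC": 0.35, "GENE_3": 0.10, "GENE_4": 0.12,
                "GENE_5": 0.06, "GENE_6": 0.05, "GENE_7": 0.04},
}
selected = consistent_genes(importances)
ensemble = ensemble_scores(importances)
top_two = sorted(ensemble, key=ensemble.get, reverse=True)[:2]
```

In this toy example the consistency rule admits three genes, and the ensemble score then separates the top two, which shows why the protocol combines both criteria rather than relying on either alone.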

**Phase 4: Diagnostic Model Development and Validation** — Days 14-17
Develop generalized linear models (GLM) using the selected hub genes (C1QA and C1QC) as predictors for atherosclerosis classification. Create multiple model variants: 1) individual gene models (C1QA only, C1QC only), 2) a combined gene model (C1QA + C1QC), and 3) an extended model including interaction terms. Use the binomial family with a logit link function for binary classification. Train models on the training partition and validate on the held-out internal validation set. Perform 10-fold cross-validation to assess model stability and calculate confidence intervals for performance metrics.
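
A binomial GLM with a logit link is ordinary logistic regression. The sketch below fits the combined two-gene model by plain gradient ascent on the log-likelihood in standard-library Python, purely to show what the model computes; it is not the fitting routine an R `glm` call would use, and the z-scored expression values, learning rate, and iteration count are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """P(case) under the logit model; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit intercept and coefficients by batch gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)  # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical z-scored (C1QA, C1QC) values; 1 = case, 0 = control.
X = [(-1.2, -0.9), (-0.8, -1.1), (-0.3, 0.2), (0.1, -0.4), (0.5, 0.4),
     (0.4, 0.6), (0.9, 0.3), (1.1, 1.2), (-0.2, 0.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(X, y)
p_high = predict_proba(weights, (1.1, 1.2))
p_low = predict_proba(weights, (-1.2, -0.9))
```

The single-gene variants correspond to passing one column instead of two, and the extended variant to appending the product of the two columns as a third feature.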

**Phase 5: ROC Analysis and External Validation** — Days 18-20
Generate receiver operating characteristic (ROC) curves for all model variants using the pROC package in R. Calculate area under the curve (AUC), sensitivity, specificity, positive predictive value, and negative predictive value with 95% confidence intervals using bootstrapping (n=2000 iterations). Determine optimal probability thresholds using Youden's J statistic. Validate the final models on external validation cohorts, ensuring no overlap with training data. Perform calibration analysis using the Hosmer-Lemeshow test and calibration plots to assess prediction reliability. Compare model performance between training and validation cohorts using DeLong's test.
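
The three core quantities in this phase (AUC, the Youden-optimal threshold, and a bootstrap confidence interval) have simple definitions that the following standard-library Python sketch spells out. It is an illustration rather than a substitute for pROC: the percentile bootstrap shown here is one common choice and an assumption on my part, and the predicted probabilities and labels are invented.

```python
import random

def auc(scores, labels):
    """AUC = P(random case scores above random control); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling samples with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples missing one class
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical predicted probabilities and true labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.55, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model_auc = auc(scores, labels)
threshold, j_stat = youden_threshold(scores, labels)
ci_low, ci_high = bootstrap_auc_ci(scores, labels)
```

With cohorts this small the bootstrap interval is wide, which is worth keeping in mind when reading the confidence-interval targets listed in the success criteria.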

**Phase 6: Model Performance Assessment and Clinical Utility** — Days 21-23
Evaluate clinical utility using decision curve analysis to determine net benefit across probability thresholds. Calculate the number needed to diagnose (NND) and likelihood ratios for positive and negative results. Perform subgroup analyses stratified by age, sex, and atherosclerosis severity where metadata are available. Generate prediction nomograms for clinical application using the rms package. Conduct sensitivity analyses by testing model performance at different probability cutoffs and evaluating robustness to missing data. Create comprehensive performance reports with forest plots showing AUC values and confidence intervals across all validation cohorts.
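
The clinical-utility metrics named above reduce to arithmetic on the confusion matrix. As a hedged sketch in standard-library Python, with hypothetical function names and invented scores: net benefit at threshold probability pt is TP/n minus FP/n weighted by pt/(1 - pt), the likelihood ratios are LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity, and NND is the reciprocal of Youden's J.

```python
def confusion(scores, labels, threshold):
    """Counts (tp, fp, fn, tn) when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, fp, fn, tn

def net_benefit(scores, labels, pt):
    """Decision-curve net benefit at threshold probability pt (0 < pt < 1)."""
    tp, fp, _, _ = confusion(scores, labels, pt)
    n = len(labels)
    return tp / n - fp / n * pt / (1.0 - pt)

def diagnostic_summary(scores, labels, threshold):
    """Sensitivity, specificity, likelihood ratios, and NND at one cutoff."""
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    youden = sens + spec - 1.0
    return {
        "sensitivity": sens,
        "specificity": spec,
        "LR+": sens / (1.0 - spec) if spec < 1.0 else float("inf"),
        "LR-": (1.0 - sens) / spec if spec > 0.0 else float("inf"),
        "NND": 1.0 / youden if youden > 0.0 else float("inf"),
    }

# Hypothetical predicted probabilities and true labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.55, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

summary = diagnostic_summary(scores, labels, threshold=0.5)
nb = net_benefit(scores, labels, pt=0.5)
```

A decision curve is obtained by evaluating `net_benefit` over a grid of `pt` values and comparing it with the treat-all and treat-none strategies.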

LINKED HYPOTHESES

Source: PMID 38179058

🧫 Experiment Extras

PATHWAY

Complement signaling pathway

MARKET PRICE

$0.50

STATUS

proposed

Metadata

origin_type	v1_polymorphic_backfill
source_table	experiments
_schema_version	1

📊 Evidence Profile

Evidence Balance: +0%
Certainty: 0%
Debates: 0
Incoming: 0
Outgoing: 0
0 supporting, 0 contradicting, 0 neutral
