Benchmarking omics-based prediction of asthma development in children

Wang, Xu-Wen; Wang, Tong; Schaub, Darius P.; Chen, Can; Sun, Zheng; Ke, Shanlin; Hecker, Julian; Maaser-Hecker, Anna; Zeleznik, Oana A.; Zeleznik, Roman; Litonjua, Augusto A.; DeMeo, Dawn L.; Lasky-Su, Jessica; Silverman, Edwin K.; Liu, Yang-Yu; Weiss, Scott T.

doi:10.1186/s12931-023-02368-8

Research
Open access
Published: 26 February 2023

Respiratory Research volume 24, Article number: 63 (2023)

Abstract

Background

Asthma is a heterogeneous disease with high morbidity. Advancement in high-throughput multi-omics approaches has enabled the collection of molecular assessments at different layers, providing a complementary perspective of complex diseases. Numerous computational methods have been developed for the omics-based patient classification or disease outcome prediction. Yet, a systematic benchmarking of those methods using various combinations of omics data for the prediction of asthma development is still lacking.

Objective

We aimed to investigate the computational methods in disease status prediction using multi-omics data.

Method

We systematically benchmarked 18 computational methods using all the 63 combinations of six omics data (GWAS, miRNA, mRNA, microbiome, metabolome, DNA methylation) collected in The Vitamin D Antenatal Asthma Reduction Trial (VDAART) cohort. We evaluated each method using standard performance metrics for each of the 63 omics combinations.

Results

Our results indicate that overall Logistic Regression, Multi-Layer Perceptron, and MOGONET display superior performance, and the combination of transcriptional, genomic and microbiome data achieves the best prediction. Moreover, we find that including the clinical data can further improve the prediction performance for some but not all the omics combinations.

Conclusions

Specific omics combinations can reach the optimal prediction of asthma development in children, and certain computational methods outperformed the others.

Background

Asthma is a chronic condition characterized by wheezing, coughing and reversible airflow obstruction [1]. The global prevalence, morbidity, mortality, and economic burden associated with asthma have been increasing in the past decades [2]. Advances in high-throughput sequencing technologies have made molecular assessments available at the genome, epigenome, transcriptome, proteome, metabolome, and microbiome levels, providing the potential for a comprehensive understanding of human health and diseases [3,4,5,6]. Prediction of disease status, including asthma, is critical for understanding the etiology of the disease, discovering molecular biomarkers and subsequently identifying suitable interventions. Integrated approaches that combine multi-omics data from different biological layers might improve our ability to bridge the gap from genotype to phenotype [7,8,9,10].

Numerous computational methods have been developed to classify patients using their single- or multi-omics data. For example, ensemble-based methods, random forest, and gradient boost decision trees have shown superior performance over only using single-omics data or by directly concatenating the features from different omics data types for multi-omics classification tasks [11,12,13]. Moreover, several deep learning-based methods have been proposed for the classification in biomedical applications, generating higher performance than existing supervised multi-omics integration methods in various classification tasks [14, 15]. However, benchmarking those computational methods using various combinations of omics data for the disease status prediction has not been studied before. Note that for the disease status prediction, the omics data were collected before the disease onset, which is fundamentally different from the patient classification problem where the omics data were collected after the disease onset.

Here, we compared different disease status prediction methods (using standard performance metrics) on six different types of omics data collected in The Vitamin D Antenatal Asthma Reduction Trial (VDAART) cohort [16]. Our aim is to identify the best prediction method and the best combination of omics data for the prediction of asthma development (see Fig. 1). Our results indicate that Logistic Regression, Multi-Layer Perceptron, and Graph Neural Network-based method MOGONET display superior performance and the combinations of transcriptional, genomic and microbiome data can yield the best prediction of asthma development. Moreover, we found that including the clinical covariates can further improve the prediction performance for some (but not all) omics combinations.

Methods

VDAART cohort

VDAART is a clinical trial designed to test the hypothesis that vitamin D supplementation in pregnant women will prevent the development of asthma and allergies in their children [17, 18]. Pregnant women between 18 and 40 years of age and at an estimated gestational age between 10 and 18 weeks were recruited at three clinical centers: Boston Medical Center, Washington University at Saint Louis, and Kaiser Permanente Southern California Region. In the VDAART study, six types of omics data of the children have been collected: (1) GWAS: genome-wide SNP genotyping data and genome-wide association study analysis results. Genotyping of children in VDAART was performed on the Illumina Infinium HumanOmniExpressExome BeadChip, and SNP genotypes were called using the Illumina GenCall software. (2) Child miRNA (cord blood) and (3) child mRNA transcriptomics (cord blood). Total RNA was isolated from samples with the Qiagen miRNeasy Serum/Plasma extraction kit and QIAcube automation. Small RNA sequencing libraries were prepared using the Norgen Biotek Small RNA Library Prep Kit and then sequenced on the Illumina NextSeq 500 platform as 51 bp single-end reads. (4) Child microbiome at 3–6 months. DNA extractions were performed on stool samples, and the bacterial 16S rRNA gene (V3 to V5 hypervariable regions) was amplified. (5) Child metabolomics at 1 year. Nontargeted global metabolomic profiles were generated at Metabolon Inc. by using ultra-performance liquid chromatography–tandem mass spectrometry (UPLC-MS/MS). (6) Child DNA methylation data (cord blood). DNA was extracted from cord blood and peripheral blood using the Qiagen Puregene Kit (Valencia, CA, USA) and bisulfite converted using the EZ DNA Methylation-Gold Kit (Zymo Research, Irvine, CA, USA). We randomized samples by chips and plates and generated DNA methylation data using the Infinium HumanMethylation450 BeadChip (Illumina, San Diego, CA, USA).

Among the 748 child participants in VDAART, 102 participants (13.6%) have all six types of omics data available. Among the six omics data types, GWAS data has the largest sample size (see Fig. 2). Postnatally, questionnaires were administered to the mother by telephone every 3 months up to the child’s third birthday, inquiring about the health of the infant and child, especially the occurrence of wheezing illnesses and asthma and allergy symptoms and diagnoses. Yearly in-person visits for the child collected questionnaire data, anthropometric measurements, and blood. Here, we applied various machine learning models to predict the children’s asthma status at year 3 using the six omics data types collected at/before year 1. Assessment of asthma was based on a doctor’s diagnosis, defined as a positive response to a direct question to the mother at any time in the first three years of the life of the child. As recent symptoms may help identify young children with significant asthma [19], a more specific definition of doctor’s diagnosis plus symptoms and medication use in the past was used. In addition, the following were also collected in the VDAART study: vitamin D levels in blood of both the mother (through measurement of 25(OH)D levels in cord blood at delivery) and the child (at year 1); and other relevant covariates, e.g., maternal asthma, race and clinical center (see Table 1 for characteristics).

Table 1 Key Characteristics of VDAART subjects used in benchmarking asthma development prediction

Prediction methods and performance evaluation

We leveraged several classical classifiers in scikit-learn [20], i.e., k-Nearest Neighbors (KNN), Logistic Regression (LR), LRCV (Logistic Regression with cross-validator), Random Forest (RF), Multi-Layer Perceptron (MLP) and Gradient Boosting. We also considered two state-of-the-art deep learning methods: MOGONET [14] and TabNet [21]. In addition, we evaluated LR-VAE (Variational AutoEncoder) and LRCV-VAE, compressing the input dimension of miRNA, mRNA, microbiome, metabolomics and DNA methylation data to 5 via the variational autoencoder, which has been heavily used in dimension reduction for biological data [22, 23] (see Table 2 for the list of prediction methods). To compare the performance of different methods on prediction of asthma status, we first split the subjects into two groups for the following evaluation purposes: (1) Hold-out validation: among the 102 subjects that have all six omics data types available, we randomly chose 16 cases, then randomly selected 16 controls whose race and clinical center match each case. (2) Cross-validation: fivefold cross-validation was used to evaluate the performance of each classification method on the remaining subjects (300 in total). To evaluate the performance of each method, we used the standard classification performance metrics: (1) Accuracy; (2) F1-score; (3) AUROC: Area Under the Receiver Operating Characteristic (ROC) curve and (4) AUPRC: Area Under the Precision-Recall Curve (PRC).
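The benchmark design described above can be sketched in a few lines of plain Python. The sketch below is illustrative only and is not the authors' code: the helper names (`omics_combinations`, `accuracy_and_f1`, `auroc`) are hypothetical, AUPRC is omitted for brevity, and the metrics are written out by hand so that the definitions are visible. It shows why six omics layers give 63 combinations (every non-empty subset) and how three of the four metrics are computed for binary labels.

```python
from itertools import combinations

# Layer names follow the six omics types listed in the text.
OMICS = ["GWAS", "miRNA", "mRNA", "microbiome", "metabolome", "methylation"]


def omics_combinations(layers):
    """Return every non-empty subset of the layers, smallest subsets first."""
    return [c for k in range(1, len(layers) + 1) for c in combinations(layers, k)]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels (1 = asthma case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def auroc(y_true, scores):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


combos = omics_combinations(OMICS)
print(len(combos))  # 2**6 - 1 = 63
```

In practice, each of the 63 combinations would be passed to every classifier within each cross-validation fold, and the metrics would be averaged over the folds.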

Table 2 Prediction models for asthma development

Feature selection

Omics data is typically high-dimensional in the sense that the number of features is significantly larger than the number of samples [24, 25]. Feature selection can filter out irrelevant and redundant features by identifying a subset of relevant features [26]. Moreover, using fewer features as inputs to machine learning models reduces the risk of over-fitting. Numerous methods can be used for feature selection, e.g., univariate statistical testing, feature variance, Random Forest importance ranking, and information-theoretic measures [15, 27]. Here, we used the Wilcoxon rank-sum test on cross-validation subjects to identify the key features of count data, including miRNA, mRNA and microbiome data, due to its solid False Discovery Rate (FDR) control and good power [28] (see Additional file 1: sec.2 for details of the statistical analysis). For each of those data types, the top 300 features with the lowest p-values were selected, so that the number of features is comparable to the number of subjects (249 healthy controls and 83 asthmatic cases). For continuous metabolomics and methylation data, we used the feature variance to identify the top 300 features with the largest variance across subjects [29]. We reduced the genetic data to 4 polygenic scores (PGS) computed from previous work [30] and 2 SNPs (rs4795399 and rs117097909) in the established 17q21 locus [31].
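The two filters described above can be illustrated with a small standard-library sketch. It is a simplified stand-in rather than the authors' pipeline: the function names are hypothetical, the rank-sum p-value uses the normal approximation without tie or continuity correction, and the toy data in the usage note selects the top 2 features instead of the top 300.

```python
import math
from statistics import pvariance


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_p_value(cases, controls):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(cases), len(controls)
    ranks = average_ranks(list(cases) + list(controls))
    w = sum(ranks[:n1])  # rank sum of the case group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def top_k_by_rank_sum(features, labels, k):
    """features: dict of name -> per-subject values; labels: 1 = case, 0 = control."""
    p_values = {}
    for name, values in features.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        p_values[name] = rank_sum_p_value(cases, controls)
    return sorted(p_values, key=p_values.get)[:k]


def top_k_by_variance(features, k):
    """Keep the k features with the largest variance across subjects."""
    return sorted(features, key=lambda n: pvariance(features[n]), reverse=True)[:k]
```

For example, with `labels = [1, 1, 1, 0, 0, 0]` and `features = {"a": [5, 6, 7, 1, 2, 3], "b": [1, 5, 2, 6, 3, 4], "c": [2, 2, 2, 2, 2, 2]}`, both `top_k_by_rank_sum(features, labels, 2)` and `top_k_by_variance(features, 2)` keep features `a` and `b`. To avoid information leakage, the label-dependent rank-sum filter should only see the cross-validation subjects, as stated in the text.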

Omics data imputation

Since not all six omics data types are available for each subject, we performed data imputation first so that the evaluation of each prediction method was performed on the same set of subjects, enabling us to systematically examine the capability of each omics data type in the prediction of asthma development. To keep as many omics data types unimputed as possible while maximizing the number of subjects, we selected the subjects for whom the following three omics data types were all available: GWAS, DNA methylation and the microbiome. Then, we imputed the miRNA, mRNA and metabolomics data using each of the following three methods: (1) median imputation: the missing value of a feature is replaced with the median value of the other samples. (2) TOBMI [32] (trans-omics block missing data imputation): the missing data of a subject in one omics data type is the weighted combination of the k nearest neighbors identified from another omics data type. Here, the missing values of miRNA and mRNA were imputed using a k-nearest neighbors (KNN) weighted method, where the gene expression of a subject with missing data is the weighted combination of the k nearest neighbors identified using the DNA methylation data. We applied the same idea to impute the metabolomics data using the microbiome data; in that case, the distance matrix was constructed from the microbiome data. (3) missForest [33]: an iterative imputation method based on random forests. Of the selected subjects, 66% were missing one omics data type, 28% were missing two omics data types, and only 5% were missing all three omics data types. Note that imputing the missing data on the original omics data is computationally very expensive, so we performed the imputation after the feature selection. We emphasize that the imputation here is subject based, in the sense that the entire omics profile of some subjects was missing, rather than only a few features within an omics profile. Therefore, some traditional imputation methods, e.g., feature-wise k-nearest neighbors, cannot be directly utilized.
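To make the block-wise ("subject based") setting concrete, the following sketch imputes an entire missing omics block for a subject. It is a simplified illustration under stated assumptions, not the TOBMI implementation: it uses Euclidean distance and inverse-distance weights on a reference omics layer, takes a fixed `k`, and all function names are hypothetical. A subject whose target block is entirely missing is represented by `None`.

```python
import math
from statistics import median


def median_impute(rows):
    """Fill each missing block (None) with the per-feature median of observed blocks."""
    observed = [r for r in rows if r is not None]
    fill = [median(col) for col in zip(*observed)]
    return [list(r) if r is not None else list(fill) for r in rows]


def knn_block_impute(target_rows, reference_rows, k=2):
    """Fill missing target blocks from the k nearest donors in a reference omics.

    Neighbors are found in the reference layer (e.g. DNA methylation), which
    must be observed for every subject; the imputed block is the
    inverse-distance weighted mean of the donors' target blocks.
    """
    donors = [i for i, r in enumerate(target_rows) if r is not None]
    imputed = []
    for i, row in enumerate(target_rows):
        if row is not None:
            imputed.append(list(row))
            continue
        nearest = sorted(
            donors, key=lambda j: math.dist(reference_rows[i], reference_rows[j])
        )[:k]
        # The small constant avoids division by zero for identical profiles.
        weights = [
            1.0 / (math.dist(reference_rows[i], reference_rows[j]) + 1e-9)
            for j in nearest
        ]
        total = sum(weights)
        n_features = len(target_rows[nearest[0]])
        imputed.append([
            sum(w * target_rows[j][f] for w, j in zip(weights, nearest)) / total
            for f in range(n_features)
        ])
    return imputed
```

With `target = [[1.0, 10.0], None, [3.0, 30.0]]` and `reference = [[0.0], [1.0], [2.0]]`, the second subject is equally close to both donors, so both functions fill its block with `[2.0, 20.0]`.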

Results

Healthy and asthmatic children show differences in their multi-omics profiles

We first examined the differences in the imputed multi-omics profiles between the healthy controls (\(n=249\)) and asthmatic cases (\(n=83\)). We found a significant difference between the distributions of the healthy and asthmatic groups using the t-SNE visualization (permutational multivariate analysis of variance (PERMANOVA), \(P<0.05\)), regardless of the imputation method (see Fig. 3).

There are four consistently high-performing methods in the cross-validations

Among all tested methods in fivefold cross-validations, we found that LR, LRCV, MLP and MOGONET show relatively higher performance over all four types of evaluation metrics (imputed using the median). For example, the highest Accuracy, F1, AUROC and AUPRC of LRCV are 0.92, 0.8, 0.96 and 0.89 among fivefold cross-validations. MOGONET is a novel multi-omics integrative method that jointly explores omics-specific learning and cross-omics correlation learning based on Graph Convolutional Networks (GCN) showing similar performance to LRCV (see Fig. 4; Additional file 1: Fig. S1). In particular, we found that the performance of those top-ranking methods is robust to different imputation methods (see Additional file 1: Fig. S2, S3 for missForest and TOBMI imputation). Higher performance of those four methods implies that prediction of children’s asthma development through leveraging the rich information in multi-omics is feasible.

Transcriptional and genomic data are critical for asthma prediction

Figure 4 shows the predictive performance of each prediction method across all possible combinations of six omics data types. We observed that the prediction performance largely depends on the omics used. To examine the importance of different omics combinations for children’s asthma status prediction, we ranked the 63 combinations of the six omics data types based on their median performance across all prediction methods. Interestingly, we found a consistent omics importance ranking over the four evaluation metrics: mRNA alone, and combinations of GWAS, miRNA and mRNA, achieve the highest performance. In particular, mRNA alone ranks highest for Accuracy, AUROC and AUPRC (see Additional file 1: Fig. S4). Furthermore, we measured the importance of each feature (such as a gene, mRNA, or miRNA) using MOGONET. Since MOGONET with the omics combination of genome, miRNA, and mRNA data yields the overall best performance, we selected this omics combination and computed the importance of each feature in MOGONET as the decrease in performance (e.g., F1 score) after the feature is removed. We found that biomarkers (i.e., features with high importance scores) identified by MOGONET have also shown associations with asthma (see Additional file 1: Table S1). For example, hsa-miR-581, a microRNA downregulated in severe asthma, is associated with forced expiratory volume in 1 s (FEV1) and immune inflammation [34]. In addition, hsa-miR-376c-3p, hsa-miR-374b-5p, hsa-miR-374c-5p and others are circulating microRNAs associated with lung function in asthma [35]. When compared to healthy controls, bronchial smooth muscle cells from asthmatic patients express different levels of hsa-miR-376a-3p and hsa-miR-330-5p [36]. ENSG00000267174 is a long noncoding RNA (lncRNA), and many lncRNAs have been shown to be associated with asthma severity or inflammatory phenotype [37]. ENSG00000004139 can regulate cell survival and cytokine release after inflammasome activation [38].

Again, we found that those top-ranking omics combinations are quite robust to different imputation methods. These results suggest that accurate prediction of asthma development in children does not require collecting as many omics data types as possible. Instead, combining transcriptional with genomic data can yield superior performance for predicting asthma development at year 3.
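The idea of scoring a feature by the performance lost when it is removed can be sketched generically as follows. This is a minimal illustration, not MOGONET's implementation: `ablation_importance` is a hypothetical helper, "removal" is assumed to mean setting the feature to zero for all subjects, and `evaluate` stands for any function that returns a performance score (e.g., F1) of an already trained model on the given data.

```python
def ablation_importance(evaluate, X, y, feature_names):
    """Importance of each feature = baseline score minus score with it zeroed out.

    evaluate(X, y) must return a single performance value (higher is better)
    for a model that has already been trained; X is a list of feature lists.
    """
    baseline = evaluate(X, y)
    importance = {}
    for idx, name in enumerate(feature_names):
        # Copy the data with feature `idx` set to zero for every subject.
        X_masked = [row[:idx] + [0.0] + row[idx + 1:] for row in X]
        importance[name] = baseline - evaluate(X_masked, y)
    return importance


# Toy stand-in for a trained classifier: predicts a case when feature 0 > 0.5.
def toy_accuracy(X, y):
    predictions = [1 if row[0] > 0.5 else 0 for row in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


X = [[1.0, 3.0], [0.9, 1.0], [0.1, 2.0], [0.0, 5.0]]
y = [1, 1, 0, 0]
scores = ablation_importance(toy_accuracy, X, y, ["f0", "f1"])
print(scores)  # {'f0': 0.5, 'f1': 0.0}
```

Features are then ranked by their importance score; in the toy example only `f0` matters, because the stand-in classifier ignores `f1`.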

Different imputation methods produce a similar performance

Although multi-omics analysis can provide the connections between biomolecules from different layers of omics data, one of the key challenges in multi-omics approaches is missing values within and across the omics data. Missing values across omics are a particular concern as they result in different sample sizes among the omics, which requires imputation for downstream analyses, e.g., classification. We compared the prediction performance of each prediction method using all 63 omics combinations imputed with the three different methods, and found that median and TOBMI imputations achieve significantly higher AUPRC than missForest (see Additional file 1: Fig. S5). Yet, the overall performance of the three imputation methods is similar.

Hold-out validation displays similar results to cross-validations

Phenotypes in biological studies are typically imbalanced; for example, most binary traits have fewer cases than controls [39]. To examine the performance of each prediction method on a balanced data set without imputation, we trained each method using all the 300 subjects in fivefold cross-validations, then evaluated them using an additional 32 subjects with 16 healthy controls and 16 asthmatic cases, respectively. Again, we found that LR and MOGONET show superior performance over other methods, i.e., the Accuracy, F1, AUROC and AUPRC of LR were 0.78, 0.74, 0.70 and 0.72, respectively, and 0.69, 0.59, 0.66 and 0.75 for MOGONET (see Fig. 5; Additional file 1: Fig. S6). In addition, we found that the combination of miRNA and mRNA achieves the highest Accuracy and AUPRC. Yet, the combination of miRNA and microbiome data can produce the highest F1 and AUROC (see Additional file 1: Fig. S7).
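The construction of a balanced, matched hold-out set (16 cases, each paired with a control of the same race and clinical center) can be sketched as below. This is an assumption-laden illustration rather than the authors' procedure: the record fields (`id`, `asthma`, `race`, `center`) and the function name are hypothetical, and each control is used at most once.

```python
import random


def matched_holdout(subjects, n_cases, seed=0):
    """Randomly pick n_cases cases and one matched control for each case.

    subjects: list of dicts with keys "id", "asthma" (1 or 0), "race", "center".
    A control matches a case when race and clinical center are identical.
    """
    rng = random.Random(seed)
    cases = [s for s in subjects if s["asthma"] == 1]
    controls = [s for s in subjects if s["asthma"] == 0]
    chosen_cases = rng.sample(cases, n_cases)
    chosen_controls, used = [], set()
    for case in chosen_cases:
        pool = [
            c for c in controls
            if c["id"] not in used
            and c["race"] == case["race"]
            and c["center"] == case["center"]
        ]
        if not pool:
            raise ValueError(f"no unused matched control for case {case['id']}")
        match = rng.choice(pool)
        used.add(match["id"])
        chosen_controls.append(match)
    return chosen_cases, chosen_controls
```

The subjects not selected here would form the cross-validation set, so that the hold-out subjects never influence feature selection or model training.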

Utilizing covariates can further improve the prediction performance for particular omics combinations

To evaluate whether including covariates together with omics data can further improve the prediction performance, we added covariates associated with each subject, e.g., the father’s and mother’s asthma status, race, and vitamin D level, to the prediction model. The preceding hold-out validation using all 63 omics combinations showed that the combination of miRNA and mRNA or the combination of miRNA and microbiome data reaches the optimal performance for most of the prediction methods. Here we investigated the influence of covariates by examining the performance of each method before and after including those covariates in addition to the best-performing omics combinations. As those covariates cannot be included easily in all prediction models (e.g., MOGONET would require treating these covariates as an additional omics data type), we focused on two promising methods, LR and LRCV, that can exploit all predictors on an equal footing. We found that the impact of covariates on the asthma prediction depends on the omics used, e.g., including covariates further improves the prediction for the miRNA and mRNA combination for both LR and LRCV, regardless of the performance metric (see Fig. 6a). Yet, including those covariates decreases the prediction performance for the miRNA and microbiome combination (see Fig. 6b). To understand this difference, we compared the coefficients of each covariate in LR between the two omics combinations, finding that the coefficients from the two omics combinations display a positive correlation. Yet, we do find that some covariates, such as the mother’s history of eczema or atopic dermatitis, the mother’s marriage status and the mother’s history of hay fever or allergic rhinitis, are associated with high coefficients in one combination but not in the other.

Discussion

The global prevalence, morbidity, mortality and economic burden of children’s asthma have significantly increased in the past 40 years [1]. Predicting asthma development for children is imperative to understand the etiology of the disease and identify suitable interventions [10]. Yet, many diseases (including asthma) are heterogeneous, which renders the prediction of the disease status a big challenge. Here, we leveraged the rich omics data collected in the VDAART cohort, examining the existing classification methods in the prediction of children’s asthma development at year 3 using multi-omics data collected at/before year 1. Our results imply that including a subset of all types of omics data is helpful in asthma outcome prediction; in particular, a combination of transcriptional, genomic and microbiome data can achieve optimal prediction. In addition, the imputation methods for missing values do not show a significant impact on the prediction.

Our analysis of the impact of covariates on the asthma development prediction suggests that including the covariates in the prediction models does not always improve the performance. This also implies that the conclusions drawn from VDAART may be valid in other cohorts, e.g., those composed of subjects with a different racial distribution, since race is not an important predictor in this study. However, we acknowledge the importance of replicating these findings in additional diverse populations.

Vitamin D can impact the development of the lung and immune system during the fetal and early postnatal periods [40, 41]; thus, deficiency of vitamin D in pregnancy may be important in early asthma and wheezing. The VDAART Randomized Clinical Trial reported that the 3-year incidence of asthma or recurrent wheeze in the infants was 24.3% with a 4400-IU/d supplement and 30.4% with a 400-IU/d supplement [17]. This reduction suggests that supplementation of vitamin D may be an important intervention for child health. The prediction of children’s asthma development after including the covariates indicates that vitamin D level is associated with a reduction (negative coefficient) in the relative risk of asthma if the prediction is accurate, for instance, when using the combination of miRNA and microbiome omics data types. This confirms that supplementation of vitamin D in pregnancy can reduce the risk of asthma for children.

Omics data usually contains missing values. Integrating omics data typically requires all omics data types to be available for each subject, which becomes more challenging as more types of omics data are included. Data imputation enables us to systematically examine the impact of each omics data type in the prediction of disease status. Our results demonstrate that the top-performing methods, e.g., Logistic Regression, still outperformed the other methods when using combinations of non-imputed omics, e.g., miRNA and microbiome.

Availability of data and materials

The data presented in this study are available upon request.

References

Braman SS. The global burden of asthma. Chest. 2006;130:4S-12S.
Article PubMed Google Scholar
Caffrey Osvald E, Bower H, Lundholm C, et al. Asthma and all-cause mortality in children and young adults: a population-based study. Thorax. 2020;75:1040–6.
Article PubMed Google Scholar
Di Resta C, Galbiati S, Carrera P, et al. Next-generation sequencing approach for the diagnosis of human diseases: open challenges and new opportunities. Ejifcc. 2018;29:4.
PubMed PubMed Central Google Scholar
Grada A, Weinbrecht K. Next-generation sequencing: methodology and application. J Invest Dermatol. 2013;133: e11.
Article PubMed Google Scholar
Kilpinen H, Barrett JC. How next-generation sequencing is transforming complex disease genetics. Trends Genet. 2013;29:23–30.
Article CAS PubMed Google Scholar
Ku CS, Naidoo N, Wu M, et al. Studying the epigenome using next generation sequencing. J Med Genet. 2011;48:721–30.
Article CAS PubMed Google Scholar
Bersanelli M, Mosca E, Remondini D, et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 2016;17:S15.
Article Google Scholar
Graw S, Chappell K, Washam CL, et al. Multi-omics data integration considerations and study design for biological systems and disease. Mol Omics. 2021;17:170–85.
Article CAS PubMed PubMed Central Google Scholar
Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18:83.
Article PubMed PubMed Central Google Scholar
Subramanian I, Verma S, Kumar S, et al. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights. 2020;14:117793221989905.
Article Google Scholar
Picard M, Scott-Boyer M-P, Bodein A, et al. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;19:3735–46.
Article CAS PubMed PubMed Central Google Scholar
Xie G, Dong C, Kong Y, et al. Group lasso regularized deep learning for cancer prognosis from multi-omics and clinical features. Genes. 2019;10:240.
Article CAS PubMed PubMed Central Google Scholar
Chaudhary K, Poirion OB, Lu L, et al. Deep learning-based multi-omics integration robustly predicts survival in liver cancerusing deep learning to predict liver cancer prognosis. Clin Cancer Res. 2018;24:1248–59.
Article CAS PubMed Google Scholar
Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12:3445.
Article CAS PubMed PubMed Central Google Scholar
Rohart F, Gautier B, Singh A, et al. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13: e1005752.
Article PubMed PubMed Central Google Scholar
Group CAMPR. The childhood asthma management program (CAMP): design, rationale, and methods. Controlled clinical trials 1999; 20:91–120.
Litonjua AA, Carey VJ, Laranjo N, et al. Effect of prenatal supplementation with vitamin D on asthma or recurrent wheezing in offspring by age 3 years: the VDAART randomized clinical trial. JAMA. 2016;315:362–70.
Article CAS PubMed PubMed Central Google Scholar
Weiss ST, Litonjua AA. Can we prevent childhood asthma before birth? Summary of the VDAART results so far. Expert Rev Respir Med. 2016;10:1039–40.
Article CAS PubMed PubMed Central Google Scholar
Galant SP, Morphew T, Amaro S, et al. Current asthma guidelines may not identify young children who have experienced significant morbidity. Pediatrics. 2006;117:1038–45.
Article PubMed Google Scholar
Buitinck L, Louppe G, Blondel M, et al. API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238 2013.
Arik SO, Pfister T. TabNet: Attentive Interpretable Tabular Learning. arXiv:1908.07442 [cs, stat] 2020.
Lin E, Mukherjee S, Kannan S. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis. BMC Bioinformatics. 2020;21:1–11.
Article Google Scholar
Wang D, Gu J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics Proteomics Bioinformatics. 2018;16:320–31.
Article PubMed PubMed Central Google Scholar
Leclercq M, Vittrant B, Martin-Magniette ML, et al. Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data. Front Genet. 2019;10:452.
Article CAS PubMed PubMed Central Google Scholar
Moon KR, van Dijk D, Wang Z, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol. 2019;37:1482–92.
Article CAS PubMed PubMed Central Google Scholar
Bommert A, Sun X, Bischl B, et al. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal. 2020;143: 106839.
Article Google Scholar
Du W, Cao Z, Song T, et al. A feature selection method based on multiple kernel learning with expression profiles of different types. BioData Mining. 2017;10:4.
Article PubMed PubMed Central Google Scholar
Li Y, Ge X, Peng F, et al. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol. 2022;23:79.
Article CAS PubMed PubMed Central Google Scholar
Zhuang J, Widschwendter M, Teschendorff AE. A comparison of feature selection and classification methods in DNA methylation studies using the illumina infinium platform. BMC Bioinformatics. 2012;13:59.
Article CAS PubMed PubMed Central Google Scholar
Sordillo JE, Lutz SM, Jorgenson E, et al. A polygenic risk score for asthma in a large racially diverse population. Clin Exp Allergy. 2021;51:1410–20.
Article PubMed PubMed Central Google Scholar
Ferreira MA, Mathur R, Vonk JM, et al. Genetic architectures of childhood-and adult-onset asthma are partly distinct. Am J Hum Genet. 2019;104:665–84.
Article CAS PubMed PubMed Central Google Scholar
Dong X, Lin L, Zhang R, et al. TOBMI: trans-omics block missing data imputation using a k-nearest neighbor weighted approach. Bioinformatics. 2019;35:1278–83.
Article CAS PubMed Google Scholar
Stekhoven DJ, Buhlmann P. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28:112–8.
Article CAS PubMed Google Scholar
Francisco-Garcia AS, Garrido-Martín EM, Rupani H, et al. Small RNA species and microRNA profiles are altered in severe asthma nanovesicles from broncho alveolar lavage and associate with impaired lung function and inflammation. Noncoding RNA. 2019;5:51.
Article CAS PubMed PubMed Central Google Scholar
Kho AT, Sharma S, Davis JS, et al. Circulating MicroRNAs: association with lung function in asthma. PLoS ONE. 2016;11: e0157998.
Article PubMed PubMed Central Google Scholar
Alexandrova E, Miglino N, Hashim A, et al. Small RNA profiling reveals deregulated phosphatase and tensin homolog (PTEN)/phosphoinositide 3-kinase (PI3K)/Akt pathway in bronchial smooth muscle cells from asthmatic patients. J Allergy Clin Immunol. 2016;137:58–67.
Article CAS PubMed Google Scholar
Gysens F, Mestdagh P, de Bony de Lavergne E, et al. Unlocking the secrets of long non-coding RNAs in asthma. Thorax. 2022;77:514–22.
Article PubMed Google Scholar
Carty M, Kearney J, Shanahan KA, et al. Cell survival and cytokine release after inflammasome activation is regulated by the Toll-IL-1R protein SARM. Immunity. 2019;50:1412-1424.e6.
Article CAS PubMed Google Scholar
Zhou W, Nielsen JB, Fritsche LG, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335–41.
Article CAS PubMed PubMed Central Google Scholar
Zosky GR, Berry LJ, Elliot JG, et al. Vitamin D deficiency causes deficits in lung function and alters lung structure. Am J Respir Crit Care Med. 2011;183:1336–43.
Yurt M, Liu J, Sakurai R, et al. Vitamin D supplementation blocks pulmonary structural and functional changes in a rat model of perinatal vitamin D deficiency. Am J Physiol Lung Cell Mol Physiol. 2014;307:L859–67.
Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316:533–4.
Doersch C. Tutorial on Variational Autoencoders. arXiv:1606.05908 [cs, stat] 2016.
Arnold TB. kerasR: R Interface to the keras deep learning library. J Open Source Softw. 2017;2:296.
Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46:175–85.
Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:1–27.
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119–39.
Hastie T, Rosset S, Zhu J, et al. Multi-class AdaBoost. Stat Interface. 2009;2:349–60.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013;7:21.
Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition. 1995;1:278–82.
Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.
Loh W-Y. Classification and regression trees. Wiley Interdiscip Rev Data Mining Knowl Discov. 2011;1:14–23.
Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63:3–42.
McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization. 1998;752:41–8.
Zhang H. The optimality of naive Bayes. Aa. 2004;1:3.
Hinton GE. Connectionist learning procedures. Mach Learn. 1990; 555–610.

Acknowledgements

We wish to thank all VDAART participants. We thank John P. Ziniti, Rob Chase, Kathleen Lee-Sarwar, Nancy Laranjo, Mike McGeachie, Hooman Mirzakhani, Priyadarshini Kachroo, and Jody Sylvia for preparing the omics data. We thank Kimberly Glass, Arda Halu, and Enrico Maiorino for valuable discussion.

Funding

VDAART was supported by grant U01HL091528 from the NHLBI. YYL was supported by the National Institutes of Health (grant numbers: R01AI141529, R01HD093761, RF1AG067744, UH3OD023268, U19AI095219, U01HL089856).

Author information

Authors and Affiliations

Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA
Xu-Wen Wang, Tong Wang, Can Chen, Zheng Sun, Shanlin Ke, Julian Hecker, Oana A. Zeleznik, Dawn L. DeMeo, Jessica Lasky-Su, Edwin K. Silverman, Yang-Yu Liu & Scott T. Weiss
Department of Mathematics, University of Hamburg, 21109, Hamburg, Germany
Darius P. Schaub
Genetics and Aging Research Unit, Department of Neurology, McCance Center for Brain Health, Mass General Institute for Neurodegenerative Disease, Massachusetts General Hospital, Harvard Medical School, Charlestown, MA, USA
Anna Maaser-Hecker
Department of Radiation Oncology, Brigham and Women’s Hospital, Boston, MA, USA
Roman Zeleznik
Division of Pediatric Pulmonology, Golisano Children’s Hospital, Rochester, NY, USA
Augusto A. Litonjua
Center for Artificial Intelligence and Modeling, The Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Yang-Yu Liu

Contributions

YYL and STW conceived and designed the project. XWW, TW, DPS, CC, ZS, SLK, JH, AMH, OAZ, and RZ evaluated the methods. XWW and YYL drafted the manuscript. All authors contributed to the interpretation of results and revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yang-Yu Liu or Scott T. Weiss.

Ethics declarations

Ethics approval and consent to participate

IRB approval for VDAART was obtained from each of the three clinical centers (Washington University in St. Louis, Kaiser Health Care San Diego, and Boston Medical Center) and from the Data Coordinating Center (Brigham and Women’s Hospital in Boston). Study subjects provided written informed consent.

Consent for publication

Not applicable.

Competing interests

In the past three years, EKS received grant support from GlaxoSmithKline and Bayer. Scott T. Weiss receives author royalties from UpToDate and is on the Board of Directors of Histolix. Jessica Lasky-Su is a consultant for TruDiagnostic and is on the Scientific Advisory Board of Precion. Augusto A. Litonjua is on the Data Safety Monitoring Board of PreCISE Network and receives author royalties from UpToDate. The other authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1: Prediction performance of each prediction method in cross-validation. Figure S2: Prediction performance of classification models using all six omics combinations in cross-validation imputed by TOBMI. Figure S3: Prediction performance of classification models using all six omics combinations in cross-validation imputed by missForest. Figure S4: Omics combination importance in cross-validation. Table S1: Important omics biomarkers identified by MOGONET using genome, miRNA and mRNA data. Figure S5: Performance comparison between different imputation methods. Figure S6: Omics combination importance in hold-out validation. Figure S7: Prediction performance of each method in hold-out validation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wang, XW., Wang, T., Schaub, D.P. et al. Benchmarking omics-based prediction of asthma development in children. Respir Res 24, 63 (2023). https://doi.org/10.1186/s12931-023-02368-8

Received: 04 January 2023
Accepted: 16 February 2023
Published: 26 February 2023
DOI: https://doi.org/10.1186/s12931-023-02368-8
