A comprehensive study on machine learning models combining with oversampling for bronchopulmonary dysplasia-associated pulmonary hypertension in very preterm infants

Background Bronchopulmonary dysplasia-associated pulmonary hypertension (BPD-PH) remains a devastating clinical complication seriously affecting the therapeutic outcome of preterm infants. Hence, early prevention and timely diagnosis prior to pathological change is the key to reducing morbidity and improving prognosis. Our primary objective is to utilize machine learning techniques to build predictive models that could accurately identify BPD infants at risk of developing PH. Methods The data utilized in this study were collected from neonatology departments of four tertiary-level hospitals in China. To address the issue of imbalanced data, oversampling algorithms synthetic minority over-sampling technique (SMOTE) was applied to improve the model. Results Seven hundred sixty one clinical records were collected in our study. Following data pre-processing and feature selection, 5 of the 46 features were used to build models, including duration of invasive respiratory support (day), the severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH. Four machine learning models were applied to predictive learning, and after comprehensive selection a model was ultimately selected. The model achieved 93.8% sensitivity, 85.0% accuracy, and 0.933 AUC. A score of the logistic regression formula greater than 0 was identified as a warning sign of BPD-PH. Conclusions We comprehensively compared different machine learning models and ultimately obtained a good prognosis model which was sufficient to support pediatric clinicians to make early diagnosis and formulate a better treatment plan for pediatric patients with BPD-PH.


Introduction
With the ongoing advancements in perinatal technology, there has been a remarkable increase in the success rate of treating very and extremely preterm infants diagnosed with bronchopulmonary dysplasia (BPD) [1].However, some BPD infants may develop pulmonary hypertension (PH) which significantly impacts their mortality rate.Studies have demonstrated that the incidence of PH in preterm infants with BPD is high as 26.8% in other countries [2] with the mortality rate ranging from 14 to 38%, [3,4] or even 50% in some countries [5,6].Even survivors still face short-and long-term adverse complications.The pathogenesis of BPD-PH remains unclear, and there is no safe and effective treatment for BPD-PH at present.Early prevention and timely diagnosis prior to pathological change are the key to reducing morbidity and improving prognosis [7].How to identify infants at risk for BPD-PH has therefore become one of the hotspots of research for pediatric clinicians.
So far, there have been few studies reporting the establishment of BPD-PH predictive models [8].Individual studies reporting the establishment of BPD-PH predictive models used regression coefficients to indicate the degree of correlation between risk factors and diseases, the algorithm that they used was relatively simple, and the results could not accurately give specific probability values.With the development of medical and health big data infrastructure, the form and quantity of medical data continue to improve, which gives us a chance to address the aforementioned technical caveats with a larger scale of training and more rigorous validation.In addition, to cope with the commonly seen unbalancing issue among real-world datasets, some machine learning techniques, such as oversampling could be tried out.Synthetic minority over-sampling technique (SMOTE) [9] is an oversampling method based on neural networks, which simulates the learning process of neural network to learn missing values and outliers.In this retrospective study, we intended to answer the following research questions by using the inpatient medical records from four different major tertiary-level medical centers in China: RS1 -Can machine learning methods support BPD-PH risk prediction in a more practical clinical setting across different organizations?RS2 -Can oversampling techniques such as SMOTE help improve the prediction results?

Data collection Patients
This study was approved by the research ethics board of the Seventh Medical Center of PLA General Hospital (No. 2022-02), and consisted with the Declaration of Helsinki.Informed consent was waived.The study population included very preterm infants (VPIs) with BPD who were born or admitted to the Seventh Medical Center of PLA General Hospital (Beijing, Chia), Qingdao Women and Children's Hospital (Qingdao, China), Tianjin Central Hospital of Gynecology Obstetrics (Tianjin, China), and Guangdong Women and Children Hospital (Guangzhou, China) between August 1, 2015 and February 28, 2022.A total of 801 VPIs with BPD less than 32 weeks of gestational age were collected.Among these, there are 626 newborns in The Seventh Medical Center of PLA General Hospital, 69 newborns in Qingdao Women and Children's Hospital, 59 newborns in Tianjin Central Hospital of Gynecology Obstetrics, and 47 newborns in Guangdong Women and Children Hospital.The following criteria were applied to construct the initial dataset: Inclusion criteria were VPIs with gestational age less than 32 gestational weeks who were diagnosed with BPD.The diagnostic criteria and severity of BPD were based on the definition proposed in the conference of the National Institute of Child Health and Human Development in June 2000 [10], when their gestational age was corrected at 36 weeks.
Exclusion criteria were VPIs 1) at an admission age > 3 days or those who were not hospitalized beyond 36 weeks' corrected gestation (due to discharge or death); 2) with congenital lung diseases, other anatomical abnormalities (such as diaphragmatic hernia or thoracic malformation), congenital heart disease except for patent ductus arteriosus (PDA), patent foramen ovale (PFO), atrial septal defect (ASD) and small interventricular septal defect (VSD) (defect diameter < 5 mm/m 2 of body surface area) leading to PH.
VPIs with an admission age > 3 days or those hospitalized for < 36 weeks' corrected gestation were considered valid subjects, knowing that admission age > 3 days may result in missing maternal and neonatal clinical data, while hospitalization < 36 weeks' corrected gestation may affect the diagnosis of severity of BPD and PH.Therefore, a total of 40 cases of BPD infants were excluded and 761 cases of BPD infants were enrolled finally.
According to the results of Doppler echocardiography at least over 36 weeks' corrected gestation, the patients were divided into BPD-PH group and BPD group.

Clinical features
A total of 46 clinical features which are the same with variables were collected by reviewing the medical records during hospitalization, including maternal pregnancy factors, newborn clinical data, conditions of treatment and echocardiography-related information.
Maternal pregnancy factors included cesarean section, hypertension in pregnancy, gestational diabetes mellitus, intrauterine distress, preeclampsia, multiple births, natural conception, placental abnormality, placental abruption, premature rupture of the membrane (PROM), duration of PROM more than 18 h, and oligohydramnios.
Conditions of treatment included duration of invasive respiratory support (day), duration of nasal cannula oxygenation (day), duration of noninvasive respiratory support (day), corticosteroids used for treatment of BPD, usage of pulmonary surfactant (PS), a single dose, multiple doses, usage of vasoactive agents, usage of caffeine, and ligation of the PDA.
Echocardiography-related information included PFO, maximum diameter of PDA, VSD, ASD, the peak velocity of tricuspid regurgitant flow, velocity and direction of blood flow in PDA, shunt at the level of PFO, shunt at the level of VSD, and position of the ventricular septum.
These clinical features cover most of the factors that may affect the disease, allowing subsequent models to more accurately predict whether patients are at risk of developing the disease.The diagnostic criteria referred to related criteria [15,16].And all diagnostic and treatment times are before or at the time of correcting the gestational age of 36 weeks except BPD-PH.Early-onset PH was defined as PH occurring between 72 h and 14 days after birth [17].BPD-PH was defined as PH occurring after 36 weeks' corrected gestation [18].PH echocardiographic diagnostic criteria, ① Right ventricular systolic pressure (RVSP) > 35 mmHg (1 mmHg = 0.133 kPa); RVSP = (tricuspid valve flow velocity) 2 × 4 + right atrial pressure (usually 5 mmHg); ② RVSP/ sBP ratio > 0.5; ③ Any VSD or PDA with bidirectional or right-to-left shunt; ④ If there is no tricuspid valve regurgitation or shunt, then meet 2 of the following 3 criteria: ① Any degree of flattening of the ventricular septum; ②Right ventricular dilation; ③Right ventricular hypertrophy [19].It is also noteworthy that the dataset exhibited an imbalance between the amount of BPD-PH data and the amount of data without PH.

Data preprocessing
The dataset was divided into two parts: 80% for training and 20% for validation.In the preprocessing stage of the study, several steps (including categorical feature encoding, missing value imputation and feature selection) were performed to prepare the training datasets for the machine learning methods.Firstly, the features of the BPD-PH status are described in (Table 1).A p-value < 0.05 was considered statistically significant.27 statistically significant indicators would be used in next steps.Secondly, categorical features were encoded into numerical form.The rules (Table 2) were utilized.Then, missing values in the datasets were addressed through imputation.The missing values in continuous features were replaced with the mean value of the respective feature, while the missing values in categorical features were replaced with the most frequent category.Lastly, in statistically significant indicators in single factor analysis, SelectKBest was utilized to select the most relevant features and evaluation function f_regression was applied to assess the correlation between the features and the target features [20].The top 10 features with the highest scores (Table 3) are listed for analysis, constituting a better feature set for subsequent model training.

Oversampling
In our study population, the BPD-PH group accounted for 13% which was significantly lower than that of BPD group.Due to the lack of sufficient data, the classifier lacked the ability to characterize the BPD-PH group, making effective classification difficult.To reduce the impact of data imbalance and increase the accuracy of the model, oversampling of rare classes was used.For comparison, the number of samples in BPD-PH group was increased by using SMOTE.

Classification model and evaluation indicator
In our study, multiple machine learning methods were used and compared.These methods included multivariate logistic regression [21,22], decision tree [23], random forest [24] and neural network [22].Their performance was evaluated and compared on the same dataset to determine which method performed best for prediction.
To improve the training results, hyperparameters were adjusted separately for each method.The multivariate predictive model results were evaluated under the metrics accuracy, sensitivity, specificity and negative predictive value (NPV).

Results
During

Feature selection
Based on the scores and rankings (Table 3), 5 features were chosen for model training and functional evaluation including the duration of invasive respiratory support, the severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH.

Training and evaluation
Training was performed by four machine learning methods and two sampling methods (original and SMOTE).Table 4 shows the results of different groups.When the model was trained directly using the training datasets without any data oversampling, it was possible to achieve a high accuracy and specificity.The accuracy scores, in logistics regression and neural network groups, could both achieve over 90%.However, due to imbalanced data, the classifier lacked the ability to characterize the BPD-PH group.So, the sensitivity, which indicates the detection rate of BPD-PH, tended to be low before oversampling.The accuracy scores in decision tree and random forest were lower than those in the previously mentioned groups, but the sensitivity was higher.The sensitivity of the model using SMOTE were significantly improved in logistics regression and neural network groups, but the effect was relatively small in random forest and decision tree groups.Logistics regression and neural network groups exhibited the highest sensitivity (93.8%) and also relatively higher in other indicators.The final choice was the logistic regression model, because it provided interpretable results.The coefficients associated with each independent features could be easily interpreted as the log odds ratio, allowing us to understand the direction and magnitude of the effects.However, neural networks could capture complex patterns and relationships in large datasets.As the size of the data increased and the quality improved, neural networks may become more effective, so it is worthwhile to try out in our future work.If the value of logit(P) was greater than 0 (i.e.logit(P) = 0, odds = P/(1-P) = 1, P = 0.5), it was predicted as a positive class and otherwise as a negative class.A nomogram was developed based on the predictive model to improve the convenience of the model in clinical practice (Fig. 3).

Discussion
In this study, we established a good model of predicting PH in VPIs with BPD by collecting the clinical data from four tertiary-level hospitals in China by using an artificial intelligence algorithm and different models to select the final indicators.By using advanced machine learning techniques, we could identify some important clinical factors associated with BPD-PH, such as, duration of invasive respiratory support (day), the severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH.
An ability to accurately predict BPD-PH and then taking measures to intervene were clinically important to avoid the bad survival outcome of infants with BPD.Presently, there exists a deficiency in a dependable tool that can accurately forecast BPD-PH during its initial  stages.In this study, through using oversampling algorithms and multi-institutional clinical data, we finally had an overall good performance.These indicators are readily available clinically, which do not cause harm to children, and neonatologists can use the score corresponding to the formula to visually assess the risk of BPD-PH in VPIs.Scores greater than 0 should be considered a warning sign for developing BPD-PH.Compared with the study of Collaco et al. [25] our study had the same patient criteria for all data and did a calibration test.Compared with the study of Trittmann et al. [26] The AUC of our study was obviously higher and the data of our study were more accessible.In addition, we had a larger sample size and did verification.Although the AUC of our study was lower than that of the study by WANG et al. [27] our study was multi-institutional with a larger sample size.Multi-institutional study involves collecting data from different regions and healthcare institutions, thereby increasing the representativeness and generalizability of the samples.Additionally, multi-center data can help validate and verify the consistency and universality of research findings.Therefore, utilizing multi-center data for disease prediction provides more reliable and comprehensive information.The sample size was large enough in our study to obtain good results and we combined machine learning techniques.It is our hope that the model developed herein will help neonatologists to identify VPIs at risk of BPD-PH in time in the future and help pediatric clinicians to reduce the incidence and mortality of BPD-PH.We found duration of invasive respiratory support (day) is clearly correlated with BPD-PH.Due to lung immaturity of VPIs, respiratory support is an important treatment for BPD patients.During the process of ventilator treatment, lung inflammation and capillary endothelial cell damage are more likely to occur in immature lung tissues.Vascular remodeling and pulmonary artery pressure increased [18].Therefore, protective ventilation strategies should be adopted for preterm infants to minimize the injury caused by mechanical ventilation.
Ventilator-associated pneumonia, the severity of BPD, early-onset PH are also important factors in the occurrence of BPD-PH in our study.Ventilator-associated pneumonia is an independent risk factor for the development of BPD-PH, and is also a postnatal respiratory injury that disrupts the growth of pulmonary vessels and alveoli [28].The presence of lung damage caused by prolonged ventilator use increases the number of inflammatory cells and mediators in the systemic circulation [29], and repeated inflammatory infections of the lungs make ventilator evacuation difficult, thus prolonging the duration of mechanical ventilation, which forms a vicious circle.Similarly, many studies have shown that moderate to severe BPD is associated with the development of BPD-PH [30][31][32], which is consistent with our study.Preterm infants with moderate to severe BPD often have severe lung developmental disorders, and long-term mechanical ventilation induced pulmonary inflammation and aggravates alveolar epithelial cell damage, resulting in PH finally.And a prospective study [17] also found that early-onset PH was significantly associated with BPD progression and BPD-PH in preterm infants.Therefore, early-onset PH is an independent risk factor for the development of BPD-PH [33], which is also supported by the model developed in this study.Animal studies have shown that an increase in pulmonary vascular pressure may directly damage vascular growth and alveolarization during lung development.In a study using a chronic intrauterine pulmonary hypertension model in fetal sheep, Grover et al. [34] found that chronic PH resulted in thickening of the pulmonary arteriole walls, decreased pulmonary arterial density, and simplified alveoli.These studies suggest that early-onset PH may play a role in the progression of BPD-PH by early impairing lung development and affecting vascular and alveolar development.These clinical factors are closely related and interact with each other, ultimately leading to the occurrence of BPD-PH.
In addition, our study suggests that pulmonary hemorrhage is also an important factor in the development of BPD-PH.This is an interesting finding, as few previous studies have reported pulmonary hemorrhage as a risk factor for BPD-PH.VPIs with pulmonary hemorrhage was mainly caused by other primary diseases such as infection and often had lower gestational age and body weight, and their lung development is absolutely immature.Previous studies [35] have shown that pulmonary hemorrhage children had a high incidence of BPD.So, these children often have higher requirements for ventilator parameters when pulmonary hemorrhage occurs, and the longer the usage of ventilator, the more severe the damage to the airway, pulmonary blood vessels and lung interstitium, and the more difficult to wean the ventilator, leading to an increased incidence of BPD-PH.Therefore, prevention and timely and effective treatment of pulmonary hemorrhage are beneficial in preventing the occurrence of BPD-PH.
This study has some strengths and weaknesses.In this study, we used the oversampling technique to address the issue of imbalanced positive and negative cases, finding that it could significantly improve the accuracy of the results.By using the oversampling technique, we artificially increased the number of minority class samples, ensuring a more balanced representation of both positive and negative cases in our dataset.SMOTE generated synthetic examples of PH-group, effectively increasing its presence in the training data.This helps the model learn and generalize better, improving its ability to accurately classify both positive and negative instances.Our study used four machine learning methods for training, and selected the optimal model through comparison.The advantage of using multiple machine learning methods lies in the ability to analyze and understand the data from different perspectives.Each method has its own underlying assumptions and algorithms, which can capture unique patterns and insights within the dataset.By comparing the results obtained from different methods, we can gain a comprehensive understanding of the data and identify the best-performing model.We found that decision tree and random forest performed relatively well before oversampling, but logistic regression performed better after oversampling.Both logistics regression and neural network performed well after oversampling and the final choice was the logistic regression model, because it provided interpretable results.However, as the size of the data increases and the quality improves, neural networks may become more effective, so it is worthwhile to try out in our future work.This study also had some limitations.Firstly, this model is mainly suitable for patients at or after the time of correcting the gestational age of 36 weeks, but for patients who are less than 36 weeks correcting the gestational age, we can also use it after inferring based on their condition.Then we can intervene with patients early or close follow-up should be conducted after these high-risk children patient is discharged.Secondly, our model involves four centers, but one center providing the bulk of the samples.This may limit the generalizability beyond the centers that provided the dataset.We need to continue to collect more positive cases from more centers for verification to improve the accuracy of the model and retry a neural network way to build a better predictive model.Thirdly, due to the unbalanced data, oversampling method was used in our study.However, oversampling may promote overfitting of the model to minority class samples, reducing generalizability beyond the dataset.We are going to carry out a prospective cohort study and follow-up to overcome this problem.

Conclusion
In this study, we established a predictive model of BPD-PH by using five most significant of 46 clinical features.The predictive model could help clinicians to make early diagnosis and formulate better treatment plans for VPIs with BPD-PH in that it presented good performance for prediction and offered an AUC of 93.3%.Of course, larger-sample studies using other machine learning techniques to develop more BPD-PH predictive models are required to verify the findings and conclusions of the present study.

Figure 2
Figure 2 shows the result of the selection model.The area under the ROC curve (AUC) of this predictive model was 0.933.With the regression analysis, a predictive model was finally established.

Fig. 1 A
Fig. 1 A composition chart of study participants

Fig. 2 Fig. 3 A
Fig. 2 The performance of the selected model.(a) The confusion matrix and (b) the ROC curve.The AUC of this predictive model is 0.93

Table 1
Clinical characteristics between PH group and non-PH group The distribution of continuous features in the table is expressed by mean ± SD.The p-value based on t test for continuous features and chi-square for categorical features Abbreviations: PDA patent ductus arteriosus, BPD bronchopulmonary dysplasia, PH pulmonary hypertension, NRDS neonatal respiratory distress syndrome, NEC neonatal necrotizing enterocolitis, ROP retinopathy of prematurity, SGA small for gestational age, MAS meconium aspiration syndrome, PROM premature rupture of the membrane, PS pulmonary surfactant, VSD ventricular septal defect, ASD atrial septal defect, PFO patent foramen ovale, SD standard deviation

Table 2
Value pairs before and after encoding

Table 3
Top 10features with the highest regression scores Abbreviations: BPD bronchopulmonary dysplasia, PH pulmonary hypertension, PDA patent ductus arteriosus

Table 4
The evaluating indicators of the models Abbreviations: SMOTE Synthetic minority over-sampling technique, NPV Negative predictive value, AUC Area under the ROC curve