Table 2 Performance metrics for each model in development, validation, and cumulative cohorts

From: An artificial intelligence approach for predicting death or organ failure after hospitalization for COVID-19: development of a novel risk prediction tool and comparisons with ISARIC-4C, CURB-65, qSOFA, and MEWS scoring systems

| Dataset | Model | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) | F1 score* | Kappa† | Brier score¶ | Cut-off | Youden index** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Development | NN | 0.907 (0.885–0.929) | 0.973 (0.961–0.982) | 0.495 (0.423–0.567) | 0.913 (0.895–0.928) | 0.770 (0.686–0.840) | 0.942 | 0.547 | 0.078 | 0.162 | 0.670 |
| Development | SVM | 0.897 (0.874–0.920) | 0.969 (0.957–0.979) | 0.449 (0.378–0.521) | 0.905 (0.886–0.921) | 0.727 (0.639–0.804) | 0.936 | 0.495 | 0.085 | 0.13 | 0.654 |
| Development | GBM | 0.914 (0.893–0.935) | 0.982 (0.972–0.989) | 0.439 (0.368–0.511) | 0.905 (0.886–0.921) | 0.819 (0.732–0.887) | 0.942 | 0.519 | 0.077 | 0.152 | 0.678 |
| Development | EN | 0.908 (0.886–0.930) | 0.969 (0.957–0.979) | 0.505 (0.433–0.577) | 0.914 (0.896–0.930) | 0.750 (0.667–0.821) | 0.941 | 0.547 | 0.078 | 0.09 | 0.673 |
| Development | LR | 0.900 (0.878–0.923) | 0.968 (0.956–0.978) | 0.500 (0.428–0.572) | 0.913 (0.895–0.929) | 0.742 (0.659–0.815) | 0.940 | 0.540 | 0.080 | 0.143 | 0.657 |
| Development | Comparisons, χ² | 1.500 | 4.213 | 1.500 | 1.371 | 2.833 | 4 | 4 | 4 | | |
| Development | Comparisons, p | 0.827 | 0.378 | 0.827 | 0.849 | 0.586 | 0.406 | 0.406 | 0.406 | | |
| Validation | NN | 0.852 (0.804–0.900) | 0.969 (0.949–0.983) | 0.427 (0.318–0.541) | 0.904 (0.874–0.929) | 0.714 (0.567–0.834) | 0.935 | 0.474 | 0.087 | 0.289 | 0.564 |
| Validation | SVM | 0.851 (0.804–0.898) | 0.974 (0.954–0.986) | 0.366 (0.262–0.480) | 0.895 (0.865–0.921) | 0.714 (0.554–0.843) | 0.933 | 0.424 | 0.089 | 0.147 | 0.556 |
| Validation | GBM | 0.849 (0.800–0.898) | 0.974 (0.954–0.986) | 0.341 (0.240–0.454) | 0.892 (0.861–0.917) | 0.700 (0.535–0.834) | 0.931 | 0.399 | 0.089 | 0.216 | 0.587 |
| Validation | EN | 0.851 (0.802–0.899) | 0.965 (0.944–0.980) | 0.439 (0.330–0.553) | 0.905 (0.876–0.930) | 0.692 (0.549–0.813) | 0.934 | 0.475 | 0.088 | 0.216 | 0.571 |
| Validation | LR | 0.856 (0.809–0.903) | 0.967 (0.946–0.981) | 0.402 (0.296–0.517) | 0.900 (0.870–0.925) | 0.688 (0.537–0.813) | 0.932 | 0.445 | 0.086 | 0.227 | 0.573 |
| Validation | Comparisons, χ² | 1.290 | 1.433 | 1.500 | 1.500 | 1.089 | 4 | 4 | 4 | | |
| Validation | Comparisons, p | 0.863 | 0.838 | 0.827 | 0.827 | 0.896 | 0.406 | 0.406 | 0.406 | | |
| Cumulative cohort | CORE-COVID-19 | 0.880 (0.858–0.901) | 0.904 (0.889–0.919) | 0.669 (0.610–0.724) | 0.937 (0.924–0.949) | 0.562 (0.507–0.616) | 0.921 | 0.532 | 0.156 | 8 | 0.593 |
| Cumulative cohort | ISARIC-4C | 0.751 (0.720–0.781) | 0.794 (0.773–0.814) | 0.565 (0.504–0.624) | 0.909 (0.892–0.924) | 0.334 (0.291–0.379) | 0.847 | 0.28 | 0.214 | 12.6 | 0.359 |
| Cumulative cohort | CURB-65 | 0.735 (0.705–0.765) | 0.936 (0.923–0.948) | 0.295 (0.242–0.352) | 0.879 (0.862–0.894) | 0.458 (0.384–0.534) | 0.907 | 0.27 | 0.133 | 2 | 0.374 |
| Cumulative cohort | qSOFA | 0.676 (0.644–0.707) | 0.967 (0.957–0.975) | 0.209 (0.162–0.261) | 0.870 (0.853–0.885) | 0.537 (0.438–0.633) | 0.916 | 0.234 | 0.135 | 1 | 0.268 |
| Cumulative cohort | MEWS | 0.674 (0.640–0.708) | 0.850 (0.831–0.867) | 0.378 (0.320–0.438) | 0.882 (0.864–0.898) | 0.315 (0.266–0.368) | 0.865 | 0.21 | 0.147 | 3 | 0.258 |

  1. AUC, area under the receiver operating characteristic curve; CI, confidence interval; CORE-COVID-19, Collaboration for Risk Evaluation in COVID-19; CURB-65, score based on confusion, urea, respiratory rate, blood pressure, and age ≥ 65 years; EN, ensemble model; GBM, gradient boosting machine; ISARIC-4C, International Severe Acute Respiratory and emerging Infections Consortium Coronavirus Clinical Characterization Consortium; LR, logistic regression; MEWS, modified early warning score; NN, neural network; NPV, negative predictive value; PPV, positive predictive value; qSOFA, quick sequential organ failure assessment; SVM, support vector machine
  2. *F1 score = 2 × (positive predictive value × sensitivity)/(positive predictive value + sensitivity); ranges between 0 and 1, with higher values indicating better performance: a score of 0.8–0.9 indicates good performance and a score > 0.9 very good performance
  3. †Kappa = a measure of classification performance that adjusts accuracy for the agreement expected by chance; a score < 0 indicates no agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect agreement
  4. ¶Brier score = mean squared difference between observed and predicted outcomes, a measure of calibration; ranges from 0 to 1, with 0 representing the best and 1 the worst calibration
  5. **Youden index = sensitivity + specificity − 1 (both expressed as proportions); ranges from 0 to 1, with 1 representing a perfect test
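
The footnote definitions can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the kappa is assumed to be Cohen's kappa computed from a 2 × 2 confusion matrix (the footnote gives no formula), and the Brier and kappa inputs in the usage lines are toy values rather than study data.

```python
from typing import Sequence


def f1_from_ppv_sensitivity(ppv: float, sensitivity: float) -> float:
    """F1 = 2 * (PPV * sensitivity) / (PPV + sensitivity), as in the footnote."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden index with sensitivity and specificity given as proportions."""
    return sensitivity + specificity - 1.0


def brier_score(observed: Sequence[int], predicted: Sequence[float]) -> float:
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    if len(observed) == 0 or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    return sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from 2x2 counts (assumed to be the kappa meant in the footnote)."""
    n = tp + fp + fn + tn
    observed_agreement = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    expected_agreement = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)


# PPV and sensitivity of the NN model in the development cohort (from the table).
print(round(f1_from_ppv_sensitivity(0.913, 0.973), 3))  # 0.942

# Toy inputs, for illustration only.
print(youden_index(0.9, 0.6))
print(brier_score([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
print(cohen_kappa(tp=40, fp=10, fn=10, tn=40))
```

The first usage line reproduces the F1 score of 0.942 reported for the NN model in the development cohort from its rounded PPV and sensitivity. For other rows the recomputed value may differ in the third decimal place because the table inputs are themselves rounded.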