Despite the description of COPD subtypes more than 40 years ago and substantial progress since then in understanding COPD-related phenotypes[33, 34], only a few attempts have been made to use statistical methods to define novel COPD subtypes[15, 16]. Using a large, well-characterized set of subjects with severe emphysema, we demonstrate the potential utility of using statistical learning methods to find relationships among phenotypic and genotypic characteristics to elucidate disease heterogeneity.
Several studies have attempted to address issues of disease heterogeneity in obstructive airway diseases. Statistical learning techniques such as factor analysis have been used to reveal novel insights into characteristics such as dyspnea or inflammation in COPD[20, 35–37]. Cluster analysis has confirmed classic chronic bronchitis and emphysema subtypes or illustrated overlap of characteristics of COPD and asthma, and a combination of factor analysis and cluster analysis has defined asthma subtypes. These techniques show promise in identifying disease subtypes (subsets of subjects), or intermediate disease-related phenotypic characteristics (endotypes/endophenotypes). Endophenotypes have already been of substantial utility in genetic association studies in psychiatry.
To date, however, there has been limited use of disease subtypes in genetic association studies in COPD. Investigators have tested for specific associations with classic subtypes[11, 40, 41], or with specific disease-related phenotypic characteristics such as emphysema distribution or functional measures. Factor analysis has been used to demonstrate differences in heritability of components of asthma. Cluster analysis is frequently used in gene expression studies, and such analyses have been used to define subtypes, though these subtypes have not always been clearly associated with the available clinical characteristics. Our study demonstrates the potential utility of statistical learning methods in the heterogeneous syndrome of COPD.
Our cluster analyses identified four subtypes of subjects in this cohort with severe emphysema: 1) emphysema predominant, 2) milder severity, bronchodilator-responsive, 3) discordant lung function/CT emphysema and airway severity, and 4) airway predominant. Some of the phenotypic associations in these groups, such as a lower BMI with more severe quantitative CT emphysema, have been previously seen[13, 45], while others, such as a higher bronchodilator responsiveness in the group with higher FEV1, differ from previous reports[46, 47]. The association of the nonsynonymous Leu10Pro TGFB1 SNP rs1800470 with cluster 1 is consistent with a previously reported association with apical emphysema in this cohort, and an association of this SNP with reduced lung function has also been seen in a Japanese emphysema cohort. Notably, this SNP has been demonstrated to be of functional significance, with the G allele (C on the reverse strand) resulting in increased production of TGFB1. Several studies have demonstrated an increase in TGFB1 both in the lung[50–52] and in plasma in subjects with COPD, as well as a relationship between TGFB1 levels and lung function, though the relationship between these findings and the rs1800470 genotype is not entirely clear.
Conversely, most of the previously reported SNP associations with COPD-related phenotypic characteristics did not demonstrate associations with our clusters. Nonsignificant findings could be due to loss of power from categorical cluster assignment and the resulting small sample size, and to the use of an omnibus test for genetic association. More importantly, our analysis attempts to determine whether genetic variants lead to a subtype of COPD subjects who share a set of phenotypic characteristics; as such, it does not attempt to determine the specific genotypic-phenotypic variables whose relationship leads to a significant association. Whether one of these approaches (association analysis with individual phenotypic characteristics, or with subtypes of subjects) is superior in identifying replicated genetic associations, or whether the approaches are separately informative, remains to be seen.
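As a rough illustration of the omnibus approach mentioned above, the following sketch computes a single Pearson chi-square statistic across a genotype-by-cluster table of counts. The counts, the function name `omnibus_chi_square`, and the choice of a Pearson chi-square test are assumptions made for illustration only; they are not the data or the specific test used in this study.

```python
def omnibus_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of genotype and cluster.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = three genotypes of one SNP, columns = clusters 1-4.
example_counts = [
    [30, 20, 25, 15],
    [40, 35, 30, 30],
    [10, 20, 15, 25],
]
stat, dof = omnibus_chi_square(example_counts)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

Because one statistic with (rows - 1) x (columns - 1) degrees of freedom spreads any signal across all clusters, a test of this kind may have less power than a test directed at a single cluster or a single quantitative trait.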
Our study has several strengths. First, we used relatively unbiased methods, in both factor analysis and cluster analysis, to select uncorrelated variables and determine severe COPD subtypes using the rich set of phenotypic and quantitative measures available in NETT. Second, our analysis is the largest reported cluster analysis using CT phenotypic variables. Third, despite our homogeneous study population, we were able to discern emphysema subtypes, which differed on variables not used to perform clustering. Although the full set of four subtypes has not previously been identified, our emphysema-predominant and airway-predominant clusters are consistent with a priori defined subtypes used in previous studies. Importantly, recent evidence shows that airway wall thickening and emphysema aggregate independently in families of individuals with COPD, suggesting that recognizing these differences may be important for discovering genetic associations.
Our results should be regarded as exploratory for several reasons. First, our dataset was based on available NETT data. Specific relationships between variables (for example, the high correlation between apical and total emphysema) may be due to selection biases of the NETT population. NETT subjects were likely biased towards those without predominant airway disease, and CT scans were suboptimal for assessment of airway wall remodeling due to the thicker slices associated with pre-MDCT (multi-detector CT) imaging. Similarly, our genotypic data were limited to a pre-specified subset of previous positive associations in candidate genes, and our cohort was limited to those enrolled in the NETT Genetics Ancillary Study (Additional File 1, Table S1). Our selection of phenotypic and genotypic variables for inclusion was strongly influenced by the limitations of available data, and decisions were made based on clinical judgement of relevance.
Second, our analysis found that the separation of clusters was weak, indicating segmentation rather than a true separation of these subtypes using clustering. Correspondingly, we found no strong evidence of smaller groups of more distinct subtypes. Furthermore, the small size of our clusters limits the power of association analysis, and our association with rs1800470 was not corrected for multiple comparisons. Given these limitations in this relatively homogeneous cohort, these specific subtypes should be validated using these or similar methods in other well-phenotyped COPD cohorts. Using a more heterogeneous and less selected group of subjects, in combination with improved radiographic measures, may result in more pronounced and distinct subpopulations.
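One common way to quantify how weakly or strongly clusters are separated is the average silhouette width, sketched below. This is offered only as an illustration: the silhouette measure, the Euclidean distance, the function names, and the toy data are all assumptions, and this passage does not state which measure of separation was used in the study.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette width over all points.

    Values near 1 suggest well-separated clusters; values near 0 or below
    suggest overlapping clusters (segmentation rather than true separation).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical two-variable data with the same cluster labels in both cases.
labels = [0, 0, 1, 1]
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
overlapping = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (1.5, 0.0)]

print("separated:  ", round(mean_silhouette(separated, labels), 3))
print("overlapping:", round(mean_silhouette(overlapping, labels), 3))
```

In a validation cohort, a higher average width for the same number of clusters would be one indication that the subtypes are more distinct there.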