Host lung gene expression patterns predict infectious etiology in a mouse model of pneumonia
© Evans et al. 2010
Received: 8 April 2010
Accepted: 23 July 2010
Published: 23 July 2010
Lower respiratory tract infections continue to exact unacceptable worldwide mortality, often because the infecting pathogen cannot be identified. The respiratory epithelia provide protection from pneumonias through organism-specific generation of antimicrobial products, offering potential insight into the identity of infecting pathogens. This study assesses the capacity of the host gene expression response to infection to predict the presence and identity of lower respiratory pathogens without reliance on culture data.
Mice were inhalationally challenged with S. pneumoniae, P. aeruginosa, A. fumigatus or saline prior to whole genome gene expression microarray analysis of their pulmonary parenchyma. Characteristic gene expression patterns for each condition were identified, allowing the derivation of prediction rules for each pathogen. After confirming the predictive capacity of gene expression data in blinded challenges, a computerized algorithm was devised to predict the infectious conditions of subsequent subjects.
We observed robust, pathogen-specific gene expression patterns as early as 2 h after infection. Use of an algorithmic decision tree revealed 94.4% diagnostic accuracy when discerning the presence of bacterial infection. The model subsequently differentiated between bacterial pathogens with 71.4% accuracy and between non-bacterial conditions with 70.0% accuracy, both far exceeding the expected diagnostic yield of standard culture-based bronchoscopy with bronchoalveolar lavage.
These data substantiate the specificity of the pulmonary innate immune response and support the feasibility of a gene expression-based clinical tool for pneumonia diagnosis.
Pneumonias result in substantial mortality, causing more premature death and disability worldwide than any other disease. Unfortunately, while patient survival depends upon the rapid identification of infecting pathogens, the means for prompt and accurate diagnoses of pulmonary infections remain inadequate.
Despite widespread acceptance as the diagnostic tool of choice for unexplained pulmonary infiltrates [3–5], fiberoptic bronchoscopy with bronchoalveolar lavage (BAL) provides an unambiguous diagnosis in only 25-51% of cases [2, 4, 6–9]. The diagnostic utility of BAL is predicated on culturing pathogens from lavage effluent, without accounting for ongoing antibiotic therapy, non-pathogenic microbial colonization, or the technical challenge of navigating the bronchoscope into involved airways. Molecular techniques, such as antigen detection and polymerase chain reaction (PCR) testing, enhance BAL sensitivity for a subset of pathogens, but still often fail to explain infiltrates.
Although the lungs are often regarded as passive gas exchange barriers, their active responses are critical to protection from infections. In the presence of inflammatory stimuli, the respiratory epithelia rapidly recruit inflammatory cells and undergo remarkable structural and functional changes [10–13], including the release of pathogen-specific antimicrobial products [14–16].
Even in the absence of an adaptive immune system, lower metazoans like Drosophila melanogaster selectively respond to different classes of microorganisms following pathogen detection with conserved pattern recognition receptors [17, 18]. Similarly, stereotyped pathogen-specific host innate immune responses are also observed from human dendritic cells, human monocytic cells [20–24], human endothelial cells, murine microglial cells, and murine jejunal epithelial cells. Based upon these multiply observed tailored responses and the inflammatory capacity of pulmonary epithelium [12, 28], we hypothesized that the lungs also respond selectively to different pathogens. In order to pursue the potential to achieve superior diagnostic utility in a timely manner, we interrogated this selective response to determine the etiology of pneumonias without reliance on culture data.
Unless otherwise specified, reagents were obtained from Sigma (St Louis, MO). All experiments were approved by the M. D. Anderson Cancer Center Institutional Animal Care and Use Committee. Specific pathogen free BALB/c mice were purchased from Harlan (Indianapolis, IN) and used in experiments at five to eight weeks old.
To achieve simultaneous exposure of large numbers of mice to respiratory pathogens, mice were placed in a nebulization chamber that was sealed except for an efflux limb that vented to a low resistance filter in a biohazard hood. An AeroMist CA-209 compressed gas nebulizer (CIS-US, Inc., Bedford, MA) was used to aerosolize pathogen suspensions, driven by 10 L/min of room air and supplemented with 5% CO2 to promote maximal ventilation and homogeneous exposure throughout the lungs, as we have previously described [29–32]. While it is conceivable that exposure of mice to increased inspired CO2 concentrations might alter gene expression, our experience supports prior reports that this promotes pathogen deposition in the lungs [33, 34], and our strategy involves differential gene expression analysis where all mice are exposed to the same CO2 environment, thus no differential effects should be detected.
For bacterial pathogens, the inocula were targeted to an LD75 by 48 h after infection. After growth to log phase, Streptococcus pneumoniae serotype 4 and Pseudomonas aeruginosa strain PA103 were each suspended in phosphate buffered saline (PBS) and delivered by aerosol. A standardized nebulization of 10 ml pathogen suspension over one hour to achieve the desired lethality required concentrations of approximately 1 × 10¹⁰ CFU/ml of S. pneumoniae and approximately 1 × 10¹¹ CFU/ml of P. aeruginosa, as we have previously described [29, 31].
Because Aspergillus fumigatus is not lethal in non-immunosuppressed BALB/c mice, we delivered the maximal reproducible concentration of organisms as limited by viscosity. This dose was 1 × 10⁹ conidia/ml, as determined using a standard hemacytometer. Conidia of strain Af293 were stored as frozen stock (1 × 10⁹ conidia/ml) in 20% glycerol in PBS. One ml of stock was plated on yeast extract agar plates at 37°C in 5% CO2 for 3 days, then harvested by gentle scraping in PBS containing 0.1% Tween-20, and the suspension was filtered through 40 μm filters, centrifuged at 2,500 × g for 10 min, washed, resuspended in 10 ml PBS and aerosolized over 60 min, identical to the bacterial infections. To confirm both pulmonary deposition and infective capacity of the pathogen, additional mice were challenged with the same A. fumigatus protocol with or without prior cyclophosphamide and cortisol immunosuppression, as previously described.
A sham intervention group was treated with 10 ml PBS nebulized over 60 min under the same conditions used for infectious challenges.
At designated time points after infection, mice were anesthetized and their tracheas were exposed. BAL was performed and lavage effluent cytokine concentrations were determined by ELISA, as described [29, 30].
At designated time points after infection, gene expression microarray analysis was performed on lung homogenates from mice after challenge following leukoreduction by repeated BAL and vascular perfusion with sterile PBS [31, 32]. Lungs were excised and homogenized, total RNA was extracted, and amplified cRNA was hybridized to Illumina Sentrix Mouse-6 BeadChips (Illumina, Inc., San Diego, CA). All primary data were deposited at the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/, accession GSE15869) consistent with MIAME standards (see Additional File 1).
To test the predictive ability of the gene expression data, three blinded investigators (SEE, MJT, BFD) were independently challenged to identify the infectious conditions based on gene expression patterns without reliance on culture data. After identifying characteristic changes for each condition in the gene expression analysis, the investigators were provided the data from only six transcripts that were each believed to be uniquely altered by one of the potential infectious conditions. In order to identify potentially discriminating transcripts and to assign cutoff values for a diagnostic panel we used two approaches. First, after confirming that there was no overlap of signal intensity between 2 standard deviations below a differentially upregulated gene and 2 standard deviations above the next highest condition for that transcript, we assigned a cutoff value for a positive test at 1 standard deviation below the mean signal intensity for the transcript in question. As a second approach, we created receiver operating characteristic (ROC) curves for each potentially discriminating transcript, selected from the list of differentially expressed genes. Two potentially predictive genes for each of the three infections were hand-selected for the panel, and the investigators were instructed to predict the pathogen based on the prestated rules (Additional File 2). Investigators were instructed to infer that a sample was from the sham group if the values did not meet criteria for one of the infections.
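The first cutoff approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the sample-standard-deviation choice, the use of "greater than or equal to" for a positive test, and all signal intensities are assumptions made for the example.

```python
from statistics import mean, stdev

def cutoff_for_transcript(target, next_highest):
    """Return a positive-test cutoff, or None if the groups overlap.

    target:       signal intensities in the condition that upregulates the transcript
    next_highest: signal intensities in the condition with the next highest signal
    """
    t_mean, t_sd = mean(target), stdev(target)
    n_mean, n_sd = mean(next_highest), stdev(next_highest)
    # Separation check: 2 SD below the upregulated group must still lie
    # above 2 SD over the next highest group.
    if t_mean - 2 * t_sd <= n_mean + 2 * n_sd:
        return None
    # Cutoff for a positive test: 1 SD below the upregulated group's mean.
    return t_mean - t_sd

def is_positive(signal, cutoff):
    # Assumption: a signal at or above the cutoff counts as positive.
    return signal >= cutoff

# Hypothetical intensities, for illustration only.
separated = cutoff_for_transcript([900, 1000, 1100], [90, 100, 110])
overlapping = cutoff_for_transcript([300, 400, 500], [200, 250, 300])
print(separated, overlapping)
```

In this sketch a transcript whose groups overlap is simply rejected as a panel candidate, which mirrors the requirement that non-overlap be confirmed before a cutoff is assigned.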
A computer algorithm was devised to automate the prediction of infecting organisms, based on the 18 h microarray data described above. The predictive model is a decision tree, with the first branch a decision between lungs infected with a bacterial pathogen and those not infected with bacteria. The sequential decisions are between S. pneumoniae and P. aeruginosa in the bacteria branch and between A. fumigatus and sham in the non-bacterial branch. Transcripts with predictive power to discern between branches were identified by fitting a linear model for each transcript, then the infectious condition of each blinded sample was sequentially predicted based on the expression of 1 to 21 discrete transcripts, with each transcript "voting" for one side of the decision tree (e.g., predicting either "bacterial" or "not bacterial"). To avoid ties when using majority vote rule, only odd numbers of predictor genes were allowed (see Additional File 1).
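The structure of this voting decision tree can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each transcript votes by comparison with a fixed cutoff (the study selected transcripts by fitting a linear model per transcript, which is not reproduced here), and the transcript names, cutoffs, and expression values are hypothetical.

```python
def majority_vote(sample, predictors):
    """Return the label chosen by most predictor transcripts.

    predictors: list of (transcript, cutoff, label_if_high, label_if_low).
    """
    if len(predictors) % 2 == 0:
        raise ValueError("use an odd number of predictors to avoid ties")
    votes = {}
    for transcript, cutoff, high_label, low_label in predictors:
        label = high_label if sample[transcript] >= cutoff else low_label
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def classify(sample, tree):
    # First branch: bacterial vs. non-bacterial; second branch: the organism.
    branch = majority_vote(sample, tree["root"])
    return majority_vote(sample, tree[branch])

# Hypothetical transcripts and cutoffs, for illustration only.
tree = {
    "root": [
        ("geneA", 5.0, "bacterial", "non-bacterial"),
        ("geneB", 4.0, "bacterial", "non-bacterial"),
        ("geneC", 6.0, "bacterial", "non-bacterial"),
    ],
    "bacterial": [("geneD", 3.0, "S. pneumoniae", "P. aeruginosa")],
    "non-bacterial": [("geneE", 2.0, "A. fumigatus", "sham")],
}

sample = {"geneA": 7.0, "geneB": 4.5, "geneC": 1.0, "geneD": 1.0, "geneE": 0.5}
print(classify(sample, tree))
```

Here two of the three root transcripts vote "bacterial", so the sample moves down the bacterial branch even though one transcript disagrees; requiring an odd number of voters guarantees that a two-way vote cannot tie.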
Over the course of 12 to 18 h, the total number of DEGs at FDR of 1 × 10⁻⁷ decreased to 367, but even greater condition-specific clustering was observed than at 6 h. Of these 367 DEGs, 179 were differentially expressed at both 6 h and 18 h time points. Notably, while the total number of DEGs decreased over time, the average fold-change of the remaining DEGs was generally increased.
Figure 3B demonstrates the temporal effect on gene expression in this model. The 30 most strongly differentially expressed genes at 18 h were analyzed at earlier time points, revealing progressive intensification of the gene expression patterns. Of these 30 DEGs, 18 were also differentially expressed at 6 h.
Manual review of the 367 DEGs identified unique transcript changes for each pathogen that were included in a predictive panel. Strategies using either the magnitude of differential expression or ROC curve performance were equally efficacious for defining the prediction rule cut-off values. As shown in Additional File 4, each included transcript yielded a cutoff that achieved 100% sensitivity and 100% specificity in the 18 h training set (i.e., area under the ROC curve = 1.0). Additional Files 2 and 5 show the panel of transcripts, the prediction rules, the data provided to the blinded investigators, and their predictions. Blinded review of 18 samples at 18 h after infection resulted in 100% correct categorization of infectious conditions for all three reviewers.
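The link between a cutoff with 100% sensitivity and specificity and an area under the ROC curve of 1.0 can be illustrated with a small Python sketch. It relies on the standard interpretation of the area as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative sample; the function name and the signal values are hypothetical and are not taken from the study data.

```python
def roc_auc(positives, negatives):
    """Area under the ROC curve from pairwise comparisons.

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher signal; ties count as half.
    """
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical intensities: complete separation versus partial overlap.
print(roc_auc([8, 9, 10], [1, 2, 3]))
print(roc_auc([2, 9, 10], [1, 3, 4]))
```

When every positive sample exceeds every negative sample, any cutoff placed between the two groups classifies all samples correctly, and the area is exactly 1.0; any overlap lowers it.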
When we applied these prediction rules to 18 unique samples from a validation dataset, however, the prediction accuracy dropped to only 44.4%. As shown in Additional File 6, Additional File 7 and Additional File 8, there was congruity of the blinded investigators insofar as samples were most often either correctly predicted by all three investigators or incorrectly predicted by all three investigators. No statistically significant patterns emerged among the incorrect predictions.
Since a small panel of hand-selected transcripts predicted infectious conditions as well as or better than traditional cultures historically perform, we sought to automate the process of prediction. We devised a multiply branching decision tree algorithm that first separated bacterial infections (S. pneumoniae and P. aeruginosa) from non-bacterial conditions (A. fumigatus and sham). We identified 4,799 transcripts from the training set that could distinguish these two groups. Using our predetermined criteria for predictor transcripts, we found that Ccl4 (chemokine C-C motif ligand 4) performed most robustly, correctly classifying all training set samples as bacteria or non-bacteria. We also found individual transcripts with very high predictive accuracy for subsequent branches of the decision tree. Ccl3 (chemokine C-C motif ligand 3) expression always separated S. pneumoniae infection from P. aeruginosa in the training set. A single gene, Ttn (titin), discriminated between A. fumigatus and sham in 90% of the samples, with all but one sample accurately categorized by the transcript. Notably, the sham sample that was inaccurately categorized as A. fumigatus by Ttn was also predicted to be A. fumigatus using multiple other transcripts, and the overall gene expression profile appeared more consistent with A. fumigatus than sham. This raises the possibility that the mouse was inadvertently or incidentally infected with fungus. If true, the Ttn-based categorization would be 100% correct for this branch point, as well.
The decision tree model was then tested against a unique (validation) set of gene expression data from lung homogenates collected 18 h after challenge. Using the same algorithm, the correct prediction of bacterial vs. non-bacterial status was made with 89% accuracy with 15 predictor genes and with 94.4% accuracy with 21 predictors (Figure 5B). Discrimination of S. pneumoniae vs. P. aeruginosa and of A. fumigatus vs. sham was achieved with >70% accuracy (Figure 5C and 5D). We again found that increasing the number of "voting" transcripts improved accuracy, with stabilization around 15 transcripts. The effect of adding additional predictor transcripts was minimal for separating the bacterial conditions from each other, but increasing from 3 to 15 transcripts correctly reclassified several samples from A. fumigatus to sham.
The informative value of host responses is increasingly recognized to differentiate between clinically confounding conditions. Markers of generic inflammation have been used for decades to hint at the presence of inflammatory and infectious diseases [36, 37]. More recently, host response elements have been studied to aid identification of life-threatening diseases, such as sTREM and procalcitonin in respiratory infections and sepsis [38–41]. Efforts are underway to characterize pulmonary conditions as diverse as interstitial lung diseases, pulmonary vascular diseases and asthma based on gene expression analysis [42–46]. Diagnostic host responses to Mycobacterium tuberculosis are increasingly described [47–49]. Differential gene expression has been reported in the lungs following different infections and gene expression profiling of leukocytes has been proposed to provide prognostic insights in the setting of lung infection. However, to the best of our knowledge, this report is the first to describe a means of identifying etiologic agents of infectious pneumonia based solely on the host gene expression response.
Because of the potential ease of sampling and abundance, we first sought to discriminate between infectious conditions based on BAL cytokine levels. Using a panel of 16 cytokines, P. aeruginosa-infected mice were consistently differentiated from the other three conditions. This is consistent with the recent report of McConnell and colleagues who found that a panel of 18 cytokines could discriminate P. aeruginosa- from S. pneumoniae-infected mice. However, while we identified a robust cytokine signature for one pathogen, we were unable to discern between the non-pseudomonal conditions by that method. Further proteomic analysis for non-cytokine host response elements may discriminate between the conditions, but our prior experience resolving low abundance peptides from BAL fluid suggests that the technical challenges would offset the enhanced diagnostic capacity. Therefore, we elected to investigate host response specificity using gene expression analysis.
Our gene expression data suggest that host responses are sufficiently specific to discriminate between conditions that may be indistinguishable, such as different infectious pneumonias. While there appears to be a modest early peak of non-specific inflammation, we were surprised to identify such efficient discrimination by as early as 2 h after challenge. By 6 h after challenge, there was a robust response that waned in number of DEGs by 12 h, but clearly increased in signal amplitude of the persisting transcript changes. This durable signal increased to the 18 h time point and allowed for consistent blinded diagnoses. Remarkably, fewer than 10% of the 367 DEGs at 18 h were induced by more than one infectious condition (none by all three). Further, we found no evidence that the different infections simply induced the same gene expression patterns at different paces; rather, each condition resulted in a unique gene expression profile. These findings attest to the high specificity of the host response. While the number of Aspergillus-regulated transcripts was low compared to the bacteria-induced DEGs, these findings are consistent with the findings of DeGregorio et al. and of Huang et al. when investigating fungus-induced gene expression in Drosophila and in human dendritic cells, respectively. Based on these results, we hypothesize that human lung gene expression patterns on clinical biopsy specimens will demonstrate similar specificity.
In order to systematize the otherwise subjective process of pattern identification and to automate the process for efficiency, we devised a computerized algorithm to test whether gene expression data could predict subjects' infectious states. From a practical perspective, this strategy allowed simultaneous assessment of massive numbers of transcript permutations. More importantly, it provided diagnostic accuracy far better than that typically encountered clinically with traditional culture-based diagnostic strategies, and outperformed diagnostic predictions based on gene expression of hand-selected transcripts.
The algorithm was intentionally structured as a decision tree. This allows for determination of the most relevant questions first, for sequentially increasing refinement of answers, and for the flexibility to add new branch points. In this case, based on differences in available treatment options, we felt the most clinically important issue was to differentiate subjects with bacterial pneumonia from those without bacterial pneumonia. The program provided great accuracy in answering this question. The model was also robust for the secondary questions, though less so.
Typical of preliminary investigations, these data have limitations to their generalizability. Comparison of three organisms from different pathologic classes makes it impossible to know whether the effects observed are species-specific or broader effects of the group. This will be assessed in future comparisons to other members of the same classes. By design, the decision tree algorithm allows for exactly this type of modification. It is also possible that some of the gene expression changes observed in the A. fumigatus-infected animals may represent the effects of their immunosuppression. Given the clinical focus of our cancer center, future studies of potential drug effects will be a high priority.
Another advantage of interrogating gene expression profiles in suspected pneumonia is that it allows somewhat compartmentalized analyses of different cellular elements of the host response. Because the lungs were leukoreduced by bronchoalveolar lavage and vascular perfusion, the data presented here largely reflect responses of the epithelium. Expression patterns from simultaneously harvested alveolar macrophages will be separately analyzed and presented. Although the cellular purity is incomplete, this approach may be viewed as a preliminary model of the clinical situation where RNA can be separately obtained from epithelial cells by brushing and from alveolar macrophages by BAL. Such discrete analyses may be applied to identifying etiologies of other pulmonary conditions, as well.
It could be argued that the samples were harvested sooner after initiation of infection than would be clinically possible. However, our model causes diffuse and uniform infection of the lungs, whereas clinical pneumonia generally begins with a localized infection that progresses spatially and temporally. Therefore, a clinical specimen harvested from the most recently involved lung segments will also be newly infected. Further, our observations of increasing signal intensity over time suggest that a durable diagnostic pattern will be identifiable at later stages. This will require confirmation in future, longer term studies.
The early and accurate diagnosis of the etiology of pneumonia would be of great clinical benefit. These findings suggest that it may be feasible to harness the host response to inform clinicians of a patient's infectious state when pneumonia is suspected. We anticipate that this will allow for development of a clinically-relevant tool, as well as providing new insights into differences between normal and ineffective host responses to infections.
The authors wish to thank Dr. Molly Bray, Baylor College of Medicine, for her assistance with the performance of the microarray studies.
This work was supported by the National Institutes of Health [KL2 RR02419] and institutional funds from the University of Texas M. D. Anderson Cancer Center to Dr Evans. Bioinformatics resources for this work were supported by a Cancer Center Support Grant [P30 CA016672] to the University of Texas M. D. Anderson Cancer Center. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.