A probabilistic model of biological ageing of the lungs for analysing the effects of smoking, asthma and COPD

Background Although a large body of literature is available that describes the effects of smoking, asthma and COPD on lung function, most studies are restricted to a small age range and to one factor. As a consequence, available results are incomplete and often difficult to compare, also due to the ways the effects are expressed. Furthermore, current approaches consider one type of measurement only or several types separately. Methods We propose a probabilistic model that expresses the effects as number of years added to chronological age or, in other words, that estimates the biological age of the lungs. Using biological age as a measure of the effects has the advantage of facilitating the understanding of their severity and comparison of results. In our model, chronological age and other factors affecting the health status of the lungs generate biological age, which in turn generates lung function measurements. This structure enables the use of multiple types of measurement to obtain a more precise estimate of the effects and parameter sharing for characterization over large age ranges and of co-occurrence of factors with little data. We treat the parameters that model smoking habits and lung diseases as random variables to obtain uncertainty in the estimated effects. Results We use the model to investigate the effects of smoking, asthma and COPD on the TwinsUK Registry. Our results suggest that the combination of smoking with lung disease(s) has higher effect than smoking or lung disease(s) alone, and that in smokers, co-occurrence of asthma and COPD is more detrimental than asthma or COPD alone. Conclusions The proposed model or other models based on a similar approach could be of help in improving the understanding of factors affecting lung function by enabling characterizations over large age ranges and of co-occurrence of factors with little data and the use of multiple types of measurement. The software implementing the model can be downloaded at the first author’s webpage.


Introduction
Smoking, asthma and Chronic Obstructive Pulmonary Disease (COPD) are the primary risk factors for lung function impairment in adults. Their average effects on the lungs are commonly estimated by measuring reduction in spirometric values with respect to a population of healthy individuals [1][2][3][4][5][6][7]. Due to the difficulty of collecting large sample size data spanning the entire adulthood, most studies are restricted to a small age range and to one factor. As a consequence, overall ages and combined effects are reported only in a few studies or are still missing and results from multiple studies are often difficult to compare, also due to the ways the effects are expressed. Furthermore, current approaches consider one type of measurement only, or several types separately (mostly Forced Expiratory Volume in 1 second (FEV 1 ) or Forced Vital Capacity (FVC)) -a combined analysis of several types of measurement could potentially provide a more precise quantification of the effects. http://respiratory-research.com/content/14 /1/60 In this paper we address these issues by taking the viewpoint that reduced pulmonary function corresponds to premature ageing of the lungs: we propose a model that expresses average FEV 1 and FVC reduction in individuals that smoke and/or have asthma and/or COPD in terms of number of years that are added to the lungs, or, in other words, we propose a model that estimates biological ageing of the lungs.
Biological age has been studied mainly at the whole body level (see [8][9][10] for recent references). At the respiratory system level, it was first introduced in [11] as a potentially more powerful type of information than spirometric values in motivating smokers to quit. Since then, several studies have investigated this hypothesis [12,13], using as biological age of a smoker the chronological age of a non-smoker of same height, gender and average FEV 1 obtained from predictive equations. This approach was designed to estimate the specific effect of smoking on a single individual rather than the average effect on an entire population, which is the interest of this paper.
We propose a generative probabilistic approach that explicitly represents biological age using an unobserved random variable -an adjustment of chronological age induced by factors that have an impact on the health status of the respiratory system such as smoking habits, lung diseases, environmental and genetic factors, etc. Our generative approach enables us to integrate multiple aspects of the problem into a single consistent framework, which allows the use of multiple types of measurement as well as sharing of information and therefore estimation with little data. The probabilistic approach enables us to deal with uncertainty and noise in the data. Furthermore, it allows us to treat the parameters that model smoking habits and lung diseases as random variables and therefore to obtain uncertainty in the estimated effects of such factors on the lungs. We evaluate our model on a subset of the TwinsUK Registry [14]. The dataset contains FEV 1 and FVC measurements of several individuals along with information about smoking habits, asthma, COPD, and height. By examining the posterior distributions of the parameters that model the combinations of smoking, asthma and COPD, and the posterior distributions representing the biological age associated to each combination, we are be able to make general and age-specific quantitative statements about the effects of these factors.

Methods
The TwinsUK Registry is a cohort of about 12000 twins aged 16 to 100 years from all over the United Kingdom used to study heritability and genetics of age-related diseases. It includes clinical, physiological, behavioural and lifestyle data collected since 1992 either at visits to the Department of Twin Research at King's College London or via self-administered questionnaires. For historical reasons, it encompasses predominantly females in the age range 45-65 years.
For the study, we considered female individuals with spirometry data collected between 1992 and 2010 and with recorded height. Males were excluded as their number was too small to enable reliable estimation of model parameters.
The study was approved by the St Thomas' Hospital Research Ethics Committee, and all twins provided signed informed consent, in accordance with the Helsinki Declaration.

FEV 1 -FVC measurements
Spirometry tests (model 2150; Vitalograph; Buckingham, England) were performed during visits (up to five for each individual) to the department. During each test, three FEV 1 -FVC measurements were recorded and the one corresponding to maximum FEV 1 was selected. The measurements were included in the study if in normal range, identified as between 0.5 and 7.0 litres based on [15,16]. More information can be found in [17].

Smoking status
We considered the subset of individuals that responded consistently in different smoking-related questionnaires between 1992 and 2010 (maximum of 13 questionnaires and 52 types of question). For such individuals, only those FEV 1 -FVC measurements for which one of the following two conditions held were included in the study: • The individual reported to have never smoked either cigarettes, cigars or pipes in a questionnaire completed in the same (or a subsequent) year in which the measurement was recorded. • The individual reported to be a smoker in a questionnaire completed in the same year in which the measurement was recorded.
As the same condition was satisfied for all retained measurements from the same individual, an overallmeasurement non-smoker or smoker status could be assigned to each individual.

Asthma and COPD status
We considered the subset of individuals that responded consistently in different asthma-related questionnaires between 1992 and 2010 (maximum of 8 questionnaires and 4 types of question). Such individuals were classified as non-asthmatic if they reported to have never suffered from asthma and as asthmatic otherwise. Diagnosis by a doctor was not always explicitly required. A similar procedure was used to determine COPD status.
All possible combinations of smoking, asthma and COPD status give rise to 8 groups (see Table 1 where H http://respiratory-research.com/content/14/1/60 stands for healthy with respect to smoking, asthma and COPD). Only individuals of known combined status, and therefore group, were included in the study. In order to eliminate potential bias in estimating the effects of smoking, asthma and COPD due to correlation between twins and multiple visits, with the exception of Group H, we disregarded at random one twin for twins belonging to the same group and retained only the most recent FEV 1 -FVC measurement for individuals with multiple visits. Group H, which contains a considerable number of datapoints and should therefore not be heavily affected by this correlation, was excluded as accurate estimation of parameters b (see (3)) requires a large amount of data. These filtering steps are summarized in Table 2. The final dataset encompassed 4403 FEV 1 -FVC measurements taken from individuals in the age range 18.3-82.8 years (the age of an individual, calculated from birth date and date of measurement, is expressed in decimals of year by considering 365.25 days per year). The total number of measurements and the age range of each group are indicated in Table 2. The histogram representing the number of FEV 1 -FVC measurements available at different ages is given in Figure 1. The number of measurements available for Group H at age ranges 18-44, 45-64 and 65-83 is respectively 871, 2221 and 650.
Our classification does not take into account the degree of severity of asthma, COPD and smoking. Therefore, the estimated effects have to be interpreted as corresponding to the most likely degree. We are also limited by our definition of asthma and COPD, which potentially includes individuals with a self-reported diagnosis. Finally, whilst the definition of non-smoker and smoker is based on the year in which the FEV 1 -FVC measurement was taken, this is not the case for asthma and COPD, as we do not have precise timing information about these diseases. We nevertheless expect little error due to this as each individual answered the questionnaires multiple times.

Definition of biological age
Before describing the proposed model in details, we define biological age and highlight key points that guided us in the construction of the model. Figure 2 illustrates the concept of biological ageing for smokers (Group S) relative to the reference population of healthy individuals (Group H) based on FEV 1 . As we can see from the measurements (Figure 2(a)), smokers have on average lower FEV 1 than healthy individuals. This becomes clearer when looking at the measurement means ( Figure 2(b)), which are averages computed over an 11year sliding window to enforce smoothness over ages. For example, smokers' mean at age 60 (computed from age interval 55-65) is equal to that of healthy individuals at age 68. It is therefore reasonable to define smokers' biological age at chronological age 60 as approximately 68 years.  That is, biological age is defined to be the chronological age of the healthy population corresponding to the same lung function mean. This is the population level analogue of the individual level definition introduced in [11][12][13].
A straightforward approach to estimating biological ageing would be to compute differences in average FEV 1 decline between healthy individuals and smokers by fitting two separate lung function models (such a separate approach was used for example in [1,18]), and subsequently deduce biological ageing from these differences. We can use, for example, the model in [15] first proposed in [19], which is considered an accurate predictor of lung function in adults. In this model, the relationship between the log of the nth lung function measurement, l n , chronological age, a n , and height, h n , is given by the following equation: } is a set of unknown model parameters (modelling the log of the measurement, rather than the measurement, makes the model linear in b and therefore simplifies its estimation). By computing two separate sets b, one for healthy individuals and one for smokers, we can obtain the average FEV 1 decline for the two populations, as shown in Figure 2(c) for individuals of average height (1.62 metres). From such estimates we can deduce smokers' biological ageing, as shown in Figure 2(d).
This simple approach has several limitations. It cannot produce reliable estimates of b for the groups of small size (all groups other than Groups H and S). A single model of all groups in which some parameters are shared among them would alleviate this problem. Linear regression models that include factors such as smoking and lung disease as covariates, e.g. [20], have this property but are limited to additive combinations of effects.
Furthermore, it is not clear how to consider multiple types of measurement, such as FEV 1 and FVC, to obtain a more precise estimate of biological age. If two separate models for FEV 1 and FVC are fitted, the inferred biological ages need to be combined into a single estimate. Simply taking the average (as investigated in [11]) is not optimal as for example, for young ages for which differences between healthy individuals and smokers are absent in FVC (see Figure 3), only FEV 1 should be considered. An approach that estimates biological age from simultaneous modelling of FEV 1 and FVC would overcome this difficulty.
Finally, a probabilistic approach would better deal with noise in the data and would allow to obtain uncertainty in the estimated biological ages, which is particularly important when little amount of data is available.

A probabilistic model of biological age
Our approach to taking into account the observations above is to define a probabilistic model with an explicit unobserved random variable representing biological age. This variable is an adjustment of chronological age due to smoking habits, lung diseases, environmental and genetic factors, etc., namely all factors that have an impact on the health status of the respiratory system. Biological age combined with other factors that do not affect the health status of the respiratory system but heavily influence lung function measurements, namely height and measurement noise, generate FEV 1 and FVC.
More specifically, our probabilistic model is defined by the following equations: a n = u c n a n + v c n + n , n ∼ N (0, σ 2 a ), In these equations, l n is a two-dimensional column vector containing the log of the nth FEV 1 -FVC measurement (n indexes the measurement rather than the individual, as in Group H each individual can have more than one measurement), a n is the chronological age of the corresponding individual, h n is the height,ã n is the biological age, c n is a discrete variable representing the group to which measurement n belongs (c n ∈ {1, . . . , 8} corresponding to {Group H, Group A, Group C, Group AC, Group S, Group SA, Group SC, Group SAC}), and σ 2 a , b i (i = 1, . . . , 4) and l are unknown deterministic parameters.
Biological ageã n is generated as a group-dependent linear transformation of chronological age a n , u c n a n + v c n , with the addition of a Gaussian term n . The term n http://respiratory-research.com/content/14/1/60 represents the modification to chronological age that is specific to the nth measurement and not captured at the group level, and therefore also includes all unmeasured factors such as environmental and genetic factors.
Log-measurement l n is obtained as a nonlinear transformation of biological ageã n and height h n (of the same form as (1)), to which a Gaussian noise term η n is added. The term η n is drawn from a two-dimensional Gaussian with non-diagonal covariance matrix l , which accounts for the high correlation between FEV 1 and FVC. The parameters b i (i = 1, . . . , 4) are two-dimensional column vectors that model age-related decline of FEV 1 and FVC. They are estimated from healthy individuals only to ensure that they describe lung function decline in the absence of smoking, asthma and COPD. These parameters are common to all groups, which is crucial in enabling the inclusion of groups with a small number of available datapoints. The generative process induced by the model is depicted in Figure 4, where empty nodes indicate unknown quantities, whilst filled nodes indicate known quantities.
The linear transformation of chronological age contains both a slope u c n and an intercept v c n . The slope u c n determines the rate at which biological age changes with chronological age. Only positive values of u c n are to be expected as they indicate that biological age increases with chronological age: u c n = 1 indicates an increase rate of one year per year, whilst u c n > 1 (< 1) indicates an increase rate higher (lower) than one year per year. For example, Figure 2 p(ã n |a n , c n , u c n , v c n , σ 2 a ) = N (u c n a n + v c n , σ 2 a ) , where the symbol T indicates the transpose operator and I is the identity matrix. To simplify the notation, in the rest of the paper we omit conditioning on all quantities that are not treated as random, namely μ, , a n , c n , σ 2 a , h n , b, l , and therefore denote the three basic Gaussian density functions defining the model as p(u c n , v c n ), p(ã n |u c n , v c n ) and p(l n |ã n ).

Inference
In order to make deductions about the effects of smoking, asthma and COPD, we need to infer the posterior distributions of the group parameters given all N measurements, p(u j , v j |l 1 , . . . , l N ) (j = 1, . . . , 8), and the posterior distributions describing the biological age of each group at chronological age a, p(u j a + v j |l 1 , . . . , l N ). An analysis of p(u j , v j |l 1 , . . . , l N ) enables us to make general (summarized over all ages) statements about the groups: lack of or small overlap of some of these distributions indicates fundamentally different biological ageing of the corresponding groups. An analysis of p(u j a + v j |l 1 , . . . , l N ) enables us to make statements which are specific to age a.
As explained above, we treat u j and v j as a priori independent random variables with Gaussian distributions. The joint posterior distribution factorizes as where {l n |c n = j} denotes the subset of measurements belonging to group j. The factors p(u j , v j |{l n |c n = j}) have unknown analytical form, as the transformation from the biological age to the measurements (3) is nonlinear. We estimated them numerically and found that they are all indistinguishable from Gaussian density functions. As a consequence, we also found that p(u j a + v j |{l n |c n = j}) are Gaussian. A detailed explanation of how to estimate these posterior distributions is given in the Appendix

Results
In the next two sections we analyse the posterior distributions p(u j , v j |{l n |c n = j}) and p(u j a + v j |{l n |c n = j}) obtained when fitting the proposed model to our dataset. Figure 4 Probabilistic model of biological age. Generative process induced by our model. The two plate sections indicate that the enclosed structures are repeated for all 8 groups and N measurements. Combined smoking, asthma and COPD status c n (through parameters u c n , v c n and the addition of a noise term n representing influence of smoking habits, lung disease, and unmeasured factors such as environmental and genetic factors that are specific to the nth measurement) transforms chronological age a n into biological ageã n . Biological ageã n and height h n generate (through parameters b and the addition of a noise term η n representing measurement noise) lung function measurement l n . http://respiratory-research.com/content/14/1/60 Analysis of posterior distributions p(u j , v j |{l n |c n = j}) Figure 5(a) shows the contour plots of p(u j , v j |{l n |c n = j}): each ellipse is centred at the mean and encloses 95% of the distribution.
We can notice that the posterior distributions have different spread, depending on the combined effect of number and dispersion of measurements. For Group H (continuous-blue ellipse), the high number of available measurements makes the distribution highly peaked around u 1 = 1, v 1 = 0, despite the high dispersion at each age (see Figure 2(a) and Figure 3(a)). This highlights an important point about how to interpret the posterior distributions: they provide us with a measure of uncertainty on the estimated average biological ageing. Thus, even if dispersion at each age is high, the model can still be certain about the average biological age.
The major axes of the ellipses all have very similar directions, expressing the fact that increasing the slope u j requires decreasing the intercept v j and vice-versa. This means that samples from the posterior distributions give linear transformations of chronological ages intersecting at middle ages, as shown in Figure 5(b) for Groups H and SC. In other words, uncertainty about biological age is higher at young and old ages than at middle ages, which is what we would expect from the distribution of measurements shown in Figure 1.
With the exception of Group C (continuous-green ellipse) for which there is small overlap, unhealthy groups do not overlap with Group H indicating that biological ageing differs from chronological ageing.
If we consider Group A (continuous-red ellipse) versus Group SA (dashed-red ellipse), Group C versus Group SC (dashed-green ellipse), and Group AC (continuous-cyan ellipse) versus Group SAC (dashed-cyan ellipse), we can see that the ellipses do not overlap (considerably) and that the centre of the smoking ellipse is closer to the upperright corner than the centre of the non-smoking ellipse, which means that smoking in addition to having lung disease(s) induces significant increase in ageing with respect to having lung disease(s) alone. The fact that Group S (dashed-blue ellipse) does not overlap with Groups SA, SC and SAC and is closer to the lower-left corner signifies that this increase in ageing is not due to smoking alone but is a truly combined effect. We can therefore conclude that the combination of smoking with lung disease(s) has more severe effect on ageing than lung disease(s) alone. Lack of overlap despite the very small number of available measurements, which causes considerable spread of some of these distributions, makes us confident about this conclusion.
Comparison of Groups A and C with Group AC and comparison of Groups SA and SC with Group SAC reveal the effect of co-occurrence of asthma and COPD versus either disease. Unlike the non-smoking case for which the large overlap does not enable us to draw conclusions, in the smoking case the posterior distributions indicate substantial increase in ageing in the co-occurrence of the diseases.

Analysis of posterior distributions p(u j a + v j |{l n |c n = j})
Figure 6(a) shows the standard deviations of p(u j a + v j |{l n |c n = j}). As discussed above, the standard deviations, and therefore uncertainties about the estimated effects, are lower at middle ages for which more measurements are available. Figure 6 the posterior distributions p(u j a + v j |{l n |c n = j}) at ages 20, 45, 55, 65 and 80 years: the length between two starts equals 2×1.96 times the standard deviation. Figure 7 illustrates the behaviour of the posterior distributions every 5 years: each rectangle is centred at the mean and its length equals 2×1.96 times the standard deviation. From these figures we can see that, at the extreme ages of 20 and 80 years for which the standard deviations are higher, some of the general conclusions made in the previous section are no longer valid. More specifically, at age 20 there is considerable overlap between Groups A and SA, between Groups C and SC, and between Groups AC and SAC. Therefore, it is not possible to deduce from the posterior distributions that the combination of smoking with lung disease(s) has more severe effect on ageing than smoking or lung disease(s) alone at this early age. Similarly, we cannot make conclusions about co-occurrence of asthma and COPD versus either disease. At age 80, Groups AC and SAC are significantly different, as are Groups S and SAC, so that we can conclude that the combination of smoking with asthma-COPD (with asthma-COPD we indicate co-occurrence of asthma and COPD) has more severe effect on ageing. However, this is not the case for asthma and COPD alone. Furthermore, we cannot conclude that the combined effect of asthma and COPD is higher than the single effects. By looking at the other ages, we can see that the full set of statements made in the previous section is valid for the age range 50-60.
Notice that the difference between Groups H and S is already significant at age 30. This shows that at young ages the model is considering FEV 1 measurements only to determine smokers' biological age, as desired (see discussion of Figure 3

above).
This age-specific analysis has enabled us to determine at which ages the general statements about differences in groups made in the previous section are valid. However, it also reveals an important difference between younger and older ages, namely that, with the exception of Groups A and C, means distances of unhealthy groups from Group H are substantially higher at older ages. Thus the effects of most combinations of factors seem to increase with age.
In Table 3 we give the estimated number of years that are added to chronological age (means ±1× standard deviations) for the age range 45-64. From the table we can make a final interesting observation: at age 50 the effect of combined smoking with asthma-COPD seems more severe than additive. Indeed, when considering 1.96 times the standard deviation, the sum of the maximum numbers of years added to chronological age in Groups S and AC is 23.8, whist the minimum number of years added in Group SAC is 23.6.

Discussion
To date, biological age of the lungs has been used at the individual level to investigate its effectiveness in motivating smokers to quit. In this paper, we have used biological age of the lungs at the population level to analyse the average effects of smoking, asthma and COPD on the health status of the respiratory system. As for the individual level case, knowing how much older, on average, the lungs of individuals that smoke and/or have lung disease(s) look relative to the healthy population enables a more immediate understanding of the impact of these factors on the health status of the lungs. However, with this work we have shown that modelling lung function through biological age has additional benefits.
Such a modelling enables to properly combine multiple types of measurement to obtain a more precise estimate of the health status of the respiratory system. We have seen that our approach correctly deals with the case in which lung function differences are not evident in one type of measurement.
Such a modelling also enables parameter sharing for characterization over large age ranges and of co-occurrences of factors with little data. We obtained results that are in agreement with the literature (see the next section) using a small amount of data. Furthermore, we could compare cases that have not been previously analysed, as non-smokers with asthma and COPD versus smokers with asthma and COPD.
By treating the parameters that model smoking and lung diseases as random variables, we could obtain uncertainty in the estimated effects of such factors on the lungs.
Finally, such a modelling enables more immediate interpretation and comparison of results within and among different studies than approaches expressing effects in spirometric values. Whilst we did not show that in this paper, the following examples can clarify this point. Suppose that Studies A and B find that FEV 1 mean value at age 60 in the healthy population is respectively 2.75 and 2.5 litres, and that both studies find that FEV 1 mean value at age 60 in the smoking population is 2.25 litres. One has to consider the mean values of the healthy populations to understand that Study A estimates that smoking has a stronger effect than Study B. On the other hand, this would be immediately evident if biological age was used, since the estimated number of years added to chronological age in smokers in Study A would be higher than in Study B. As another example, consider investigating whether the effect of smoking on pulmonary function in females and males is different (published results on this subject are controversial [21][22][23][24][25][26][27]). Whilst our analysis was restricted to females, males can be easily included in the http://respiratory-research.com/content/14/1/60 model e.g. by having separate sets of parameters b, u and v so that only noise covariances are shared between genders. Similarly to the previous example, if spirometric values are compared as in current studies, the values of healthy males and females need to be considered to understand whether the impact of smoking is gender specific, whilst this is not the case if biological age is used, as biological age is a measure that is relative to the healthy population.
One limitation of the proposed model is that it does not account for longitudinal and twin structure, so that we had to exclude many datapoints from the analysis. We are currently investigating an extension that incorporates both types of structure by adding Gaussian terms which are shared across ages and twins.
The choice of modelling biological age as a linear transformation of chronological age, as defined in (2), was motivated by simplicity and supported by Figure 2(d). This figure indicates that smokers' biological age is well described as a linear transformation and makes it reasonable to expect that linear or piecewise linear http://respiratory-research.com/content/14/1/60 Figure 7 Posterior distributions p(u j a + v j |{l n |c n = j}) over all ages. Posterior distributions p(u j a + v j |{l n |c n = j}) for age a in the range 20-80 years at 5-year step-size. Each rectangle is centred at the mean and its length equals 2×1.96 times the standard deviation.
transformations should be valid transformations for the other groups too. As the size of our dataset was too small to enable reliable estimation of piecewise linear transformations, we restricted ourselves to linear ones. However, piecewise linear transformations would be worthy of investigation in studies in which more datapoints are available.
The form of nonlinearity in (3) enabled us to describe lung function decline in adulthood quite accurately whilst keeping the model relatively simple. However, it would be worthy to also consider the more flexible case in which the form is estimated, particularly when considering other types of measurement in addition/replacement to FEV 1 and FVC. Some work in this direction, specifically addressing complex lung function growth in young individuals, has been done in [18] and in [16], which proposed the model of [28]. We are currently investigating modelling lung function decline with Gaussian radial basis functions.
Treating b as deterministic rather than random enabled us to use simple numerical integration for inference, avoiding the need to develop more complex approximation schemes. It is reasonable to assume that a posterior on b would be highly peaked (as is the posterior of u 1 , v 1 , p(u 1 , v 1 |{l n |c n = 1}), computed from the same individuals) and therefore that this choice had minor impact on the estimated uncertainties.
Finally, we would like to notice that, whilst the proposed model can also provide single individuals with biological age, such a usage of the model would require a careful analysis on how to set the measurement noise covariance l , as the maximum likelihood approach used in this paper could be a suboptimal.

Conclusions
We have introduced a probabilistic model based on the concept of biological age to analyse the effects of smoking, asthma and COPD on female lung function. Our approach enabled us to make statements over large age ranges and about co-occurrence of factors with little data.  We have found that co-occurrence of smoking with asthma or COPD or combined asthma and COPD has more severe effect on ageing than smoking, asthma, COPD or combined asthma and COPD alone. This is in agreement with the findings in [29], that suggest that the rate of decline of lung function is faster in smokers with emphysema than in ex-smokers with emphysema. This is also in line with the results in [4,20,30], which show that smoking has a strong additional ageing effect on individuals with asthma. To the best of our knowledge, results on co-occurrence of smoking with combined asthma and COPD have not been previously reported.
We have also found that co-occurrence of asthma and COPD has a more detrimental effect on the lungs than asthma or COPD alone. This is in line with recent studies that indicate a reduced quality of life in individuals with both asthma and COPD with respect to individuals that have only either disease [31][32][33].
By analysing differences among ages, we could conclude that, with the exception of asthma and COPD alone, the effects of the combinations of factors increase with age and therefore are more severe at older ages. This is in agreement with other studies, for example [4], in which it is shown that the effects of smoking and combined smoking with asthma increase with age, whilst the effect of asthma is constant.
The software implementing the model can be downloaded at the first author's webpage.

Appendix
Below we describe how to estimate the model parameters b, σ 2 a and l and the posterior distributions p(u j , v j |{l n |c n = j}) and p(u j a + v j |{l n |c n = j}). In order to avoid underflow/overflow problems, computations were performed in log-scale.

Parameter learning
As explained above, the parameter set b was estimated from the healthy group (Group H) only to make sure that it describes lung function decline in the absence of smoking, asthma and COPD. We learned the two subsets of b corresponding to FEV 1 and FVC separately using ordinary least squares. We then fixed b and estimated parameters σ 2 a and l using an Expectation Maximization (EM) approach [34]. More specifically, the EM approach consisted of iterating the following two steps until convergence: where · p(·) denotes averaging with respect to p(·) and p(ã 1 , . . . ,ã N , u 1 . . . , u 8 , v 1 , . . . , v 8 |l 1 , . . . , l N ) is computed using the values of σ 2 a and l estimated in the previous iteration.
The part of the expectation of the complete data log-likelihood that depends on σ 2 a and l is given by j {n|c n =j} log p(l n |ã n ) p(ã n |{l n |c n =j}) + log p(ã n |u j , v j ) p(ã n ,u j ,v j |{l n |c n =j}) .
We excluded the parameter set b from the EM approach as we found that otherwise the nonlinearity in FEV 1 and FVC decline with age of healthy individuals would be transferred to the biological age (through high noise variance σ 2 a ) so that b would not represent normal lung function decline. http://respiratory-research.com/content/14/1/60

M-Step: Updates for σ 2 a
Setting to zero the derivative of (4) with respect to σ 2 a j {n|c n =j} ∂ log p(ã n |u j , v j ) ∂σ 2 a p(ã n ,u j ,v j |{l n |c n =j}) we obtain the optimal σ 2 a σ 2 a = 1 N j {n|c n =j} (ã n ) 2 + u 2 j (a n ) 2 + v 2 j − 2 ã n u j a n − 2 ã n v j + 2 u j v j a n , where the required moments are estimated as explained below.

E-Step: Inference
From this distribution, the moments required for the parameter updates, namely ã n , (ã n ) 2 , (ã n ) 3 , (ã n ) 4 , ã n u j , ã n v j , u 2 j , v 2 j and u j v j , are computed by numerical integration.

Approximation
The EM approach for learning σ 2 a and l described above is time consuming. A comparison of this approach with an approximation in which u j and v j are considered as deterministic did not show any difference in the learned values of σ 2 a and l . We therefore used this approximation for the presented results. http://respiratory-research.com/content/14/1/60 In this alterative approach, the updates for σ 2 a and l in the M-Step are similar to the ones above in which the optimal values of u j and v j are used and p(ã n , u j , v j |{l n |c n = j}) becomes p(ã n |l n ), computed as p(l n |ã n )p(ã n )/ ã n p(l n |ã n )p(ã n ).
The optimal values of u j and v j are learned by setting to zero {n|c n =j} ∂ log p(ã n ) ∂u j p(ã n |l n ) ∝ {n|c n =j} ã n p(ã n |l n ) − u j a n − v j a n , {n|c n =j} ∂ log p(ã n ) ∂v j p(ã n |l n ) ∝ {n|c n =j} ã n p(ã n |l n ) − u j a n − v j , that is, by solving the following linear system: {n|c n =j} (a n ) 2 {n|c n =j} a n {n|c n =j} a n N j u j v j = {n|c n =j} ã n p(ã n |l n ) a n {n|c n =j} ã n p(ã n |l n ) , where N j indicates the number of measurements belonging to Group j.

Computing the effects of smoking, asthma and COPD
The posteriors distributions p(u j , v j |{l n |c n = j}) can be computed from (5) by numerical integration overã n . The posteriors distributions p(u j a + v j |{l n |c n = j}) can be computed from p(u j , v j |{l n |c n = j}) using the formula of linear transformation of random variables and numerical integration. However, as we found numerically that p(u j , v j |{l n |c n = j}) are Gaussian, p(u j a + v j |{l n |c n = j}) can be computed more simply using the formula of linear transformation of Gaussian random variables. A transformation of p(u j , v j |{l n |c n = j}) was performed to correct the small deviation of the mean of p(u 1 , v 1 |{l n |c n = 1}) from (1,0).