A rapid NGS strategy for comprehensive molecular diagnosis of Birt-Hogg-Dubé syndrome in patients with primary spontaneous pneumothorax

Background Primary spontaneous pneumothorax (PSP) or pulmonary cysts is one of the manifestations of Birt-Hogg-Dube syndrome (BHDS) that is caused by heterozygous mutations in FLCN gene. Most of the mutations are SNVs and small indels, and there are also approximately 10 % large intragenic deletions and duplications of the mutations. These molecular findings are generally obtained by disparate methods including Sanger sequencing and Multiple Ligation-dependent Probe Amplification in the clinical laboratory. In addition, as a genetically heterogeneous disorder, PSP may be caused by mutations in multiple genes include FBN1, COL3A1, CBS, SERPINA1 and TSC1/TSC2 genes. For differential diagnosis, these genes should also be screened which makes the diagnostic procedure more time-consuming and labor-intensive. Methods Forty PSP patients were divided into 2 groups. Nineteen patients with different pathogenic mutations of FLCN previously identified by conventional Sanger sequencing and MLPA were included in test group, 21 random PSP patients without any genetic screening were included in blinded sample group. 7 PSP genes including FLCN, FBN1, COL3A1, CBS, SERPINA1 and TSC1/TSC2 were designed and enriched by Haloplex system, sequenced on a Miseq platform and analyzed in the 40 patients to evaluate the performance of the targeted-NGS method. Results We demonstrated that the full spectrum of genes associated with pneumothorax including FLCN gene mutations can be identified simultaneously in multiplexed sequence data. Noteworthy, by our in-house copy number analysis of the sequence data, we could not only detect intragenic deletions, but also determine approximate deletion junctions simultaneously. Conclusions NGS based Haloplex target enrichment technology is proved to be a rapid and cost-effective screening strategy for the comprehensive molecular diagnosis of BHDS in PSP patients, as it can replace Sanger sequencing and MLPA by simultaneously detecting exonic and intronic SNVs, small indels, large intragenic deletions and determining deletion junctions in PSP-related genes.

BHD syndrome is an autosomal dominant disease characterized by skin fibrofolliculomas, renal cancer, pulmonary cysts and spontaneous pneumothorax [12][13][14][15]. Since the clinical manifestations may be atypical and variable [16,17], BHD patients could exhibit a pneumothoraxdominant phenotype with no or reduced penetrance of skin or renal manifestations, the syndrome is considered to be under-diagnosed [18][19][20][21][22][23]. FLCN mutation carriers have been reported to have an increased lifetime risk of developing renal cell carcinoma (RCC) [24,25], they should be advised to take some preventive measures, largely aimed at early recognition and treatment of RCC. Under the circumstances, it is an imperative diagnostic approach to perform genetic testing in PSP patients [17].
Thus far, the molecular diagnosis of BHD syndrome is mainly based on Sanger sequencing. However, there are different types of mutations in FLCN gene, comprised of base pair substitutions, small indels, deletions and duplications [26][27][28], which make the diagnostic procedure complicated and inefficient. Moreover, for differential diagnosis, patients with PSP should also be screened for mutations in FBN1, COL3A1, CBS, SERPINA1 and TSC1/TSC2 genes. Together, this requires about 220 primer reactions in Sanger sequencing, plus at least 7 probe sets in Multiple Ligation-dependent Probe Amplification (MLPA) analysis.
Given the complicity, it is not feasible to conduct a comprehensive mutation screening in clinical diagnostics in PSP patients. In recent years, next generation sequencing (NGS) and target enrichment techniques have been further developed, allowing us to focus specifically on genomic regions of interest for cheaper multiplexed sequencing of more cases. It has been successfully used in diagnosis of genetically heterogeneous diseases, such as breast cancer and congenital muscular dystrophy [29,30].
Herein we sought to develop a NGS-based method using Haloplex target enrichment and the MiSeq platform to identify not only point mutations and indels but also copy number variations (CNV) as an accurate and easier diagnostic tool for simultaneous mutation screening of all the PSP-related genes. Using little input DNA, this system promises a quicker, more affordable and efficient analysis in diagnostic laboratories. To this end, we used Haloplex to screen 7 genes responsible for PSP in 40 PSP patients and aimed to evaluate its performance in identifying SNPs, indels and copy number variations affecting target genes.

Patients
Forty unrelated patients with PSP who had given written informed consent participated in the genetic screen and were divided into two groups.

Test group
Nineteen patients with different pathogenic mutations of FLCN previously identified by conventional Sanger sequencing and MLPA were included in test group (patients F01 to F19) [7,8].

Blinded sample group
Twenty-one random PSP patients without any genetic screening were included in blinded sample group (patients F20 to F40).
Subjects were recruited from two tertiary hospitals, Taizhou Hospital of Zhejiang Province and Nanjing Chest Hospital. All participants in this study completed a medical questionnaire. The presence of lung cysts was evaluated by chest computed tomography. Medical and surgical records were reviewed if available. Peripheral blood samples were collected for DNA extraction and subsequent analysis.

Statement of ethical approval
The study was approved by the Ethical Committees of Taizhou Hospital of Zhejiang Province (TZYY-2011106) and the Ethical Committees of Nanjing Chest Hospital (NCH-2011062). Informed consent was obtained from all individual participants included in the study.

DNA extraction, quantification, and quality control
Genomic DNA was extracted from peripheral blood samples of collected family members and controls using a Qiagen DNA Mini blood Kit (Qiagen, Hilden, Germany) according to manufacturer instruction. DNA purity and concentration were assessed by the NanoDrop2000 spectrophotometer (Thermo Fisher Scientific, Florida, USA), A260/A280 ratios of 1.8 to 2.0 were accepted. DNA fragmentation was assessed by agarose gel electrophoresis.
HaloPlex design, target enrichment, and next generation sequencing A custom Haloplex panel was designed using Agilent's online SureDesign tool (https://earray.chem.agilent.com/ suredesign/index.htm). In all, 78 target regions including the whole gene sequences of FLCN, TSC1, TSC2, CBS, COL3A1 and all coding exons, intron-exon boundaries including 50 intronic nucleotides and 5′ UTR (Untranslated Regions), 3′ UTR of FBN1, SERPINA1 were captured for enrichment (Table 1). According to previous reports, intragenic large deletions have been identified in FLCN, TSC1, TSC2, CBS and COL3A1 genes [8,[31][32][33], most breakpoints of the deletions locate in intronic sequences. In order to assess the performance of breakpoint analysis using NGS system, we captured both exonic and intronic sequences. However, large deletions in SERPINA1 gene were rarely reported. Blyth et al. (2008) suggested that Marfan patients with large deletions in FBN1 gene have a more severe phenotype, and barely manifest as isolated pneumothorax [34]. Therefore, we only targeted coding exons, intron-exon boundaries, 5′-and 3′-UTRs of SERPINA1 and FBN1genes. The predicted target coverage was 98.72 %. The 214 kb target regions were captured using the Agilent HaloPlex Target Enrichment System Kits for Illumina Sequencing (Custom Panel Tier 1, ILM, 48 reactions; Agilent Technologies, Inc. Santa Clare, CA) following Agilent protocols. In brief, 225 ng of genomic DNA was fragmented using eight different restriction enzymes and denatured. Hybridization was performed with a biotinylated custom probe library and sample indexes, and then the target DNA fragments were captured with magnetic streptavidin beads and circularized by ligation. The captured target libraries were amplified by PCR, quality controlled and quantified using the BioAnalyzer 2100 (Agilent Technologies, Inc. Santa Clare, CA). Equimolar amounts of differentially indexed samples were pooled before pair-ended sequencing 300 bp on the Illumina MiSeq platform (Illumina Inc., San Diego, CA, USA) and the mean read depth within the regions of interest was~300 reads per base.

Bioinformatics analysis for NGS results
The authors implicated in the bioinformatics analysis had no information on the patient data. Post-run sequencing quality was assessed by FastQC (Babraham Bioinformatics, Cambridge, UK). Sequence reads were aligned to the hg19 human reference genome (February 2009 assembly), and analyzed with Agilent software Surecall v2.1.1.13 using the default Haloplex parameters. The mutation reports (*.vcf files) were then annotated using ANNOVAR [35]. After removing all SNVs/indels occurring within non-coding regions, the variations were filtered against dbSNP137 NonFlagged, 1000 genomes and ESP 6500 databases. Rare, non-synonymous, exonic variants were subjected to bioinformatics tool PolyPhen [36] and SIFT [37] to evaluate their possible impact on protein structure and function. To perform extensive analysis, we further analyzed the intron sequences to identify potential splice and branch-site mutations using splice prediction tool ESEfinder [38]. If any potentially harmful mutation identified, validation was performed by Sanger sequencing.

Copy-number analysis
CNVs were identified using a depth-based method. Coverage statistics were generated using GATK Depth of Coverage tool version 3.1 [39,40]. We normalized coverage in each sample by dividing the average coverage of each exon by the total number of on-target mapped reads for that sample ðnormalized; coverage ¼ the average coverage of each exon the total number of on−target mapped reads for that sample Þ . To identify copy number variations at individual exons, we compared the normalized coverage of each exon of testing patients with that of controls. Coverage data and bars/plots graph were produced using Prism 6. Exons with a normalized coverage ratio below/above 0.7/1.3 of the mean in controls were classified as heterozygously deleted/ duplicated and further analyzed by MLPA [8]. A 20-bp interval was used for precise breakpoint determination.

Sanger sequencing
Sanger sequencing was performed to confirm all detected variants. The primers Prism 3130 were designed using the online software Primer3. The amplification reaction mixture (50 μl) was subjected to denaturation at 95°C for 2 min followed by 30 cycles at 94°C for 1 min, annealing temperature 60°C for 1 min, 72°C for 1 min and by a final extension at 72°C for 15 min. Bi-directional sequencing was performed using BigDye Terminator v3.1

Results
A total of 40 samples were included in this study. Nineteen patients with 15 different types of FLCN pathogenic mutations previously identified by conventional Sanger sequencing and MLPA were chosen as positive controls to evaluate the performance of the target enrichment method. In addition, 21 blinded samples from genetically untested PSP patients were also examined by this method to identify the underlying mutation(s) responsible for PSP and to assess this method for its diagnostic potential. All the results were re-confirmed by conventional Sanger sequencing and MLPA. The strategy used in this study were detailed in Fig. 1c and Table 1.

Sequencing quality metrics
The obtained sequencing data was analyzed by FastQC software, which indicated good-quality sequencing metrics, with an average 80.53 % of the sequenced bases above Q30. On average, a mean coverage of 364× was obtained per sample (ranged 210 ×-699×) (Fig. 1a). Different average coverage were obtained for the seven analyzed genes, ranging from 243× (COL3A1) to 592× (SERPINA1) (Fig. 1b). All genes were covered at more than 30× for at least 96 % of the target regions, promising the CNV analysis. The percentage of sequencing on target was 72.35 %.

Identification of SNVs, indels and CNVs in test group: validation of the diagnostic strategy
All 15 different pathogenic sequence changes were correctly identified in the 19 positive control samples (Table 2; patients F01-F19), resulting in a diagnostic rate of 100 % for the detection of pathogenic mutations in our system. These encompassed a variety of mutations, including 2 SNVs (1 stop gain and 1 splice site mutation), 4 insertions/6 deletions ranging from 1 to 23 bases and 3 large intragenic deletions. The false-negative rate was 0 %. All samples were analyzed blind to their mutations.

Detection of exon deletions and breakpoint analysis
We observed that the coverage varied significantly between positions from the same sample, however, the samples captured with an identical protocol had very similar distribution of sequencing depth. In this context, we normalized coverage of each exon for on-target mapped bases for that sample and performed the CNV analysis using a depthbased method. This analysis allowed us to identify 4 patients with heterozygous large intragenic deletions of FLCN gene in our study cohort: 1 deletion in FLCN comprised exons 1-3 (average normalized depth ratio = 0.356). One deletion in FLCN comprised exon 14 (normalized depth  (Fig. 2a-d).
We subsequently managed to identify exact breakpoints of the deletions by calculating the normalized depth of a set of 20 bp-interval contigs covered the entire FLCN gene (Fig. 2e-h, Table 3). For patient F16, by using MLPA, of which probes are designed in the exons of FLCN gene, we detected a deletion of exons 1-3 [8]. The estimated deletion size is between 5346-18666 bp in length, while with the NGS strategy, we were able to restrict it to about 7650 bp, closed to its actual deletion size (Fig. 2i). Likewise, in patient F17 and F18/F19, we reduced the estimated size of deletion from 1645-8671 bp and 5555-13782 bp down to about 1740 bp and 6117-12017 bp, respectively (Fig. 2i).
The breakpoints for four large intragenic deletions in FLCN had been identified by genomic amplification of the DNA regions across the deletion junctions, followed by Sanger sequencing. Each large deletion is flanked by Alu repeat sequences that mediate the mutation. Our NGS system accurately identified exon deletions and determined breakpoints on the targeted sequence within about 100 bp, largely narrowed the boundaries of the deletions, which made it much easier to design primers for genomic amplification of the deletion junction.

Mutation analysis in blinded sample group
Our study shown there were an average of 194 variants per sample within the 214 kb targeted region in our blinded PSP cohort of PSP. After removing all SNVs/ indels occurring within non-coding regions, the variations were filtered against dbSNP137 NonFlagged, 1000 genomes and ESP 6500 databases, and then submitted to PolyPhen and SIFT to evaluate possible impact of the variations on protein structure and function. Using this filtering criteria, there was 0-1 novel, predicted harmful variant remained for each sample. Of the 21 patients, we detected clearly pathogenic mutations in 3 cases (2 in FLCN, 1 in FBN1) and likely pathogenic mutations in     (Table 4). All of the variants were confirmed by Sanger sequencing. Two small frameshift deletions were identified in FLCN gene in patients F29 and F37, both resulting in a truncated protein of FLCN. And one of them, a c.1648_1658 deletion in exon 14 of the FLCN gene had been detected in patient F09. A nonsense mutation were detected in FBN1 gene in patient F32, it was a C to T transition (c.6169 C > T) in exon 51, predicting an early stop codon at position 2057 (p.Arg2057Ter) which causes a premature termination of the protein. This mutation has not been recorded in any database including FBN1 mutation database [41]. In this unscreened cohort of PSP patients, we also identified two missense mutations predicted to be damaging by SIFT and Polyphen2. One of them was a c.2269G > C transversion in exon 19 of the FBN1 gene, which converted the codon GAT for Aspartic at position 757 to CAT, a codon for Histidine. This variant is predicted by ScanProsite to situate in the EGF-like calcium-binding domain of FBN1 [42]. The other was a c.1631G > A transition in exon 15 of the TSC1 gene, which was predicted to substitute a polar uncharged glycine residue for a negatively charged glutamic residue (p.Gly544-Glu), presenting a distinct change in amino acid properties. We also detected a G to A point mutation in intron 10 (c.1177-21G > A) of FLCN gene (Fig. 3a). This variant located within the branch-site sequence, by using the ESEfinder, we obtained a matrix score for the c.1177-21G > A substitution (2.2919) of the branch-site sequence, relatively less than a score for the wild type (2.7018) (Fig. 3b), indicating that it may disrupt the branch-site in this gene. However, because of unavailability of the patient's tissue, we were unable to verify it at transcript level.
In general, NGS method not only successfully identified missense, nonsense, splice site, intronic branch site, small indel, and large deletion changes in positive and blinded samples, but also determined approximate deletion junction within about 100 bp deviation by using a depth-based method.

Discussion
In this study, to establish a genetic diagnostic method for PSP, we evaluated the performance of targeted enrichment next-generation sequencing for the molecular diagnosis of PSP patients. We enriched 7 genes which are known to cause PSP using our self-designed Haloplex panel (Table 1) and sequenced them on Miseq platform. In test group, our target-enrichment NGS method successfully identified all 19 pathogenic mutations including SNVs, Indels and large intragenic mutations previously detected by Sanger sequencing and MLPA. No false negative result in test group was obtained. Moreover, we also detected the breakpoints of the deletions within 100 bp using a depth-based bioinformatics method. In blinded sample group, we detected 3 definitely pathogenic mutations including a novel FBN1 mutation and 3 likely pathogenic mutations including an intronic branch-site mutation. Therefore, we conclude that our system has a more extensive performance than Sanger sequencing and MLPA.
As BHD patients often receive medical attention for their pneumothorax onset in the hospital, genetic screening in PSP patients will help early diagnose of BHD syndrome, thus lead to early recognition and treatment of renal cell carcinoma. However, for screening, Sanger sequencing and MLPA are relatively low throughput, inefficient and expensive. Our targeted next-generation sequencing method successfully and efficiently identifies full spectrum of FLCN gene mutations in a single experiment.
By using a depth-based bioinformatics method, we also narrowed the range of breakpoints to a great extent, but we did not identify the precise breakpoints of the deletions, as it is reported that all FLCN intragenic deletions are Alu repeats-mediated [8], yet the low sensitivity in detecting CNVs in repeat regions is not completely solved [43], and Sanger sequencing is still the gold standard for breakpoint determination. Our NGS approach thus obviates the need for separate testing for genomic deletions/duplications after Sanger sequencing, and can also largely minimize the boundaries of the deletions, which will make it much easier to design primers for long-range PCR amplification. Our limitation is that one of the breakpoint of exon 9-14 deletion was not recovered since our design did not include the 3′ flank of FLCN gene. To optimize our method, the 5′ flank and 3′ flank sequences of tested genes should be included.
In the blinded PSP cohort, two out of twenty-one patients (9.5 %) were identified to carry pathogenic mutations in FLCN gene. This is in concordance with the previous report that up to 10 % of PSP patients are carrying FLCN mutations [7,8]. A novel nonsense mutation in FBN1 (c.6169C > T; p.Arg2057Ter) was found in a male PSP patient. It was classified as definitely pathogenic mutation and was not recorded in any database. Our finding expand the spectrum of mutations in Marfan syndrome. Interestingly, a novel intronic variant in FLCN (c.1177-21G > A) was also detected by this method, bringing forward the need of raising awareness of intronic variant in pathogenesis, which is often avoided by traditional Sanger sequence.
Among the targeted genes, we also identified three mutations in FBN1 and TSC1 gene in PSP patients in addition to FLCN gene, but no predicted harmful variant was detected in TSC2, CBS, COL3A1 and SERPINA1 gene. Patients with mutations in these genes often express as syndromes and can be easily diagnosed. Isolated spontaneous pneumothorax patients with mutations in these genes are rare [44]. Nevertheless, pulmonologists should consider these syndromes and sequence these genes when treating patients for spontaneous pneumothorax, since early detection improves the prognosis significantly.
Moreover, a cost analysis of NGS approach showed significant savings. The cost with conventional Sanger sequencing is about $1760 per sample for 220 PCR and sequencing reactions, and the cost with 7 MLPA reactions is about $110 per sample. While a total cost of Haloplex target enrichment sequencing is about $400. With our NGS-based assay, a 78 % of cost savings per sample could be achieved in PSP diagnostic procedure.

Conclusions
This is the first study to evaluate the performance of NGS in the detection of exonic and intronic SNVs, Indels and large intragenic deletions, and to develop a cost-and time-effective screening system for the molecular diagnosis of BHDS in PSP cohort. In conclusion, NGS based Haloplex target enrichment technology is a rapid, high-throughput and cost-effective screening strategy for the molecular diagnosis of BHD in PSP patients, as it can replace Sanger sequencing and MLPA by simultaneously detecting exonic and intronic SNVs, small indels, large intragenic deletions and determining deletion junctions in PSP-related genes.