Mobile element insertion (MEI) detection for NGS-based clinical diagnostics

A growing number of scientific articles describe the pathogenic role of MEIs, bringing a renewed focus on their importance in clinical diagnosis. Although NGS makes it possible to capture these types of variants, identifying them remains a challenge that requires complex bioinformatic pipelines. This document describes the characteristics of MEIs and the challenges to be addressed in their identification. It then outlines a new approach developed by SeqOne to identify them in clinical routine environments.

MEI detection can significantly improve clinical diagnostics

Mobile element insertions are genomic variations that can exert significant influence on the genome and its biological function. They consist of endogenous DNA sequences that can copy and paste themselves into various genomic locations. In doing so, they can disrupt important biological mechanisms, leading to disease. As more links between MEIs and pathologies are discovered, they are the subject of an increasing number of studies. However, the difficulty of detecting them with existing bioinformatic solutions has limited the deployment of MEI detection in clinical routine environments. Consequently, it is likely that their influence and pathogenic associations are underestimated. SeqOne has developed a pipeline designed to detect MEIs and provide usable feedback on the impact of this type of genomic variant on the diagnosis.

MEI mechanism and specific detection challenges

Mobile element insertions are genomic structural variations produced through retrotransposition. They are defined as genetic elements that can move to different genomic locations using a genetic “copy-paste” mechanism, disrupting genetic function as they do so. This process is driven by a reverse transcription mechanism involving RNA intermediates (Figure 1). Several types of MEIs exist, including LINE-1 (or L1), SVA, and Alu. Approximately 500,000 Long INterspersed Element-1 (LINE-1 or L1) elements and 1.1 million Alu elements have been identified, comprising 17% and 11% of the human genomic sequence, respectively [1]. SINE-VNTR-Alu (SVA) elements are rarer and constitute approximately 0.2% of the human genomic sequence [1].

Initially, MEIs were detected using CGH array, Southern blot, Sanger sequencing, or qPCR. These techniques all have limitations in detecting this type of structural variation [1]. For instance, Sanger sequencing is limited in its ability to detect larger insertions (L1 elements) [1]. Next-Generation Sequencing (NGS) opens new perspectives for detecting this type of variant. However, MEI detection requires dedicated bioinformatic pipeline development, for two main reasons. First, as structural variants, MEIs cause larger genomic rearrangements that result in soft-clipped reads during mapping. Second, the same genomic sequences are inserted at different locations in the genome, so reads derived from an MEI can map to several locations, resulting in discordant read mapping across the genome [2]. Moreover, the presence of numerous copies in the genome can introduce mapping artifacts and lead to false positives, which makes numerous filtering steps necessary [2] (Figure 1).

Figure 1: Retrotransposition mechanism and NGS detection specificity

MEI impact on patient health

Because they can be actively copied and pasted into different genomic positions, mobile elements can be inserted into the genome and create dysregulations that lead to genetic disorders. To date, more than 120 pathogenic variants caused by retrotransposon activity have been documented. Among them, 76 were caused by Alu, 30 by L1, and 13 by SVA [3]. They are involved in numerous diseases including hemophilia (A and B), breast cancer, cystic fibrosis, and Apert syndrome (Table 1).

Hemophilia A (1/5000 male births) and B (1/30000 births) are rare X-linked disorders caused by mutations in the FVIII and FIX genes [4]. In severe forms, deep internal bleeding can lead to long-term disability, especially in the joints, including muscle atrophy, pseudo-tumors, impaired mobility, and chronic pain [4]. Cystic fibrosis is the most common genetic disorder among Caucasian children (prevalence of between 1/8000 and 1/10000 in Europe). It is characterized by the production of thick mucus that causes severe damage to the lungs and digestive system and can be fatal. Impairments of the CFTR gene are associated with this disease [5]. Apert syndrome is a rare genetic disease characterized by skeletal abnormalities and associated with impairment of the FGFR2 gene [6].

Recently, 37 unique pathogenic retrotransposon insertions were identified in 10 cancer risk genes [1]. Moreover, in a recent study, Torene et al. analyzed 89,874 clinical exomes and reported 14 MEIs classified as pathogenic or likely pathogenic according to ACMG criteria [7]. The same study estimates that assessing MEIs could increase diagnostic yield by 0.15% [7]. Overall, it is estimated that MEIs are responsible for disease in 0.04% to 0.1% of individuals with a suspected genetic disease [7]. Together, these studies show that MEIs are involved in numerous heritable pathologies. The following table summarizes some of those reported in the literature (Table 1):

Table 1: Examples of genes in which MEIs have been found

The SeqOne approach for detecting MEIs

SeqOne developed a new methodology for the detection of MEIs, which is currently available in our germline pipeline. This pipeline is composed of three main steps, each containing several filtering and control sub-steps.

  • STEP-I: MEI detection

The aim of this step is to detect all candidate breakpoints of possible MEIs and the related consensus sequences. This step includes three substeps:

  1. Retrieval of soft-clipped reads. The soft-clipped sequence must have a minimum length of 5 bp, the cut-off from which we consider it of interest for the subsequent steps.
  2. Clustering by genomic position. Only soft-clipped reads of sufficient quality are taken into account for this step. The quality is calculated from the quality of each base of the read and the read length. For a cluster to be selected, it must be composed of at least 10 good-quality soft-clipped reads (default value). This step also includes a filter on the maximum number of neighboring breakpoints for a given cluster. This filter is important because the more soft-clipped reads occur near a position, the more background noise is observed, which makes the region harder to analyze.
  3. Retrieval of the consensus sequences. The consensus sequences are selected on their length, the number of mismatches, and the mean read quality. Selecting regions that meet our quality criteria in this way limits false positives, since regions with a high number of mismatches are more likely to be false positives. The quality of the polyA tail present in MEIs is not taken into account at this step, since it has inherently low quality scores and filtering on it could lead to false negatives. At this point, each consensus sequence is reported with the following information: chromosome containing the breakpoint, position of the breakpoint, side of the soft-clipped sequence, reference allele, coverage at the breakpoint, consensus sequence, and quality score.
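
The first two substeps can be illustrated with a minimal sketch. It is not the SeqOne implementation: it assumes reads are available as hypothetical (chromosome, position, CIGAR) tuples taken from a SAM/BAM file, uses the thresholds quoted above (5 bp minimum soft clip, 10 reads per cluster), and omits the base-quality and neighboring-breakpoint filters.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

MIN_CLIP_LEN = 5        # minimum soft-clip length quoted in the text
MIN_CLUSTER_READS = 10  # default minimum cluster size quoted in the text


def soft_clip_breakpoints(pos, cigar, min_clip=MIN_CLIP_LEN):
    """Return (breakpoint, side) pairs for one aligned read.

    pos is the 1-based leftmost aligned reference position (SAM POS field).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ops = [o for o in ops if o[1] != "H"]  # ignore hard clips at the ends
    found = []
    if not ops:
        return found
    if ops[0][1] == "S" and ops[0][0] >= min_clip:
        found.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        found.append((pos + ref_len - 1, "right"))
    return found


def cluster_breakpoints(reads, min_reads=MIN_CLUSTER_READS):
    """Count soft-clipped reads per (chrom, breakpoint, side) and keep large clusters."""
    counts = defaultdict(int)
    for chrom, pos, cigar in reads:
        for breakpoint, side in soft_clip_breakpoints(pos, cigar):
            counts[(chrom, breakpoint, side)] += 1
    return {key: n for key, n in counts.items() if n >= min_reads}


# Toy input (hypothetical reads): only the first group forms a valid cluster.
reads = (
    [("chr1", 100, "50M20S")] * 12   # 12 reads clipped on the right at chr1:149
    + [("chr1", 100, "3S50M")] * 12  # clip shorter than 5 bp, ignored
    + [("chr2", 500, "10S40M")] * 3  # too few reads to form a cluster
)
clusters = cluster_breakpoints(reads)
```

Clustering on the exact breakpoint position is the simplest choice; a production pipeline would probably tolerate a small positional window around each breakpoint.
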
  • STEP-II: MEI identification

The aim of this step is to align the retrieved consensus sequences (cs) to a database of transposable elements (Dfam) and to return the breakpoints that have the best alignments. The cs are aligned with nhmmer. To select the best alignments, the following filters are applied: E-value < 0.01 and alignment score > 30.
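
The filtering logic of this step can be sketched as follows. This is an illustration under stated assumptions, not the SeqOne code: the `Hit` record and its field names are hypothetical, and the hits are assumed to have already been parsed from the nhmmer output. Only the two thresholds come from the text.

```python
from dataclasses import dataclass

MAX_EVALUE = 0.01  # hits must have an E-value below this threshold
MIN_SCORE = 30.0   # hits must have an alignment score above this threshold


@dataclass(frozen=True)
class Hit:
    breakpoint: str  # hypothetical breakpoint identifier, e.g. "chr1:149:right"
    family: str      # name of the matched Dfam model
    evalue: float
    score: float


def best_hits(hits, max_evalue=MAX_EVALUE, min_score=MIN_SCORE):
    """Keep, for each breakpoint, the highest-scoring hit that passes both filters."""
    best = {}
    for hit in hits:
        if hit.evalue >= max_evalue or hit.score <= min_score:
            continue
        current = best.get(hit.breakpoint)
        if current is None or hit.score > current.score:
            best[hit.breakpoint] = hit
    return best


# Toy hits with made-up values, for illustration only.
hits = [
    Hit("chr1:149:right", "AluYa5", 1e-12, 85.0),
    Hit("chr1:149:right", "AluSx", 1e-8, 60.0),
    Hit("chr2:500:left", "L1HS", 0.5, 42.0),   # rejected: E-value too high
    Hit("chr3:900:left", "SVA_F", 1e-5, 25.0), # rejected: score too low
]
selected = best_hits(hits)
```
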

  • STEP-III: MEI annotation 

In this step, several files are used to annotate the MEIs: the Dfam database file (.hmm), the aligned cs file (.txt), refGene (.bed), the RefSeq canonical transcripts (.tsv), and the reference genome file (.fa). The step returns a VCF file containing the selected and annotated MEIs located inside coding regions. This file is finally merged with the VCF file containing the other types of variants.
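
One part of this annotation, restricting breakpoints to annotated regions, can be sketched as below. The sketch assumes a four-column BED file (chromosome, 0-based start, exclusive end, name) and 1-based breakpoint positions; the function names, gene names, and coordinates are hypothetical and do not describe the actual SeqOne annotation code.

```python
def load_intervals(bed_lines):
    """Parse BED lines into a dict: chromosome -> list of (start, end, name)."""
    by_chrom = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split()[:4]
        by_chrom.setdefault(chrom, []).append((int(start), int(end), name))
    return by_chrom


def annotate(chrom, pos, intervals):
    """Return the names of all intervals that contain the 1-based position pos."""
    zero_based = pos - 1  # BED intervals are 0-based and half-open
    return [
        name
        for start, end, name in intervals.get(chrom, [])
        if start <= zero_based < end
    ]


# Toy BED content with hypothetical genes and coordinates.
bed = [
    "chr17\t100\t200\tGENE_A",
    "chr13\t50\t80\tGENE_B",
]
intervals = load_intervals(bed)
genes = annotate("chr17", 101, intervals)
```

A linear scan is enough for a sketch; an interval tree or sorted lookup would be preferable for a whole-genome annotation file.
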

The pipeline detects all previously described MEIs (L1, SVA, and Alu). 

The following diagram depicts the workflow developed by SeqOne: 

Figure 2: SeqOne workflow for the detection of MEIs

Our workflow detected four validated Alu controls in gene panel validation data, presented in the following table:

Table 2: Validated Alu controls detected with the SeqOne pipeline

Conclusion

This document outlines the importance of detecting mobile element insertions (MEIs) and describes a new SeqOne functionality to identify them. This new approach calls several types of MEI events (LINE-1 or L1, SVA, and Alu), and in preliminary results it detected four validated Alu controls. A growing number of scientific studies show that MEIs are involved in diseases including hemophilia, breast cancer, and cystic fibrosis. However, because of technical limitations and the need for dedicated bioinformatic pipelines, the involvement of MEIs in pathology is currently underestimated. This new approach, included in our pipelines, enriches our existing detection capabilities to provide a more accurate view of pathogenic variants and to improve clinicians’ diagnoses.

References and Credits

We thank the French medical laboratory Cerba for providing some of the control samples mentioned in this article, and for their contribution to improving the performance of AluMEI in the early stages of its development.

1. Qian Y, Mancini-DiNardo D, Judkins T, Cox HC, Brown K, Elias M, et al. Identification of pathogenic retrotransposon insertions in cancer predisposition genes. Cancer Genet. 2017;216–217:159–69.

2. Ewing AD. Transposable element detection from whole genome sequence data. Mob DNA. 2015;6:24.

3. Hancks DC, Kazazian HH. Roles for retrotransposon insertions in human disease. Mob DNA. 2016;7:9.

4. Castaman G, Matino D. Hemophilia A and B: molecular and clinical similarities and differences. Haematologica. 2019;104:1702–9.

5. Mall MA, Hartl D. CFTR: cystic fibrosis and beyond. Eur Respir J. 2014;44:1042–54.

6. Azoury SC, Reddy S, Shukla V, Deng C-X. Fibroblast Growth Factor Receptor 2 (FGFR2) Mutation Related Syndromic Craniosynostosis. Int J Biol Sci. 2017;13:1479–88.

7. Torene RI, Galens K, Liu S, Arvai K, Borroto C, Scuffins J, et al. Mobile element insertion detection in 89,874 clinical exomes. Genet Med Off J Am Coll Med Genet. 2020.

Challenges and importance of mid-sized deletion identification for genomic medicine

The clinical importance of mid-sized deletions is becoming increasingly apparent with a growing number of variants of this type being identified as pathogenic in the literature. These deletion variants are challenging to identify using traditional pipelines which struggle to align them. This document describes a new approach developed by SeqOne that facilitates the identification of such variants in clinical routine. 

A growing variety of variants need to be evaluated

A key challenge in personalized medicine is the identification of a growing variety of genomic events, as researchers discover new relationships linking these events to diseases. Many of these genomic events are not easy to detect using traditional bioinformatic pipelines. As a result, clinical geneticists must continually evolve their processes to detect these new genomic variants and provide insights into their impact on patients. 

Mid-sized deletions are challenging to detect 

Mid-sized deletion variants are a typical example of these difficult-to-detect variants. We define mid-sized deletions as deletions whose length is of the same order of magnitude as a read (between 50 and 150 base pairs). These deletions constitute a particular bioinformatic challenge because they fall into a gap that traditional pipelines handle poorly: they are longer than the 20-30 bases typical of indels and shorter than CNVs, which span several hundred bases or more.

Mid-sized deletions are difficult to identify because they produce reads that cannot be mapped properly by traditional aligners such as BWA-MEM [1]. Such mappers rely on having the majority of the bases of a read present in the correct order and location to map it successfully. Since a mid-sized deletion can split a read so that more than 70% of its bases fall on the far side of the deletion, most of the affected reads are not mapped correctly: they are “soft-clipped” or not aligned at all. These soft-clipped reads pose difficulties in variant calling and annotation. Preliminary studies by SeqOne indicate that the proportion of unaligned and soft-clipped reads can range from 3% of the reads in panel data to more than 20% of the reads in whole-exome data.

Figure 1: Mid-sized deletion mappability problem

The clinical significance of mid-sized deletions: associated pathogenic annotations

The difficulty in detecting mid-sized deletions using traditional methods invites us to examine the relevance of such deletions to the patient. To answer this question, we analyzed data from the ClinVar database and reviewed the relevant academic literature.

We defined a mid-sized deletion as one that affects between 50 and 150 base pairs. A preliminary analysis of ClinVar data revealed 70 entries for variants that fit this criterion, compared with almost 20,000 variants categorized as frameshift deletions. Mid-sized deletions therefore correspond to approximately 0.3% of all frameshift deletions. Further analysis revealed that 55 of the 70 entries were listed as pathogenic. Among them, 9 were validated as such by recognized experts in the community (see Table 1). This value probably underestimates the true number of pathogenic variants of this type, because the difficulty of detecting mid-sized deletions leads to their under-representation in the literature.

Table 1: Pathogenic mid-sized deletions with frameshift consequence, reviewed at least by multiple submitters or by an expert panel [2]

The nine validated mid-sized deletions are associated with hereditary disease. Among them, several are associated with the BRCA1 gene which is known as an important driver in breast cancer, the most frequent cause of cancer mortality in women [3,4]. One mid-sized deletion is identified in FOXG1, known to be associated with Rett syndrome (1/30 000 in the general population) that can cause severe mental disorders and microcephaly [5]. Others affect the GBA gene, known to be associated with Gaucher disease [6], an inherited syndrome (1/50 000 – 1/100 000 in the general population) causing spleen and liver enlargement, skeletal abnormalities, and blood disorders.  It is likely that pathogenic effects of mid-sized deletions are under-reported in the literature because of the difficulties in detecting them using traditional bioinformatic approaches. As such, it appears that a robust mid-sized deletion detection capability is an important aspect of hereditary disease genomics and is critical for a precise and complete diagnosis.

A new approach to detect mid-sized deletions

Faced with this requirement, SeqOne developed a new approach to the detection of mid-sized deletions. The main challenge was to find a way to correctly align the soft-clipped reads. To achieve this, we developed a new pipeline that includes an aligner optimized for mRNA read alignment. Such aligners can handle splicing, which requires aligning reads that span several non-contiguous exons while skipping the intervening introns. Because this task resembles the alignment of reads carrying mid-sized deletions, we suspected that an mRNA-capable aligner would better detect mid-sized deletions. We selected minimap2 as an additional aligner and used it to realign the soft-clipped and unaligned reads. Minimap2 was developed to align both DNA and mRNA long reads [7]. It is described as accurate and efficient, often outperforming other domain-specific alignment tools in terms of both speed and accuracy [7].

To improve mid-sized deletion detection, our procedure includes six main steps:

  1. Align the fastq files against the reference genome with bwa;
  2. Retrieve reads with soft-clipped parts longer than 20 bp, as well as unmapped reads, from the bwa output BAM. A cut-off of 20 bases was chosen as the most appropriate to select soft-clipped reads that are actually due to a mid-sized deletion: the shorter the soft-clipped part, the less likely it is to be caused by a deletion;
  3. Convert the retrieved reads to fastq format;
  4. Realign those reads with minimap2;
  5. Merge the BAM files obtained from minimap2 and bwa. At this step, the mapping quality score is checked and must be > 10. The bwa and minimap2 alignments are compared read by read: for each read, the alignment that is less soft-clipped is kept in the final BAM;
  6. Run variant calling on the final BAM file.
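
The selection logic of steps 2 and 5 can be sketched as follows. This is a simplified illustration rather than the SeqOne implementation: each alignment is represented as a hypothetical (CIGAR, mapping quality) tuple instead of a BAM record, the 20 bp cut-off is applied to the longest soft-clipped segment of a read, and ties are resolved in favor of bwa. The last two points are assumptions, not statements from the text.

```python
import re

SOFT_CLIP_RE = re.compile(r"(\d+)S")
MIN_CLIP = 20  # step 2: soft-clipped part must be longer than 20 bp
MIN_MAPQ = 10  # step 5: mapping quality must be greater than 10


def soft_clipped_bases(cigar):
    """Total number of soft-clipped bases in a CIGAR string."""
    return sum(int(n) for n in SOFT_CLIP_RE.findall(cigar))


def longest_soft_clip(cigar):
    """Length of the longest soft-clipped segment (0 if there is none)."""
    return max((int(n) for n in SOFT_CLIP_RE.findall(cigar)), default=0)


def needs_realignment(cigar, is_unmapped=False):
    """Step 2: select unmapped reads and reads with a long soft clip."""
    return is_unmapped or longest_soft_clip(cigar) > MIN_CLIP


def pick_alignment(bwa, minimap2):
    """Step 5: keep the alignment with the fewest soft-clipped bases.

    Each argument is a (cigar, mapq) tuple, or None if the read is unmapped.
    Alignments with a mapping quality of 10 or less are discarded first.
    """
    candidates = [a for a in (bwa, minimap2) if a is not None and a[1] > MIN_MAPQ]
    if not candidates:
        return None
    return min(candidates, key=lambda a: soft_clipped_bases(a[0]))


# A 150 bp read spanning a 100 bp deletion: bwa clips 90 bases,
# while the realignment represents the deletion explicitly.
bwa_aln = ("60M90S", 60)
mm2_aln = ("60M100D90M", 60)
kept = pick_alignment(bwa_aln, mm2_aln)
```
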

The following diagram depicts the workflow:

Figure 2: SeqOne workflow for better detection of mid-sized deletions

To evaluate the ability of the additional aligner to recover soft-clipped reads, we ran it on four data files: two panels of approximately three million reads and two exomes of 60 million reads (see Table 2 below).

Table 2: Reads recovered using mid-sized deletion correction in the pipeline

We observed that in the two test panels, between 3% and 10% of the reads were soft-clipped or unaligned (3.23% in one panel and 10.38% in the other). In the two test exomes, between 24% and 28% of reads were soft-clipped or unaligned (24.31% in one exome and 27.42% in the other). Preliminary estimates indicate that the additional processing needed to align the soft-clipped reads usually adds less than 20% to the overall processing time. With this new methodology, we were able to recover around 60% of soft-clipped and unmapped reads for panels and between 10% and 15% of them for whole exomes. On average, the methodology therefore makes it possible to align almost 4% of reads that would otherwise be lost to the biologist. As an example of the relevance of this approach, a pathogenic mid-sized deletion in BRCA1 that was not detected using conventional pipelines was easily identified using this solution.
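
The “almost 4%” figure can be checked with a short calculation. The per-file recovery rates are not given in the text, so the sketch below assumes 60% for both panels and 12.5% (the midpoint of the 10% to 15% range) for both exomes; the sample labels are hypothetical.

```python
# Fraction of reads that were soft-clipped or unaligned (values from the text).
clipped_fraction = {
    "panel_1": 0.0323,
    "panel_2": 0.1038,
    "exome_1": 0.2431,
    "exome_2": 0.2742,
}

# Assumed share of those reads recovered by realignment, per data type.
recovery_rate = {"panel": 0.60, "exome": 0.125}

# Recovered reads as a fraction of all reads in each file.
recovered = {
    name: fraction * recovery_rate[name.split("_")[0]]
    for name, fraction in clipped_fraction.items()
}
mean_recovered = sum(recovered.values()) / len(recovered)

for name, value in recovered.items():
    print(f"{name}: {value:.2%} of all reads recovered")
print(f"mean: {mean_recovered:.2%}")
```

Under these assumptions the mean comes out at about 3.7% of all reads, in line with the figure quoted above.
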

Conclusion

This document outlines the importance of mid-sized deletions (deletions of between 50 bp and 150 bp) and details a new approach that has been implemented in the SeqOne platform to identify them. This approach seeks to overcome the limitations of traditional alignment tools such as BWA-MEM, which do not effectively detect mid-sized deletions. Our methodology combines two aligners, BWA-MEM and minimap2, to enable the mapping of a significant number of reads that carry mid-sized deletions. Mid-sized deletions have been identified as having pathogenic links to a number of diseases including breast cancer, Rett syndrome, and Gaucher disease. It is likely that the pathogenic effects of mid-sized deletions are under-reported in the literature because of the difficulty of detecting them with traditional bioinformatic approaches. The new approach bridges the gap between the detection of small deletions and CNVs, thus providing almost complete coverage of deletion events on the SeqOne platform. By incorporating bioinformatic tools that reveal more variations and give biologists more information, SeqOne helps clinical geneticists improve their diagnostic performance.

References

1. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv13033997 Q-Bio [Internet]. 2013 [cited 2020 Mar 17]; Available from: http://arxiv.org/abs/1303.3997

2. Pérez-Palma E, Gramm M, Nürnberg P, May P, Lal D. Simple ClinVar: an interactive web server to explore and retrieve gene and disease variants aggregated in ClinVar database. Nucleic Acids Res. 2019;47:W99–105.

3. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424.

4. Semmler L, Reiter-Brennan C, Klein A. BRCA1 and Breast Cancer: a Review of the Underlying Mechanisms Resulting in the Tissue-Specific Tumorigenesis in Mutation Carriers. J Breast Cancer. 2019;22:1–14.

5. Allou L, Lambert L, Amsallem D, Bieth E, Edery P, Destrée A, et al. 14q12 and severe Rett-like phenotypes: new clinical insights and physical mapping of FOXG1-regulatory elements. Eur J Hum Genet EJHG. 2012;20:1216–23.

6. Riboldi GM, Di Fonzo AB. GBA, Gaucher Disease, and Parkinson’s Disease: From Genetic to Clinic to New Therapeutic Approaches. Cells. 2019;8.

7. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinforma Oxf Engl. 2018;34:3094–100.