Mobile element insertion (MEI) detection for NGS-based clinical diagnostics

A growing number of scientific articles describe the pathogenic role of MEIs, bringing renewed focus on their importance in clinical diagnosis. Although NGS makes it possible to capture these variants, identifying them remains a challenge that requires complex bioinformatic pipelines. This document describes the characteristics of MEIs and the challenges involved in identifying them. It then outlines a new approach, developed by SeqOne, to identify them in routine clinical settings.

MEI detection can significantly improve clinical diagnostics

Mobile element insertions are genomic variations that can exert significant influence on the genome and its biological function. They consist of endogenous DNA sequences that can copy and paste themselves into various genomic locations. In doing so, they can disrupt important biological mechanisms and lead to disease. As more links between MEIs and pathologies are discovered, they are the subject of an increasing number of studies. However, the difficulty of detecting them with existing bioinformatic solutions has limited their use in routine clinical settings. As a consequence, their influence and pathogenic associations are likely underestimated. SeqOne has developed a pipeline designed to detect MEIs and provide usable feedback on the impact of this type of genomic variant on the diagnosis.

MEI mechanism and specific detection challenges

Mobile element insertions are genomic structural variations produced through retrotransposition. Mobile elements are genetic elements that can move to different genomic locations through a “copy-paste” mechanism, disrupting genetic function as they do so. This process is controlled by a reverse transcription mechanism involving RNA intermediates (Figure 1). Several types of mobile elements exist, including LINE-1 (or L1), SVA, and Alu. Approximately 500,000 Long INterspersed Element-1 (LINE-1 or L1) copies and 1.1 million Alu elements have been identified, comprising 17% and 11% of the human genome, respectively [1]. SINE-VNTR-Alu (SVA) elements are rarer and constitute approximately 0.2% of the human genome [1].

Initially, MEIs were detected using CGH array, Southern blot, Sanger sequencing, or qPCR. These techniques all have limitations in detecting this type of structural variation [1]. For instance, Sanger sequencing is limited in its ability to detect larger insertions (L1 elements) [1]. Next-Generation Sequencing (NGS) opens new perspectives for detecting this type of variant. However, MEI detection requires dedicated bioinformatic pipelines. First, as structural variants, MEIs cause larger genomic rearrangements, which lead to soft-clipped reads during mapping. Second, the same sequence is inserted at many different locations in the genome, so reads map to several locations and produce discordant read mapping across the genome [2]. Finally, the presence of numerous copies in the genome can introduce mapping artifacts and lead to false positives, which makes multiple filtering steps necessary [2] (Figure 1).
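
To make the soft-clipping signal concrete, the following minimal Python sketch parses a SAM CIGAR string and reports the soft-clipped bases at either end of a read. It is only an illustration: the function name `soft_clips` and the example read are assumptions introduced here and are not taken from the SeqOne pipeline.

```python
import re

# A CIGAR string is a series of <length><operation> pairs (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar, sequence):
    """Return (side, clipped_bases) for each soft-clipped end of a read.

    `side` is "left" or "right" relative to the aligned read.
    Hard clips (H) are skipped because their bases are absent from SEQ.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]
    clips = []
    if ops and ops[0][1] == "S":
        clips.append(("left", sequence[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S":
        clips.append(("right", sequence[-ops[-1][0]:]))
    return clips

# Hypothetical read: 8 aligned bases followed by 6 soft-clipped bases.
print(soft_clips("8M6S", "ACGTACGTTTTTTT"))
```

A read whose end runs into an inserted element typically shows this pattern: the part matching the reference aligns, and the part belonging to the insertion is soft-clipped.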

Figure 1: Retrotransposition mechanism and NGS detection specificity

MEI impact on patient health

Because they can be actively copied and pasted into different genomic positions, mobile elements can insert into the genome and create dysregulations that lead to genetic disorders. To date, more than 120 pathogenic variants caused by retrotransposon activity have been documented, including 76 caused by Alu, 30 by L1, and 13 by SVA [3]. They are involved in numerous diseases, including hemophilia (A and B), breast cancer, cystic fibrosis, and Apert syndrome (Table 1).

Hemophilia A (1 in 5,000 male births) and hemophilia B (1 in 30,000 births) are rare X-linked disorders caused by mutations in the FVIII and FIX genes [4]. In severe forms, deep internal bleeding, especially in the joints, can lead to long-term disability, including muscle atrophy, pseudo-tumors, impaired mobility, and chronic pain [4]. Cystic fibrosis is the most common genetic disorder among Caucasian children (prevalence between 1 in 8,000 and 1 in 10,000 in Europe). It is characterized by the production of thick mucus that severely damages the lungs and digestive system and can be fatal. Impairments of the CFTR gene are associated with this disease [5]. Apert syndrome is a rare genetic disease characterized by skeletal abnormalities and associated with impairment of the FGFR2 gene [6].

Recently, 37 unique pathogenic retrotransposon insertions were identified in 10 cancer risk genes [1]. Moreover, in a recent study, Torene et al. analyzed 89,874 clinical exomes and reported 14 MEIs classified as pathogenic or likely pathogenic according to ACMG criteria [7]. The same study estimated that assessing MEIs could increase diagnostic yield by 0.15% [7]. Overall, MEIs are estimated to be responsible for disease in 0.04% to 0.1% of individuals with a suspected genetic disease [7]. Together, these studies show that MEIs are involved in numerous heritable pathologies. Table 1 summarizes some of those reported in the literature:

Table 1: Examples of genes in which MEIs have been found

The SeqOne approach for detecting MEIs

SeqOne developed a new methodology for the detection of MEIs, which is currently available in our germline pipeline. The pipeline consists of three main steps, each containing several filtering and control sub-steps.

  • STEP-I: MEI detection

The aim of this step is to detect all candidate breakpoints of possible MEIs and the related consensus sequences. This step includes three sub-steps:

  1. Retrieving soft-clipped reads. The soft-clipped sequence must be at least 5 bp long, the cutoff from which we consider it of interest for the following steps.
  2. Clustering by genomic position. Only soft-clipped reads of sufficient quality are taken into account in this step. The quality is calculated from the quality of each base of the read and the read length. To be selected, a cluster must contain at least 10 good-quality soft-clipped reads (default value). This step also includes a filter on the maximum number of neighboring breakpoints for a given cluster. This filter is important because the more soft-clipped reads occur near a position, the more background noise is observed, which makes the region harder to analyze.
  3. Retrieving the consensus sequences. The consensus sequences are selected on their length, their number of mismatches, and the mean read quality. Keeping only regions that meet our quality criteria limits false positives, since regions with a high number of mismatches are more likely to be false positives. The quality of the polyA tail present in MEIs is not taken into account at this step, because it has inherently low quality scores and filtering on it can lead to false negatives. At this point, each consensus sequence is reported with the following information: chromosome containing the breakpoint, position of the breakpoint, side of the soft-clipped sequence, reference allele, coverage at the breakpoint, consensus sequence, and quality score.
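
The three sub-steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SeqOne implementation: the `SoftClip` record, the exact-position grouping, the mean-quality cutoff `MIN_MEAN_QUAL`, and the majority-vote consensus are all hypothetical choices made here. Only the 5 bp minimum length and the 10-read minimum cluster size come from the text. The neighboring-breakpoint filter and the mismatch filter are omitted.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

MIN_CLIP_LEN = 5        # minimum soft-clip length in bp (from the text)
MIN_CLUSTER_READS = 10  # default minimum reads per cluster (from the text)
MIN_MEAN_QUAL = 20      # hypothetical cutoff; the text gives no value

@dataclass
class SoftClip:
    chrom: str
    pos: int     # breakpoint position
    side: str    # "left" or "right"
    seq: str     # clipped bases; index 0 is assumed adjacent to the breakpoint
    quals: list  # Phred quality of each clipped base

def good_quality(clip):
    """Keep clips that are long enough and have a sufficient mean base quality."""
    if len(clip.seq) < MIN_CLIP_LEN or not clip.quals:
        return False
    return sum(clip.quals) / len(clip.quals) >= MIN_MEAN_QUAL

def cluster(clips):
    """Group good-quality clips by exact breakpoint; keep well-supported clusters."""
    groups = defaultdict(list)
    for clip in clips:
        if good_quality(clip):
            groups[(clip.chrom, clip.pos, clip.side)].append(clip)
    return {k: v for k, v in groups.items() if len(v) >= MIN_CLUSTER_READS}

def consensus(clips):
    """Majority vote per column, reading outward from the breakpoint."""
    longest = max(len(c.seq) for c in clips)
    bases = []
    for i in range(longest):
        column = Counter(c.seq[i] for c in clips if i < len(c.seq))
        bases.append(column.most_common(1)[0][0])
    return "".join(bases)

# Ten hypothetical reads supporting the same breakpoint; one has a mismatch.
reads = [SoftClip("chr1", 1000, "right", "TTTTTTA", [30] * 7) for _ in range(9)]
reads.append(SoftClip("chr1", 1000, "right", "TTTTTTC", [30] * 7))
for key, members in cluster(reads).items():
    print(key, len(members), consensus(members))
```

Grouping by exact position keeps the sketch short. A production pipeline would more likely tolerate small positional offsets between reads supporting the same breakpoint.
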
  • STEP-II: MEI identification

The aim of this step is to align the retrieved consensus sequences (cs) to a database of transposable elements (Dfam) and return the breakpoints that have the best alignments. The consensus sequences are aligned with nhmmer. To select the best alignments, the following filters are applied: e-value < 0.01 and alignment score > 30.
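
The filtering logic of this step can be illustrated with a short Python sketch. The two thresholds come from the text. The `Hit` record, its field names, the example values, and the choice of "highest score" as the definition of the best alignment are assumptions made for illustration. Parsing of the actual nhmmer output is left out.

```python
from dataclasses import dataclass

MAX_EVALUE = 0.01  # from the text: e-value < 0.01
MIN_SCORE = 30.0   # from the text: alignment score > 30

@dataclass
class Hit:
    breakpoint_id: str  # hypothetical identifier of the consensus sequence
    family: str         # matched Dfam model (example values below)
    evalue: float
    score: float

def best_hits(hits):
    """Apply both thresholds, then keep the highest-scoring hit per breakpoint."""
    best = {}
    for hit in hits:
        if hit.evalue >= MAX_EVALUE or hit.score <= MIN_SCORE:
            continue
        current = best.get(hit.breakpoint_id)
        if current is None or hit.score > current.score:
            best[hit.breakpoint_id] = hit
    return best

# Hypothetical alignment results for two candidate breakpoints.
hits = [
    Hit("bp1", "AluY", 1e-12, 85.0),
    Hit("bp1", "AluSx", 1e-8, 60.2),
    Hit("bp2", "L1HS", 0.5, 12.0),  # fails both thresholds
]
for bp, hit in best_hits(hits).items():
    print(bp, hit.family, hit.score)
```

In this example only the first breakpoint is retained, with its highest-scoring match. The second breakpoint has no alignment passing the thresholds and is dropped.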

  • STEP-III: MEI annotation 

This step takes several input files to annotate the MEIs: the Dfam database file (.hmm), the aligned cs file (.txt), refGene (.bed), the RefSeq canonical transcripts (.tsv), and the reference genome file (.fa). It returns a VCF file containing the selected and annotated MEIs located inside coding regions. This file is finally merged with the VCF file containing the other types of variants.
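
As an illustration of restricting candidates to annotated regions, the sketch below checks whether a breakpoint falls inside intervals read from BED lines. It assumes the regions of interest are available as plain BED intervals. The function names and the example coordinates are hypothetical, and the actual annotation logic of the pipeline is not described in this document. BED coordinates are 0-based and half-open, whereas VCF positions are 1-based, so the position is converted before the comparison.

```python
import bisect

def load_intervals(bed_lines):
    """Parse BED lines into per-chromosome sorted lists of (start, end)."""
    by_chrom = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom

def in_region(by_chrom, chrom, pos_1based):
    """True if a 1-based position falls inside any interval of `chrom`."""
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate system
    intervals = by_chrom.get(chrom, [])
    # Only intervals that start at or before pos0 can contain it.
    idx = bisect.bisect_right(intervals, (pos0, float("inf")))
    return any(start <= pos0 < end for start, end in intervals[:idx])

# Hypothetical regions and breakpoints.
regions = load_intervals(["chr1\t100\t200", "chr1\t150\t400", "chr2\t0\t10"])
for chrom, pos in [("chr1", 101), ("chr1", 100), ("chr2", 11)]:
    print(chrom, pos, in_region(regions, chrom, pos))
```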

The pipeline detects all previously described MEIs (L1, SVA, and Alu). 

The following diagram depicts the workflow developed by SeqOne: 

Figure 2: SeqOne workflow for the detection of MEI

Our workflow detected four validated Alu controls in gene panel validation data, presented in the following table:

Table 2: Validated Alu controls detected with the SeqOne pipeline

Conclusion

This document outlines the importance of detecting mobile element insertions (MEIs) and describes a new SeqOne functionality to identify them. This new approach calls several types of MEI events, LINE-1 (or L1), SVA, and Alu, and in preliminary results it detected four validated Alu controls. A growing number of scientific studies show that MEIs are involved in diseases including hemophilia, breast cancer, and cystic fibrosis. However, because of technical limitations and the need for dedicated bioinformatic pipelines, the involvement of MEIs in pathology is currently underestimated. This new approach, included in our pipelines, enriches our existing detection capabilities to provide a more accurate view of pathogenic variants and improve clinicians’ diagnoses.

References and Credits

We thank the French medical laboratory Cerba for providing some of the control samples mentioned in this article, and for their contribution to improving the performance of AluMEI in the early stages of its development.

1. Qian Y, Mancini-DiNardo D, Judkins T, Cox HC, Brown K, Elias M, et al. Identification of pathogenic retrotransposon insertions in cancer predisposition genes. Cancer Genet. 2017;216–217:159–69.

2. Ewing AD. Transposable element detection from whole genome sequence data. Mob DNA. 2015;6:24.

3. Hancks DC, Kazazian HH. Roles for retrotransposon insertions in human disease. Mob DNA. 2016;7:9.

4. Castaman G, Matino D. Hemophilia A and B: molecular and clinical similarities and differences. Haematologica. 2019;104:1702–9.

5. Mall MA, Hartl D. CFTR: cystic fibrosis and beyond. Eur Respir J. 2014;44:1042–54.

6. Azoury SC, Reddy S, Shukla V, Deng C-X. Fibroblast Growth Factor Receptor 2 (FGFR2) Mutation Related Syndromic Craniosynostosis. Int J Biol Sci. 2017;13:1479–88.

7. Torene RI, Galens K, Liu S, Arvai K, Borroto C, Scuffins J, et al. Mobile element insertion detection in 89,874 clinical exomes. Genet Med Off J Am Coll Med Genet. 2020.