The clinical importance of mid-sized deletions is becoming increasingly apparent with a growing number of variants of this type being identified as pathogenic in the literature. These deletion variants are challenging to identify using traditional pipelines which struggle to align them. This document describes a new approach developed by SeqOne that facilitates the identification of such variants in clinical routine.
A growing variety of variants need to be evaluated
A key challenge in personalized medicine is the identification of a growing variety of genomic events, as researchers discover new relationships linking these events to diseases. Many of these genomic events are not easy to detect using traditional bioinformatic pipelines. As a result, clinical geneticists must continually evolve their processes to detect these new genomic variants and provide insights into their impact on patients.
Mid-sized deletions are challenging to detect
Mid-sized deletion variants are a typical example of these difficult-to-detect variants. We define mid-sized deletions as deletions whose length is of the same order of magnitude as a read (between 50 and 150 base pairs). They pose a particular bioinformatic challenge because they fall in the gap between the indels of up to 20-30 bases that traditional pipelines handle well and the larger CNVs of several hundred bases.
Mid-sized deletions are difficult to identify because they produce reads that cannot be mapped by traditional aligners such as BWA-MEM [1]. Such mappers rely on the majority of a read’s bases being present in the correct order and location to map the read successfully. Since a mid-sized deletion can involve the loss of more than 70% of a read’s bases, most of the impacted reads are not mapped correctly: they are either “soft-clipped” or not aligned at all. These soft-clipped reads pose difficulties in variant calling and annotation. Preliminary studies by SeqOne indicate that the proportion of soft-clipped and non-aligned reads can range from 3% of the reads in panel data to more than 20% of the reads in whole-exome data.
Figure 1: Mid-sized deletion mappability problem
The clinical significance of mid-sized deletions: associated pathogenic annotations
The difficulty of detecting mid-sized deletions with traditional methods raises the question of how relevant such deletions are to the patient. To answer this question, we analyzed data in the ClinVar database and reviewed the relevant academic literature.
We defined a mid-sized deletion as one that impacts between 50 and 150 base pairs. A preliminary analysis of ClinVar data revealed 70 entries for variants that fit this criterion, compared with almost 20,000 variants categorized as frameshift deletions. Mid-sized deletions therefore correspond to approximately 0.35% of all frameshift deletions. Further analysis revealed that 55 of the 70 entries were listed as pathogenic, 9 of which were validated as such by recognized experts in the community (see table 1). This value probably underestimates the true number of pathogenic variants of this type, as the difficulty of detecting mid-sized deletions leads to their under-representation in the literature.
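The proportions quoted above can be checked with a few lines of arithmetic. This is only a sketch: the frameshift-deletion count is given in the text as "almost 20K", so the round value of 20,000 used here is an assumption, and the variable names are purely illustrative. With that rounded denominator the share of mid-sized deletions comes out near 0.35%, the same order of magnitude as the figure given in the text.

```python
# Counts quoted in the text; frameshift_deletions is rounded ("almost 20K"),
# so the first ratio is approximate.
mid_sized_entries = 70          # ClinVar entries with a 50-150 bp deletion
frameshift_deletions = 20_000   # assumption: rounded total
pathogenic = 55                 # mid-sized entries listed as pathogenic
expert_validated = 9            # pathogenic entries validated by experts

share_of_frameshift = mid_sized_entries / frameshift_deletions
share_pathogenic = pathogenic / mid_sized_entries
share_validated = expert_validated / pathogenic

print(f"Share of frameshift deletions: {share_of_frameshift:.2%}")  # 0.35%
print(f"Listed as pathogenic:          {share_pathogenic:.1%}")     # 78.6%
print(f"Of those, expert-validated:    {share_validated:.1%}")      # 16.4%
```

Although mid-sized deletions are a small fraction of the catalogued frameshift deletions, more than three quarters of the entries found are listed as pathogenic.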
Table 1: Pathogenic mid-sized deletions with frameshift consequence, reviewed by at least multiple submitters or an expert panel
The nine validated mid-sized deletions are associated with hereditary disease. Several of them affect the BRCA1 gene, which is known as an important driver in breast cancer, the most frequent cause of cancer mortality in women [3,4]. One mid-sized deletion is identified in FOXG1, known to be associated with Rett syndrome (1 in 30,000 in the general population), which can cause severe mental disorders and microcephaly [5]. Others affect the GBA gene, known to be associated with Gaucher disease [6], an inherited syndrome (1 in 50,000 to 1 in 100,000 in the general population) causing spleen and liver enlargement, skeletal abnormalities, and blood disorders. It is likely that pathogenic effects of mid-sized deletions are under-reported in the literature because of the difficulties in detecting them using traditional bioinformatic approaches. As such, a robust mid-sized deletion detection capability appears to be an important aspect of hereditary disease genomics and critical for a precise and complete diagnosis.
A new approach to detect mid-sized deletions
Faced with this requirement, SeqOne developed a new approach to the detection of mid-sized deletions. The main challenge was to find a way to correctly align the soft-clipped reads. To achieve this, we developed a new pipeline that includes an aligner optimized for mRNA read alignment. Aligners of this type can handle splicing, which requires aligning the several non-contiguous exons involved in transcription while omitting the introns. As this task is similar to aligning reads that contain mid-sized deletions, we suspected that an mRNA-capable aligner would better detect mid-sized deletions. We selected minimap2 as an additional aligner and used it to realign the soft-clipped and non-aligned reads. This aligner was developed to align both DNA and mRNA long reads [7]. Minimap2 is described as accurate and efficient, often outperforming other domain-specific alignment tools in terms of both speed and accuracy [7].
To improve mid-sized deletion detection, our procedure includes six main steps:
- Align the fastq files against the reference genome with bwa;
- Retrieve the non-mapped reads and the reads with soft-clipped parts longer than 20 bp from bwa’s output bam. The 20-base cut-off was chosen as the most appropriate for selecting soft-clipped reads that are actually due to a mid-sized deletion: the shorter the soft-clipped part, the less likely it is to be due to a deletion;
- Convert the retrieved reads to fastq format;
- Realign those reads with minimap2;
- Merge the bam files obtained from minimap2 and bwa. At this step, the mapping quality score is checked and must be > 10. The bwa and minimap2 alignments are compared read by read, and the less soft-clipped alignment is kept in the final bam;
- Run variant calling on the final bam file.
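The read-selection and merge rules in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SeqOne's implementation: the Alignment class and the function names are hypothetical, "soft-clipped part > 20 bp" is interpreted as the longest single soft-clipped segment, ties are resolved in favour of bwa, and a read with no alignment above the mapping-quality threshold is simply dropped. The CIGAR strings in the usage lines are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

MIN_SOFT_CLIP = 20  # soft-clipped part must be longer than 20 bp (from the text)
MIN_MAPQ = 10       # mapping quality must be greater than 10 (from the text)

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> int:
    """Total number of soft-clipped bases (S operations) in a CIGAR string."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op == "S")


def longest_soft_clip(cigar: str) -> int:
    """Length of the longest single soft-clipped segment in a CIGAR string."""
    return max((int(n) for n, op in _CIGAR_OP.findall(cigar) if op == "S"), default=0)


@dataclass
class Alignment:
    """Hypothetical minimal view of one aligned read (not a real library type)."""
    name: str
    cigar: Optional[str]  # None or "*" when the read is unmapped
    mapq: int

    @property
    def is_mapped(self) -> bool:
        return self.cigar not in (None, "*")


def needs_realignment(aln: Alignment) -> bool:
    """Step 2: select unmapped reads and reads with a soft clip > 20 bp."""
    if not aln.is_mapped:
        return True
    return longest_soft_clip(aln.cigar) > MIN_SOFT_CLIP


def pick_alignment(bwa: Alignment, mm2: Alignment) -> Optional[Alignment]:
    """Step 5: among alignments with MAPQ > 10, keep the less soft-clipped one."""
    candidates = [a for a in (bwa, mm2) if a.is_mapped and a.mapq > MIN_MAPQ]
    if not candidates:
        return None  # assumption: reads failing the MAPQ check are dropped
    # min() returns the first minimum, so bwa wins ties (assumption).
    return min(candidates, key=lambda a: soft_clipped_bases(a.cigar))


# Usage with invented CIGAR strings for a 150 bp read spanning a 100 bp deletion.
bwa_hit = Alignment("read1", "60M90S", 60)      # right-hand part soft-clipped
mm2_hit = Alignment("read1", "60M100D90M", 60)  # deletion aligned explicitly
print(needs_realignment(bwa_hit))               # True
print(pick_alignment(bwa_hit, mm2_hit).cigar)   # 60M100D90M
```

In a real pipeline the CIGAR string and mapping quality would come from the bam records produced by each aligner; the sketch only isolates the decision logic.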
The following diagram depicts the workflow:
Figure 2: SeqOne workflow for better detection of mid-sized deletions
To evaluate the ability of the additional aligner to recover soft-clipped reads, we ran it on four data files: two panels of approximately three million reads and two exomes of 60 million reads (see table 2 below).
Table 2: Reads recovered using mid-sized deletion correction in the pipeline
In the two test panels, roughly 3% to 10% of the reads were soft-clipped or non-aligned (3.23% in one panel and 10.38% in the other). In the two test exomes, between 24% and 30% of the reads were soft-clipped or non-aligned (24.31% in one exome and 27.42% in the other). Preliminary estimates indicate that the additional processing needed to align the soft-clipped reads usually adds less than 20% to the overall processing time. With this new methodology, we were able to recover around 60% of the soft-clipped and non-mapped reads for panels and between 10% and 15% of them for whole exomes. This means that, on average, the methodology makes it possible to align almost 4% of all reads, which would otherwise be lost to the biologist. As an example of the relevance of this approach, a pathogenic mid-sized deletion of BRCA1 that was not detected using conventional pipelines was easily identified using this solution.
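The "almost 4%" average can be reproduced approximately from the percentages quoted above. The sketch below is only a consistency check: the per-sample recovery rates belong to Table 2 and are not reproduced here, so it assumes 60% for both panels and the midpoint of the 10% to 15% range (12.5%) for both exomes. The sample labels are illustrative.

```python
# (fraction of reads soft-clipped or non-aligned, fraction of those recovered)
# The first value of each pair is quoted in the text; the second is an assumption.
samples = {
    "panel_1": (0.0323, 0.60),
    "panel_2": (0.1038, 0.60),
    "exome_1": (0.2431, 0.125),
    "exome_2": (0.2742, 0.125),
}

# Share of ALL reads in each sample that the extra alignment step recovers.
recovered = {name: clipped * rate for name, (clipped, rate) in samples.items()}
mean_recovered = sum(recovered.values()) / len(recovered)

for name, share in recovered.items():
    print(f"{name}: {share:.2%} of all reads recovered")
print(f"mean: {mean_recovered:.1%}")  # 3.7%
```

Under these assumptions the mean is about 3.7% of all reads, consistent with the figure of almost 4% given in the text.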
This document outlines the importance of mid-sized deletions, defined as deletions of between 50 bp and 150 bp, and details a new approach that has been implemented in the SeqOne platform to identify them. This new approach seeks to overcome limitations in traditional alignment tools such as BWA-MEM that do not effectively detect mid-sized deletions. Our new methodology combines two aligners, BWA-MEM and minimap2, to enable the mapping of a significant number of reads that have mid-sized deletions. Mid-sized deletions have been identified as having pathogenic links to a number of diseases including breast cancer, Rett syndrome and Gaucher disease. It is likely that pathogenic effects of mid-sized deletions are under-reported in the literature because of the difficulties in detecting them using traditional bioinformatic approaches. The new approach to identifying mid-sized deletions bridges the gap between the detection of small deletions and CNVs, thus providing almost complete coverage of all deletion events on the SeqOne platform. By incorporating bioinformatic tools that reveal more variations and give biologists more information, SeqOne helps clinical geneticists improve their diagnostic performance.
1. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio] [Internet]. 2013 [cited 2020 Mar 17]; Available from: http://arxiv.org/abs/1303.3997
2. Pérez-Palma E, Gramm M, Nürnberg P, May P, Lal D. Simple ClinVar: an interactive web server to explore and retrieve gene and disease variants aggregated in ClinVar database. Nucleic Acids Res. 2019;47:W99–105.
3. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424.
4. Semmler L, Reiter-Brennan C, Klein A. BRCA1 and Breast Cancer: a Review of the Underlying Mechanisms Resulting in the Tissue-Specific Tumorigenesis in Mutation Carriers. J Breast Cancer. 2019;22:1–14.
5. Allou L, Lambert L, Amsallem D, Bieth E, Edery P, Destrée A, et al. 14q12 and severe Rett-like phenotypes: new clinical insights and physical mapping of FOXG1-regulatory elements. Eur J Hum Genet. 2012;20:1216–23.
6. Riboldi GM, Di Fonzo AB. GBA, Gaucher Disease, and Parkinson’s Disease: From Genetic to Clinic to New Therapeutic Approaches. Cells. 2019;8.
7. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.