ADS-HCSpark: a scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on Spark

BMC Bioinformatics 20(1): 76

ISSN/ISBN: 1471-2105

PMID: 30764760

DOI: 10.1186/s12859-019-2665-0

Advances in next-generation sequencing have raised throughput and lowered cost. Variant calling, the basis of high-throughput sequencing data analysis, is widely used in disease research, clinical treatment and medical research. However, current mainstream variant callers suffer from serious computational bottlenecks, which produce long-tail tasks on large datasets. This limits scalability on multi-node, multi-core clusters and leads to long runtimes and inefficient use of computing resources. A highly scalable tool that runs in a distributed environment would therefore be very useful for accelerating variant calling on large-scale genome data. In this paper, we present ADS-HCSpark, a scalable variant calling tool built on the Apache Spark framework. ADS-HCSpark accelerates variant calling by parallelizing the mainstream GATK HaplotypeCaller algorithm across multiple cores and multiple nodes. To address computation skew in HaplotypeCaller, we propose a parallel strategy of adaptive data segmentation and implement a variant calling algorithm based on it, which achieves good scalability in both single-node and multi-node settings. Because adjacent data blocks must have overlapping boundaries, the Hadoop-BAM library is customized to partition a BAM file into overlapping blocks, which further improves the accuracy of variant calling.
ADS-HCSpark was evaluated on our cluster. In the best-performing configuration achievable on this experimental platform, ADS-HCSpark is 74% faster than GATK3.8 HaplotypeCaller in single-node experiments, and 57% faster than GATK4.0 HaplotypeCallerSpark and 27% faster than SparkGA in multi-node experiments, with better scalability and an accuracy of over 99%. The source code of ADS-HCSpark is publicly available at .
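
The two partitioning ideas named in the abstract, overlapping block boundaries and adaptive segmentation of skewed blocks, can be illustrated with a small plain-Python sketch. This is not the ADS-HCSpark implementation, which runs on Spark with a customized Hadoop-BAM. The names `Block`, `overlapped_blocks`, `read_count_cost` and `adaptive_split` are hypothetical, and using the number of reads in a block as the cost estimate is an assumption, since the abstract does not say how computation skew is measured.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Block:
    start: int  # inclusive coordinate
    end: int    # exclusive coordinate


def overlapped_blocks(length: int, block_size: int, overlap: int) -> List[Block]:
    """Cut [0, length) into fixed-size blocks, each padded by `overlap` on both sides."""
    if block_size <= 0 or overlap < 0:
        raise ValueError("block_size must be positive and overlap non-negative")
    blocks = []
    for core_start in range(0, length, block_size):
        core_end = min(core_start + block_size, length)
        blocks.append(Block(max(0, core_start - overlap), min(length, core_end + overlap)))
    return blocks


def read_count_cost(read_positions: List[int]) -> Callable[[Block], int]:
    """Assumed cost proxy: the number of reads whose position falls inside the block."""
    positions = sorted(read_positions)

    def cost(block: Block) -> int:
        return bisect_left(positions, block.end) - bisect_left(positions, block.start)

    return cost


def adaptive_split(blocks: List[Block], cost: Callable[[Block], int],
                   max_cost: int, overlap: int, min_size: int) -> List[Block]:
    """Halve any block whose estimated cost exceeds max_cost, keeping overlapped halves."""
    if min_size <= 2 * overlap:
        # Otherwise a half plus its padding might not be smaller than its parent.
        raise ValueError("min_size must be greater than 2 * overlap")
    result = []
    stack = list(reversed(blocks))
    while stack:
        block = stack.pop()
        size = block.end - block.start
        if cost(block) <= max_cost or size <= min_size:
            result.append(block)
            continue
        mid = block.start + size // 2
        # Push the right half first so the left half is processed next.
        stack.append(Block(mid - overlap, block.end))
        stack.append(Block(block.start, mid + overlap))
    return result


# Hypothetical input: 100 reads piled up near the start, two reads elsewhere.
cost = read_count_cost(list(range(100)) + [500, 900])
segments = adaptive_split(overlapped_blocks(1000, 400, 50), cost,
                          max_cost=60, overlap=10, min_size=50)
```

In this sketch the dense region near coordinate 0 is cut into several small segments while the sparse blocks are left whole, so each resulting task carries a bounded estimated load. The `min_size` guard keeps the recursion from splitting indefinitely when a region is too dense to fall under `max_cost`.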

Accession: 066496063