
Overcoming limitations to customize DeepVariant for domesticated animals with TrioTrain

Source: Figshare. Updated: 2025-05-22. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Overcoming_limitations_to_customize_DeepVariant_for_domesticated_animals_with_TrioTrain/29945933
Description:
ABSTRACT Generating high-quality variant callsets across diverse species remains challenging as most bioinformatics tools default to assumptions based on human genomes. DeepVariant (DV) excels without joint genotyping while offering fewer implementation barriers. However, the growing appeal of a “universal” algorithm has magnified the unknown impacts when used with non-human species. We use bovine genomes to assess the limits of using human-genome-trained variant callers, including the allele frequency channel (DV-AF) and joint-caller DeepTrio (DT). Our novel approach, TrioTrain, automates extending DV for diploid species lacking Genome-in-a-Bottle (GIAB) resources, using a region shuffling approach to mitigate barriers for SLURM-based clusters. Imperfect animal truth labels are curated to remove Mendelian discordant sites before training DV to genotype the offspring correctly. With TrioTrain, we use cattle, yak, and bison trios to create the first multi-species-trained DV-AF checkpoint. Although incomplete bovine truth sets constrain recall within challenging repetitive regions, we observe a mean SNV F1 score >0.990 across new checkpoints during GIAB benchmarking. With HG002, a bovine-trained checkpoint (28) decreased the Mendelian Inheritance Error (MIE) rate by a factor of two compared to the default (DV). Checkpoint 28 has a mean MIE rate of 0.03 percent in three bovine interspecies cross genomes. These results illustrate that a multi-species, trio-based training strategy reduces inheritance errors during single-sample variant calling. While exclusively training with human genomes deters transferring deep-learning-based variant calling to new species, we use the diverse ancestry within bovids to illustrate the need for advanced tools designed for comparative genomics.

Files:

TrioTrain_README.md: README file that describes the contents and purpose of these files in further detail.

TrioTrain_project_metadata.csv: Pedigree and breed labels for all bovine samples included in this study.

CallableRegions.tar.gz: Per-sample callable region files. After cohort QC, we generated truth sets based on the UMAG1 cohort using GATK-derived genotypes. The region files were produced with GATK (v3.8-1-0-gf15c1c3ef), and the per-sample CallableLoci output was then parsed to keep only PASS regions for downstream analyses.

UMAG1.POP.FREQ.vcf.gz: UMAGv1 cohort population allele frequency file.

ReferenceGenome.tar.gz: Bovine reference genome files.

ModelCheckpoint.tar.gz: Final selected TrioTrain checkpoint (28). This file is compatible with DeepVariant (v1.4) for short-read, whole-genome-sequencing (WGS) data. Using this alternative checkpoint requires a population VCF compatible with the reference genome provided to DeepVariant.

DV-TrioTrain-0.8.tar.gz: The source code for the TrioTrain pipeline (v0.8) at the time of publication. Additional information, including installation instructions, is available on GitHub: https://github.com/jkalleberg/DV-TrioTrain/releases/tag/v0.8
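The abstract mentions removing Mendelian discordant sites from the truth labels and reporting a Mendelian Inheritance Error (MIE) rate. The Python sketch below illustrates the underlying idea only; it is not the TrioTrain implementation, and the description does not say which tool the authors used for this step. It assumes autosomal diploid sites with no missing genotypes, and it represents each genotype as a pair of allele indices. The function names `is_mendelian_consistent` and `mie_rate` are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by
    taking exactly one allele from each parent (assumed: autosomal site,
    no missing calls, no de novo mutation)."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )


def mie_rate(trio_sites):
    """Fraction of trio sites that violate Mendelian inheritance."""
    sites = list(trio_sites)
    if not sites:
        return 0.0
    errors = sum(not is_mendelian_consistent(c, m, f) for c, m, f in sites)
    return errors / len(sites)


# Illustrative (made-up) genotypes: (child, mother, father), alleles as indices.
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
    ((1, 1), (0, 0), (0, 1)),  # discordant: mother cannot contribute allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 2), (0, 1), (1, 2)),  # consistent at a multi-allelic site
]

# Curating truth labels in this sketch means dropping the discordant sites.
curated = [s for s in sites if is_mendelian_consistent(*s)]
print(len(curated), mie_rate(sites))
```

In practice, genotypes would be read from a trio VCF rather than hard-coded, and sites with missing calls or on sex chromosomes would need separate handling.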
Created:
2025-05-22