D. simulans s49 and D. sechellia NF100 genomes
Source: www.repository.cam.ac.uk (indexed 2025-03-26)
Download link:
https://www.repository.cam.ac.uk/items/eea4b05b-f7fa-49c0-9d80-cfc9c2794cf0
Description:
This dataset is associated with the following article: "Trans-Regulatory Changes Underpin the Evolution of the Drosophila Immune Response". It contains the variant call (VCF) files for D. simulans Sz49 and D. sechellia DenisNF100 and the genomes generated from them. dsim49_24.vcf and dsecNF100_20.vcf are the variant call files for Sz49 and DenisNF100, respectively. dsim49.fasta and dsecNF100.fasta are the genomes generated for Sz49 and DenisNF100 from the respective variant call files.

We used publicly available data to produce reference genomes for the Drosophila lines. Whole genome sequencing data for D. simulans Sz49 were obtained from the NCBI Sequence Read Archive under BioProject PRJNA318623 (BioSample: SAMN05157406). Whole genome sequencing data for D. sechellia DenisNF100 were obtained from the NCBI Sequence Read Archive under BioProject PRJNA395473 (BioSample: SAMN07407394).

Genomic sequencing reads of Sz49 and DenisNF100 were mapped to the D. simulans reference genome (FlyBase Dsim r2.01) and the D. sechellia reference genome (FlyBase Dsec r1.3), respectively, using the BWA-MEM algorithm of the BWA package with default parameters. Duplicated read pairs were removed using MarkDuplicates in the Picard toolkit (http://broadinstitute.github.io/picard/). Variants were then called against the reference genome using the HaplotypeCaller tool of GATK (v4.1.4.1).

Single nucleotide polymorphisms (SNPs) and indels were handled separately for Base Quality Score Recalibration (BQSR). Each round of BQSR (GATK toolkit) used the hard-filtered variants from the previous round as the “true set”. For SNPs, the hard-filtering criteria were QualByDepth < 2.0, StrandOddsRatio > 3.0, FisherStrand > 60.0, RMSMappingQuality < 40.0, MappingQualityRankSumTest < -12.5 and ReadPosRankSumTest < -8.0. For indels, the hard-filtering criteria were QualByDepth < 2.0, FisherStrand > 200.0 and ReadPosRankSumTest < -20.0.
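The hard-filtering criteria can be written as a small predicate, which may help when checking individual records by hand. This is only an illustrative sketch, not the authors' code: it assumes the annotations are read from the standard GATK INFO keys (QD, SOR, FS, MQ, MQRankSum, ReadPosRankSum), that a variant is excluded when any single criterion is met, and that a missing annotation does not trigger its filter. The function names are hypothetical.

```python
from typing import Callable, Dict, List, Mapping

# Thresholds as stated in the dataset description. Each entry returns True
# when the annotation value meets the exclusion criterion.
SNP_FILTERS: Dict[str, Callable[[float], bool]] = {
    "QD": lambda v: v < 2.0,               # QualByDepth
    "SOR": lambda v: v > 3.0,              # StrandOddsRatio
    "FS": lambda v: v > 60.0,              # FisherStrand
    "MQ": lambda v: v < 40.0,              # RMSMappingQuality
    "MQRankSum": lambda v: v < -12.5,      # MappingQualityRankSumTest
    "ReadPosRankSum": lambda v: v < -8.0,  # ReadPosRankSumTest
}

INDEL_FILTERS: Dict[str, Callable[[float], bool]] = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 200.0,
    "ReadPosRankSum": lambda v: v < -20.0,
}


def failed_filters(info: Mapping[str, float], variant_type: str) -> List[str]:
    """Return the INFO keys whose exclusion criterion is met.

    Assumption: an annotation that is absent from `info` is skipped
    rather than treated as a failure.
    """
    if variant_type == "SNP":
        filters = SNP_FILTERS
    elif variant_type == "INDEL":
        filters = INDEL_FILTERS
    else:
        raise ValueError(f"unknown variant type: {variant_type!r}")
    return [key for key, fails in filters.items()
            if key in info and fails(info[key])]


def passes_hard_filter(info: Mapping[str, float], variant_type: str) -> bool:
    """A variant is kept only if none of the criteria is met."""
    return not failed_filters(info, variant_type)
```

For example, `failed_filters({"QD": 1.5, "FS": 70.0}, "SNP")` reports both keys, while the same FS value alone would not exclude an indel because the indel threshold is 200.0.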
BQSR was repeated until the number of output variants plateaued or oscillated around a constant value. The finalised variant sets were then used to modify the reference genomes and generate the Sz49 and DenisNF100 genomes with the GATK FastaAlternateReferenceMaker tool.
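The stopping rule for the repeated BQSR rounds is described only qualitatively. One possible way to make it concrete is sketched below; it is not the authors' procedure. The window size, the relative tolerance, the round limit and the `run_round` callback (which would wrap one full recalibrate, call and filter cycle and return the variant count) are all assumptions introduced for illustration.

```python
from typing import Callable, List, Sequence


def has_stabilised(counts: Sequence[int], window: int = 3,
                   rel_tol: float = 0.01) -> bool:
    """True when the last `window` counts all lie within `rel_tol` of their mean.

    Both a plateau and a small oscillation around a constant value satisfy
    this test. `window` and `rel_tol` are illustrative choices.
    """
    if len(counts) < window:
        return False
    recent = counts[-window:]
    mean = sum(recent) / window
    return all(abs(c - mean) <= rel_tol * mean for c in recent)


def run_until_stable(run_round: Callable[[int], int], max_rounds: int = 30,
                     window: int = 3, rel_tol: float = 0.01) -> List[int]:
    """Call `run_round(round_index)` until the variant count stabilises.

    `run_round` is a hypothetical callback returning the number of output
    variants for that round. Returns the per-round counts observed.
    """
    counts: List[int] = []
    for round_index in range(1, max_rounds + 1):
        counts.append(run_round(round_index))
        if has_stabilised(counts, window, rel_tol):
            break
    return counts


if __name__ == "__main__":
    # Made-up counts, used only to exercise the stopping rule.
    simulated = [120000, 101000, 100200, 99900, 100100, 100000]
    print(run_until_stable(lambda i: simulated[i - 1]))
```

A relative tolerance is used here so that the same rule applies regardless of the absolute number of variants; a fixed absolute tolerance would be an equally reasonable choice.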
Provider:
www.repository.cam.ac.uk



