Genome-wide genic copy numbers for nine Macaca species based on 1kb windows

Mendeley Data | Updated 2024-01-31 | Indexed 2024-06-27

Download link:

https://figshare.com/articles/dataset/Genome-wide_copy_numbers_for_enomes_of_nine_Macaca_species_based_on_1kb_windows/9900401/2


Description:

This dataset provides genome-wide copy numbers (CNs) for every 1-kb window in the genomes of nine Macaca species: the Chinese rhesus (M. mulatta lasiota, CR), cynomolgus (M. fascicularis, CE), Tibetan (M. thibetana, TM), stump-tailed (M. arctoides, SM), southern pig-tailed (M. nemestrina, PM), Japanese (M. fuscata, JM), Taiwanese (M. cyclopis, TwM), Barbary (M. sylvanus, BM), and lion-tailed (M. silenus, LM) macaques.

Methods: We used FastQC (v0.11.8) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) to run quality-control checks on the raw resequencing data, then used Trimmomatic (v0.36) (Bolger et al. 2014) to filter and trim the reads. The cleaned reads were aligned to the Mmul_8 reference genome (Zimin et al. 2014) using BWA mem (Li and Durbin 2009).

CNs were estimated from read depth (RD) with the fastCN pipeline (https://github.com/KiddLab/fastCN) (Pendleton et al. 2018), which was designed to efficiently estimate genome copy number from short-read data. fastCN is built upon the mrsFAST aligner (Hach et al. 2014). It divides reads into 36-bp subreads, determines all possible matching locations on the reference genome with fewer than two substitutions, and reports per-bp read depth in an efficient compressed binary format.

The estimation was implemented in two steps. First, we performed GC correction using custom-defined control regions. The aim of GC correction is to remove the GC bias introduced by PCR during library preparation and sequencing. Because no suitable control regions were available across macaques, we created a two-step process to retrieve copy-number-invariant (control) regions in the diploid genome. Second, we estimated genome-wide CNs from the RD of the aligned BAM files, using windows that each contain one kilobase (kb) of unmasked, non-gap positions. RD values were converted to CNs using a correction factor (CF) calculated from the average RD of the control regions.
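The phrase "windows containing one-kilobase unmasked, non-gap positions" implies that windows vary in genomic span, since masked and gap bases are skipped when counting. The sketch below illustrates one way such windows could be built and averaged. It is not the fastCN implementation: the function names, the boolean-list mask representation, and the choice to drop a trailing partial window are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def unmasked_windows(masked: Sequence[bool], size: int = 1000) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows that each hold `size` unmasked positions.

    `masked[i]` is True when position i is masked or falls in an assembly gap
    (hypothetical representation, chosen for this example).
    """
    windows: List[Tuple[int, int]] = []
    start = 0
    count = 0
    for pos, is_masked in enumerate(masked):
        if is_masked:
            continue  # masked and gap positions do not count toward the window size
        if count == 0:
            start = pos  # a window opens at its first unmasked position
        count += 1
        if count == size:
            windows.append((start, pos + 1))
            count = 0
    # Assumption: a trailing window with fewer than `size` unmasked positions is dropped.
    return windows


def window_mean_depth(depth: Sequence[float], masked: Sequence[bool],
                      window: Tuple[int, int]) -> float:
    """Mean per-bp read depth over the unmasked positions of one window."""
    start, end = window
    values = [d for d, m in zip(depth[start:end], masked[start:end]) if not m]
    return sum(values) / len(values)


# Toy example with 2 unmasked positions per window instead of 1000.
toy_mask = [False, True, False, False, True, False, False]
toy_depth = [10, 99, 20, 30, 99, 50, 7]
toy_windows = unmasked_windows(toy_mask, size=2)   # [(0, 3), (3, 6)]
toy_means = [window_mean_depth(toy_depth, toy_mask, w) for w in toy_windows]  # [15.0, 40.0]
```

In the toy example, the first window spans three genomic positions because one of them is masked, yet it still contributes exactly two unmasked positions to the mean.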
RD is converted to CN as follows:

$$\mathrm{CF} = \frac{\mathrm{RD}_{\mathrm{ctl}}}{2}, \qquad \mathrm{CN} = \frac{\mathrm{RD}}{\mathrm{CF}}$$

where CF is the correction factor, RD is the read depth of a specific genomic window, and RD_ctl is the mean read depth of the control regions. Dividing RD_ctl by 2 scales the depths so that the control regions, taken to be diploid, have a mean CN of 2.

Unplaced contigs were merged into a single 'chrUn' sequence during data processing to reduce CPU time.

References:

Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114-2120.

Hach F, Sarrafi I, Hormozdiari F, Alkan C, Eichler EE, et al. (2014) mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Research 42: W494-W500.

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25: 1754-1760.

Pendleton AL, Shen F, Taravella AM, Emery S, Veeramah KR, et al. (2018) Comparison of village dog and wolf genomes highlights the role of the neural crest in dog domestication. BMC Biology 16: 64.

Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, et al. (2014) A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biology Direct 9: 1-15.
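As a worked illustration of the CF and CN formulas in the description, the following minimal sketch converts per-window read depths to copy numbers. The function names and the toy depth values are hypothetical and are not taken from the dataset or from fastCN.

```python
from statistics import mean
from typing import List, Sequence


def correction_factor(control_depths: Sequence[float]) -> float:
    """CF = RD_ctl / 2, where RD_ctl is the mean read depth of the control regions."""
    return mean(control_depths) / 2


def copy_numbers(window_depths: Sequence[float], cf: float) -> List[float]:
    """CN = RD / CF for each window's read depth."""
    if cf <= 0:
        raise ValueError("correction factor must be positive")
    return [rd / cf for rd in window_depths]


# Toy values: control regions average 30x, so CF = 15.
cf = correction_factor([30, 30, 30])
cns = copy_numbers([30, 15, 45, 0], cf)  # [2.0, 1.0, 3.0, 0.0]
```

With a control depth of 30x, a window at 30x comes out at CN 2, a window at half that depth at CN 1, and a window with no coverage at CN 0.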


Created:

2024-01-31