The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars

Mendeley Data, updated 2024-04-13, indexed 2024-06-27

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.qnk98sfpt


Resource description:

# The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars

The dataset contains two items: **(i)** syntenic alignments between *C. canephora*, *C. eugenioides* and *C. arabica* assemblies, and **(ii)** the variant calls used in the population analyses in the paper.

## Description of the data and file structure

**i.** Syntenic alignments were obtained with the CoGe SynMap tool using default settings. In the file names, the first two items give the CoGe IDs of the genomes being aligned: *C. canephora* - ID50947; *C. eugenioides* - ID51132; *C. arabica* subCC - ID65471; *C. arabica* subEE - ID65472; *C. arabica* - ID65463. The contents of the columns are described on row 3 of the syntenic alignment files, and on row 1 of the tandem duplicate files (which can be identified by .*tandems.* in their names).

**ii.** The variant calls are given in VCF-formatted files. Each subgenome has its own file: Arabica_sgC.TIP.BB contains variant calls for subgenome CC of *C. arabica*, and Arabica_sgE.TIP.BB contains those for subgenome EE. Variants have been filtered for SNPs that were called as heterozygous in the di-haploid *C. arabica* accession Et39, but otherwise no quality filtering has been applied to the variants in these files. See Supplementary Material, Section 6.2, for the specific filtering steps carried out in the publication.

The mapping between sequencing IDs and accession names is given in the provided Excel sheet (Accession_info.xlsx). Seq.ID (column B) shows the accession ID used in sequencing. accession_name (column C) gives the name of the accession/cultivar. Species_name (column D) gives the species of the accession; three different Coffea species were analysed in this study. Variety (column E) gives information on the cultivation status; Introgressed identifies *C. arabica* x *C. canephora* hybrids.

Columns F-J provide the place of origin of the accession (district/location, country, and GPS coordinates). A cell is left empty if the exact value (GPS coordinate or altitude) is not known. Columns K-M provide genome information: ploidy level, estimated genome size and genome structure. Columns N-R give additional information: donor institute, collection location, additional notes on the accession, and the original reference. If the exact collection location is not known, the cell is left empty; in those cases the material was obtained from line(s) maintained by the donor institute (column N). Cells in columns Q-R are left empty if there is no (known) original publication associated with the accession.

Code/Software: The CoGe platform was used for the syntenic alignments. For the variant calls, the Linux operating system and GATK were used to obtain the VCF files; subsequent analysis was carried out using R, Plink, vcftools and smc++.
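
As a rough illustration of the file conventions described above, the following Python sketch looks up the genomes named in an alignment file name, picks the row that describes the columns, and lists the sample IDs in a VCF header. The function names and the assumption that the two CoGe IDs are the first two runs of digits in a file name are hypothetical and not taken from the dataset. Only the ID-to-genome table and the row positions come from the description.

```python
import gzip
import re

# CoGe genome IDs as listed in the dataset description.
COGE_GENOMES = {
    "50947": "C. canephora",
    "51132": "C. eugenioides",
    "65471": "C. arabica subCC",
    "65472": "C. arabica subEE",
    "65463": "C. arabica",
}


def genomes_in_filename(name):
    """Label the first two digit runs in a file name as CoGe genomes.

    Assumption: the two CoGe IDs are the first two runs of digits in the
    name. The exact naming pattern is not given in the description.
    """
    ids = re.findall(r"\d+", name)[:2]
    return [COGE_GENOMES.get(i, "unknown:" + i) for i in ids]


def header_row_index(name):
    """Return the 0-based index of the row that describes the columns.

    Row 1 for tandem duplicate files, row 3 for syntenic alignment files.
    """
    return 0 if "tandems" in name else 2


def vcf_sample_ids(path):
    """Return the sample IDs from the #CHROM header line of a VCF file.

    A VCF with genotype data has nine fixed columns (CHROM to FORMAT)
    followed by one column per sample. Gzip-compressed input is handled
    only when the path ends in ".gz" (an assumption about naming).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return line.rstrip("\n").split("\t")[9:]
    return []
```

The sample IDs returned by `vcf_sample_ids` would presumably correspond to the Seq.ID column (column B) of Accession_info.xlsx, though that correspondence should be checked against the actual files.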


Created:

2024-01-08