
Cannabis Pangenome Annotation Data

Source: DataCite Commons. Updated 2024-05-30; indexed 2024-07-13.
Download link:
https://plus.figshare.com/articles/dataset/Cannabis_Pangenome_Annotation_Data/25909024
Resource description:
<b>Abstract</b>

<i>Cannabis sativa</i> is a globally significant seed-oil, fiber, and drug-producing plant species. However, a century of prohibition has severely restricted legal breeding and germplasm resource development, leaving potential hemp-based nutritional and fiber applications unrealized. Existing cultivars are highly heterozygous and lack competitiveness in the overall fiber and grain markets, relegating hemp to less than 200,000 hectares globally<sup>1</sup>. The relaxation of drug laws in recent decades has generated widespread interest in expanding and reincorporating cannabis into agricultural systems, but progress has been impeded by the limited understanding of its genomics and breeding potential. No studies to date have examined the genomic diversity and evolution of cannabis populations using haplotype-resolved, chromosome-scale assemblies from publicly available germplasm.

Here we present a cannabis pangenome, constructed with 181 new and 12 previously released genomes from a total of 156 biological samples from both male (XY) and female (XX) plants, including 42 trio-phased and 36 haplotype-resolved, chromosome-scale assemblies. We discovered widespread regions of the cannabis pangenome that are surprisingly diverse for a single species, with high levels of genetic and structural variation, and propose a novel population structure and hybridization history. Conversely, the cannabinoid synthase genes contain very low levels of diversity, despite being embedded within a variable region containing multiple pseudogenized paralogs and distinct transposable element arrangements. Additionally, we identified variants of <i>acyl-lipid thioesterase</i> (<i>ALT</i>) genes<sup>2</sup> that are associated with fatty acid chain length variation and the production of the rare cannabinoids tetrahydrocannabivarin (THCV) and cannabidivarin (CBDV).

We conclude that the <i>Cannabis sativa</i> gene pool has only been partially characterized, that the existence of wild relatives in Asia remains likely, and that its potential as a crop species remains largely unrealized.

1. United Nations. Commodities at a glance: Special issue on industrial hemp. <i>Commod Glance</i> (2023) doi:10.18356/9789210019958.
2. Pulsifer, I. P. <i>et al.</i> Acyl-lipid thioesterase1-4 from Arabidopsis thaliana form a novel family of fatty acyl-acyl carrier protein thioesterases with divergent expression patterns and substrate specificities. <i>Plant Mol. Biol.</i> <b>84</b>, 549–563 (2014).

<b>Transposable element analysis</b>

To identify transposable elements, we used the EDTA pipeline with default settings. EDTAOutput.tar.gz includes EDTA transposon annotations for 78 scaffolded, chromosome-level cannabis genomes.

<b>Structural Variation analysis</b>

The 78 fully scaffolded assembly haplotypes were each aligned to the EH23a assembly using minimap2 (Li 2018). SyRI was then used to call structural variations on each alignment (Goel et al. 2019), and plotsr was used to visualize alignments and SVs (Goel and Schneeberger 2022).

DUP_query_coord.bed.tar.gz includes duplications for 78 assemblies with EH23a as reference.
INVTR_query_coord.bed.tar.gz includes inverted translocations for 78 assemblies with EH23a as reference.
INVs_query_coord.bed.tar.gz includes inversions for 78 assemblies with EH23a as reference.
TRANS_query_coord.bed.tar.gz includes translocations for 78 assemblies with EH23a as reference.
csat_orientations.tsv is a scaffold orientation file for 78 assemblies with EH23a as reference.
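For readers who want a quick look at the structural variation archives listed above, the following is a minimal Python sketch, not part of the dataset's own tooling. It assumes that each `.bed.tar.gz` archive holds plain-text, tab-separated BED files whose first three columns are sequence name, start, and end; the record does not document the column layout, so check a file by hand first. The function names `summarize_bed` and `summarize_archive` are hypothetical.

```python
import io
import tarfile
from collections import defaultdict


def summarize_bed(lines):
    """Return {sequence: (interval_count, total_bp)} for BED-like text lines.

    Assumption: columns 1-3 are sequence name, start, end (tab-separated).
    """
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # non-numeric coordinates, e.g. a column header row
        counts[fields[0]] += 1
        # abs() tolerates intervals written with start > end.
        lengths[fields[0]] += abs(end - start)
    return {seq: (counts[seq], lengths[seq]) for seq in counts}


def summarize_archive(path):
    """Summarize every regular file inside a gzip-compressed tar archive."""
    results = {}
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            text = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            results[member.name] = summarize_bed(text)
    return results
```

After downloading an archive from the Figshare+ record, a call such as `summarize_archive("DUP_query_coord.bed.tar.gz")` would return one summary per file in the archive, provided the layout assumption holds. Reading members in place avoids unpacking all 78 per-assembly files to disk.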

Provider:
Figshare+
Created:
2024-05-30