
Multispecies pangenomes reveal a pervasive influence of population size on structural variation

Source: DataONE. Updated: 2025-12-29. Indexed: 2026-01-03.
Download link:
https://search.dataone.org/view/sha256:7ad5173706f287fe6354f21ef5d873fe3aaffe531d8d0ffc5479a9309236bab0
Resource description:
Structural variants (SVs) are widespread in vertebrate genomes, yet their evolutionary dynamics remain poorly understood. Using 45 long-read de novo genome assemblies and pangenome tools, we analyze SVs within three closely related species of North American jays (Aphelocoma, scrub-jays) displaying a 60-fold range in effective population size. We find rapid evolution of genome architecture, including ~100 Mb variation in genome size driven by dynamic satellite landscapes with unexpectedly long (> 10 kb) repeat units and widespread variation in gene content, influencing gene expression. SVs exhibit slightly deleterious dynamics modulated by variant length and population size, with strong evidence of adaptive fixation only in large populations. Our results demonstrate how population size shapes the distribution of SVs and the importance of pangenomes to characterizing genomic diversity.

Forty-four genomes from three species of North American scrub jays (Aphelocoma insularis, A. woodhouseii and A. coerulescens) and one outgroup (Yucatán Jay, Cyanocorax yucatanicus) were sequenced using PacBio HiFi technology. The sequence reads were assembled into primary assemblies and two haplotype assemblies using hifiasm (Cheng et al. 2021). We used various pangenome tools, including the Pangenome Graph Builder (PGGB; Garrison et al. 2024) and minigraph (Li et al. 2020), to detect and characterize structural variants, including inversions, within and between species. We used RepeatModeler2 and RepeatMasker to annotate repetitive elements (Smit et al. 2015, Flynn et al. 2020). We conducted demographic analysis with PSMC (Li et al. 2011), bpp (Rannala et al. 2017) and other programs. We used Panacus to estimate growth curves for the pangenome graphs (Parmigiani et al. 2024), and fastDFE (Sendrowski et al. 2024) and anavar (Barton et al. 2018) to estimate the distribution o...

# Data from: Multispecies pangenomes reveal pervasive influence of population size on evolution of structural variants

[https://doi.org/10.5061/dryad.8pk0p2p01](https://doi.org/10.5061/dryad.8pk0p2p01)

## Description of the data and file structure

### Files and variables

**File: RepeatMasker_analysis.tar.gz**

**Description:** This archive contains two files related to the analysis of RepeatMasker outputs:

* **all_haps_repmask_nornd_cat_CS_CY.bed.gz**

  **Description:** This file contains a streamlined version of the output of RepeatMasker for each haplotype in the data set, including outgroups. The file is in BED format. The [RepeatMasker outfile](https://www.repeatmasker.org/webrepeatmaskerhelp.html) was converted to BED format by the [rmsk2bed command of bedops](https://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/rmsk2bed.html). The file contains 6 columns: Reference contig of haplotype; start coordinate of repeat; end coordinate of repeat; the ty...

**Changes after Jul 21, 2025:** Several files were cleaned up and made less redundant, such as the pangene gene graphs (*.gfa), which were in two separate tar files. SVE also added several data tables pertaining to base composition, satellite DNA analysis, RepeatMasker analysis, and pangene gene graphs. These files include basic data tables and also summarize some of the main results reported in the associated paper. The scripts used to generate these files are archived at Zenodo (DOI:10.5281/zenodo.16053688 - see below).

**Changes after Aug 7, 2025:** Updated the sj_annotations.tar.gz files, which now include a GTF file with the gene names for easier integration with the pangene analysis, as well as the correct RepeatMasker annotation BED file of the AW reference. Also updated the FASTA library file used in the RepeatMasker analysis to AW_365336_combined_repeats_v2.fasta.gz, which includes the satellites found by Satellite Repeat Finder.

**Changes after Nov 1...
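
As a rough illustration of how the RepeatMasker BED file described above might be read, the following Python sketch sums repeat-annotated base pairs per contig. It relies only on the first three columns named in the description (contig, start, end). The remaining columns are truncated in this listing, so the sketch ignores them. The function name `repeat_bp_per_contig` is hypothetical, and the tab-delimited layout is an assumption based on the BED convention rather than something this listing confirms.

```python
import gzip
from collections import defaultdict


def repeat_bp_per_contig(bed_path):
    """Sum annotated repeat lengths per contig from a gzipped BED file.

    Assumes tab-delimited BED with contig, start, end in columns 1-3
    (0-based, half-open intervals). Any further columns are ignored.
    Overlapping annotations are counted more than once.
    """
    totals = defaultdict(int)
    with gzip.open(bed_path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            contig, start, end = fields[0], int(fields[1]), int(fields[2])
            totals[contig] += end - start
    return dict(totals)
```

Calling `repeat_bp_per_contig("all_haps_repmask_nornd_cat_CS_CY.bed.gz")` on the extracted file would return a dictionary keyed by contig name. Because overlapping intervals are not merged here, the totals are an upper bound on masked sequence; merging intervals first (for example with bedops or bedtools) would give exact coverage.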

Created:
2025-12-30