Deep-learning-based annotation of 230 superasterid genomes reveals a harmonized dataset of 91,366 NLRs

Source: DataONE. Updated 2025-03-07. Indexed 2025-04-26.

Download link:

https://search.dataone.org/view/sha256:2ccaaec230966ed0c4fedeb7f3bb5fa3d68fd2bfdcaa9a8cd7774af05049bf85


Resource description:

Plant nucleotide-binding leucine-rich repeat receptors (NLRs) are intracellular immune receptors crucial for pathogen recognition and immune responses. Despite their importance, NLRs are often challenging to annotate and are frequently overlooked by standard annotation pipelines. To address the variability in NLR annotation accuracy across pipelines, we performed a harmonized de novo annotation of 230 high-quality superasterid genomes using the deep-learning-based software Helixer (Holst et al. 2023), which yielded 10,124,265 annotated protein sequences. We then used NLRtracker, which relies on InterProScan for domain identification, to detect NLR and NLR-associated sequences (Kourelis et al. 2021, Blum et al. 2025). Using the NLR definition from the RefPlantNLR dataset, we identified 91,366 NLRs, with counts ranging from 12 and 19 in the parasitic plants Cuscuta campestris and Orobanche coerulescens, respectively, to 2,804 in Solanum tuberosum (potato). Beyond NLR annotation...

Helixer v0.3.2 (Stiehler et al. 2020; Holst et al. 2023) was run under Singularity on genome FASTA files with the option '--lineage land_plant', which applies the default model for land plants (land_plant_v0.3_a_0080.h5). Coding DNA sequence (CDS) and protein FASTA files were extracted from the output GFF files using GffRead v0.12.7 (Pertea and Pertea 2020) with the '-x' and '-y' options, respectively. The extracted protein sequences were then analyzed using NLRtracker (Kourelis et al. 2021), which integrates InterProScan v5.65-97.0 (Jones et al. 2014). BUSCO scores were generated using BUSCO v5.5.0 with the options '-m protein --lineage_dataset viridiplantae_odb10' (Manni et al. 2021).

DOI: https://doi.org/10.5061/dryad.sxksn03d6
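The per-genome workflow described above (Helixer, then GffRead, then NLRtracker, then BUSCO) can be sketched as a small Python helper that only assembles the command lines and does not run them. This is an illustrative sketch, not the authors' script. Only '--lineage land_plant', '-x', '-y', and '-m protein --lineage_dataset viridiplantae_odb10' come from the description; the function name, all file and directory names, and the remaining input/output flags (`--fasta-path`, `--gff-output-path`, `-g`, `-s`, `-i`, `-o`) are assumptions that should be checked against each tool's own documentation. The Singularity wrapper is left out because the container image is not named in the description.

```python
from pathlib import Path
import shlex


def build_pipeline_commands(genome_fasta: str, out_dir: str) -> dict[str, list[str]]:
    """Assemble one argv list per step for a single genome (nothing is executed)."""
    out = Path(out_dir)
    stem = Path(genome_fasta).stem
    # Hypothetical output file names, chosen only for this illustration.
    gff = str(out / f"{stem}.helixer.gff3")
    cds = str(out / f"{stem}.cds.fasta")
    proteins = str(out / f"{stem}.proteins.fasta")
    return {
        # '--lineage land_plant' is from the description; the I/O flags are assumed.
        "helixer": ["Helixer.py", "--lineage", "land_plant",
                    "--fasta-path", genome_fasta, "--gff-output-path", gff],
        # '-x' writes CDS and '-y' writes proteins; '-g' (genome FASTA) is assumed.
        "gffread": ["gffread", gff, "-g", genome_fasta, "-x", cds, "-y", proteins],
        # NLRtracker invocation is an assumption ('-s' input, '-o' output directory).
        "nlrtracker": ["NLRtracker.sh", "-s", proteins, "-o", str(out / "nlrtracker")],
        # '-m protein --lineage_dataset viridiplantae_odb10' is from the description.
        "busco": ["busco", "-i", proteins, "-m", "protein",
                  "--lineage_dataset", "viridiplantae_odb10", "-o", f"{stem}_busco"],
    }


if __name__ == "__main__":
    for step, argv in build_pipeline_commands("genome.fasta", "results").items():
        print(f"{step}: {shlex.join(argv)}")
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect and lets a caller pass each list to `subprocess.run` once the flags have been verified for the installed tool versions.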


Created:

2025-03-13
