Harmonized genome annotation of 346 Solanaceae species reveals 197,834 NLR immune receptors
Source: Zenodo. Updated 2026-04-29; indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.19855163
Description:
The Solanaceae family encompasses many agriculturally important crops, including potato, tomato, eggplant, pepper, and tobacco, as well as a wide range of wild relatives that serve as reservoirs of genetic diversity. Plant nucleotide-binding leucine-rich repeat (NLR) receptors are intracellular immune sensors that recognize pathogen-derived molecules and trigger disease resistance. To support comparative genomics and the study of plant immunity in this family, we curated a dataset of genome annotations for 346 Solanaceae genomes spanning 84 species, with harmonized gene models and NLR predictions.
This release builds on our earlier Solanaceae NLRome resource (Sugihara, Toghani, Kourelis & Kamoun 2023; doi:10.5281/zenodo.10354350), which reported 66,665 NLRs from 124 published Solanaceae proteomes. The new release expands sampling to 346 genomes and replaces the heterogeneous, author-supplied annotations used previously with harmonized de novo annotations produced with the deep-learning software Helixer (Holst et al. 2023), which improves the consistency of gene models across the dataset. NLRs and NLR-associated sequences were then identified with NLRtracker (Kourelis et al. 2021). Using the NLR definition from the RefPlantNLR dataset, we recovered 197,834 NLRs from 15,078,455 predicted proteins.
We provide the underlying genome assemblies, the Helixer-derived gene annotation outputs (proteome, CDS, and GFF files), and the corresponding complete NLRtracker output archives. The dataset also extends our recent superasterid annotation resource (Toghani et al. 2025; doi:10.5061/dryad.sxksn03d6) with deeper, Solanaceae-focused sampling, and is intended to serve as a harmonized resource for comparative genomics, NLR evolution, and immune-receptor engineering in the nightshade family.
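As a rough sense of scale, the headline counts quoted in the description can be turned into simple ratios. The short sketch below only does arithmetic on the numbers stated above; the variable names are illustrative. The two per-genome means are not directly comparable, because the earlier release relied on author-supplied annotations rather than Helixer gene models.

```python
# Headline counts quoted in the description.
nlrs_new, proteins_new, genomes_new = 197_834, 15_078_455, 346
nlrs_old, proteomes_old = 66_665, 124

# Share of all predicted proteins that were called as NLRs in this release.
nlr_fraction = nlrs_new / proteins_new
# Mean number of NLRs per genome in this release.
nlrs_per_genome_new = nlrs_new / genomes_new
# Mean number of NLRs per proteome in the 2023 release.
nlrs_per_proteome_old = nlrs_old / proteomes_old

print(f"NLR share of predicted proteins: {nlr_fraction:.2%}")       # 1.31%
print(f"Mean NLRs per genome (this release): {nlrs_per_genome_new:.1f}")    # 571.8
print(f"Mean NLRs per proteome (2023 release): {nlrs_per_proteome_old:.1f}")  # 537.6
```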
Methods
Helixer v0.3.2 (Stiehler et al. 2020; Holst et al. 2023) was executed using Singularity for genome FASTA files with the option '--lineage land_plant', which applies the default model (land_plant_v0.3_a_0080.h5) for land plants. Coding DNA sequences (CDS) and protein FASTA files were extracted from the output GFF files using GffRead v0.12.7 (Pertea and Pertea 2020) with the '-x' and '-y' options, respectively. The extracted protein sequences were then analyzed using NLRtracker (Kourelis et al. 2021), which integrates InterProScan v5.65-97.0 (Jones et al. 2014).
BUSCO scores were generated using BUSCO v6.0.0 (Manni et al. 2021) with the options '-m protein --lineage_dataset solanales_odb12'.
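The steps described in the Methods can be sketched as a sequence of command lines. The Python helper below only assembles the argument lists and does not run anything. It is an illustration under stated assumptions, not the exact invocation used for this dataset: the Singularity image name, the output file names, the NLRtracker script path, and the NLRtracker '-s'/'-o' arguments are assumptions, while '--lineage land_plant', the GffRead '-x' and '-y' options, and the BUSCO options are taken from the text above.

```python
import shlex


def build_commands(genome_fasta, prefix,
                   helixer_image="helixer.sif",          # hypothetical image file name
                   nlrtracker_script="./NLRtracker.sh"):  # hypothetical script location
    """Return the four pipeline steps as argv lists (nothing is executed)."""
    gff = f"{prefix}.gff3"                 # assumed output naming
    cds = f"{prefix}.cds.fasta"
    proteins = f"{prefix}.proteins.fasta"
    return [
        # 1. De novo gene annotation with Helixer inside a Singularity container.
        ["singularity", "run", "--nv", helixer_image, "Helixer.py",
         "--fasta-path", genome_fasta,
         "--lineage", "land_plant",
         "--gff-output-path", gff],
        # 2. Extract CDS (-x) and protein (-y) sequences from the GFF with GffRead.
        ["gffread", gff, "-g", genome_fasta, "-x", cds, "-y", proteins],
        # 3. NLR prediction; the -s/-o arguments are an assumption, check the
        #    NLRtracker README for the actual interface.
        [nlrtracker_script, "-s", proteins, "-o", f"{prefix}_NLRtracker"],
        # 4. Completeness assessment of the predicted proteome with BUSCO.
        ["busco", "-i", proteins, "-m", "protein",
         "--lineage_dataset", "solanales_odb12", "-o", f"{prefix}_busco"],
    ]


if __name__ == "__main__":
    for argv in build_commands("genome.fasta", "sample"):
        print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps file names with spaces safe and lets each step be passed to `subprocess.run` unchanged if desired.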
NLRtracker output legend:
File extension                 Description
*_NLRtracker.tsv               NLRtracker overview output with gene status.
*_NLR.lst                      Identifier list of NLRs.
*_NLR.gff3                     NLR annotation of motifs, domains, and regions in GFF3 format.
*_NLR.fasta                    NLR FASTA sequences.
*_NLR-associated.lst           Identifier list of NLR-associated genes.
*_NLR-associated.gff3          Annotation of motifs, domains, and regions of NLR-associated genes in GFF3 format.
*_NLR_associated.fasta         FASTA sequences of NLR-associated genes.
*_NBARC.fasta                  NB-ARC domain FASTA sequences.
*_NBARC_deduplictated.fasta    Deduplicated NB-ARC domain FASTA sequences.
*_iTOL.txt                     Domain annotation file for iTOL.
*_iTOL_dedup.txt               Domain annotation file of the deduplicated sequences for iTOL.
*_Domains.tsv                  Full-length and domain sequences and metadata for all NLRtracker output.
interpro_result.gff            InterProScan output of the query proteome.
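The identifier lists and FASTA files in the legend above can be combined, for example to pull the NLR sequences out of a full proteome. The sketch below is a minimal, dependency-free illustration. It assumes that a *_NLR.lst file holds one identifier per line and that the identifier is the first whitespace-delimited token of each FASTA header; neither assumption is confirmed by the record, and the toy data are invented for the example.

```python
def read_id_list(text):
    """Collect identifiers, assuming one per line (first token of each line)."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def iter_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_records(fasta_text, ids):
    """Keep records whose first header token is in the identifier set."""
    return [(h, s) for h, s in iter_fasta(fasta_text)
            if (h.split() or [""])[0] in ids]


if __name__ == "__main__":
    # Toy inputs standing in for a proteome FASTA and a *_NLR.lst file.
    proteome = ">gene1 some description\nMKT\nLLV\n>gene2\nMAA\n>gene3\nMGG\n"
    nlr_ids = read_id_list("gene1\ngene3\n")
    for header, seq in select_records(proteome, nlr_ids):
        print(header.split()[0], len(seq))
```

For real files, the text would be read with `open(path).read()`; for whole proteomes a streaming parser such as Biopython's `SeqIO` may be preferable.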
Raw Genome Files
Tarball              Contents (top-level folders inside the tar)
genomes_part1.tar    NCBI_new/, benthi_genomes/, solgenomics_others/, Lin_et_al_2023_Solanum_americanum/
genomes_part2.tar    From_HelixerDB/, Tang_et_al_2022_potatoes/, Sun_et_al_2025_European_potatoes_renamed/
genomes_part3.tar    Benoit_et_al_2025_Solanum/, Cheng_et_al_2025_potatoes/, Zhou_et_al_2022_tomatoes/, NCBI/
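To check which top-level folders a downloaded tarball contains before extracting it, the archive headers can be scanned with Python's standard `tarfile` module. This is a small sketch; the function name is illustrative, and scanning still reads through the archive, so it can take a while on large genome tarballs.

```python
import tarfile


def top_level_entries(tar_path):
    """Return the sorted top-level names in a tar archive without extracting it."""
    names = set()
    with tarfile.open(tar_path) as tar:
        for member in tar:  # iterating a TarFile yields TarInfo headers
            # Drop empty components and a leading "./" if present.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)


# Example call, assuming the file has been downloaded to the working directory:
# print(top_level_entries("genomes_part1.tar"))
```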
Provider: Zenodo
Created: 2026-04-29



