Annotated sequences extracted from bacterial genomes

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/7970589

Description:

Three files containing sequences extracted from 1,049,210 bacterial genomes available from GenBank (release 252). Protein coding sequences were annotated with IDTAXA (PMID: 34541527) using taxon-specific KEGG groups (Bacteria_Protein_subset.fas.gz). These annotations were transferred to their corresponding (nucleotide) coding sequences (Bacteria_Nucleotide_subset.fas.gz). Intergenic regions were extracted from each genome and annotated by FindNonCoding (PMID: 34636849) for their overlap with any of 25 common bacterial non-coding RNAs in Rfam (v14). Intergenic regions were required to be at least 100 nucleotides long and contain no ambiguities (Bacteria_Intergenic_subset.fas.gz). Each subset contains only distinct sequences randomly ordered.

Headers

Sequence headers contain the assembly accession followed by the annotation and separated by a "|" character. For example:

Bacteria_Intergenic_subset.fas.gz
>GCA_022121725.1|RF00000
ATGTTACCTTCTTGAGTGATACGGGATGAA[...]

Bacteria_Protein_subset.fas.gz
>GCA_014764685.1|K02049
MPRDLIRISGLEKTYADGSVHALSNIDLSIKD[...]

Bacteria_Nucleotide_subset.fas.gz
>GCA_015948525.1|K02197
GTGAACCTGCGACGTAAAAACCGGCTAYG[...]

Annotations

Protein and protein coding sequences are labeled with their KEGG group, starting with "K". Intergenic sequences are named by any overlapping Rfam families, starting with a "RF", and separated by commas when multiple are predicted. "RF00000" is a placeholder for the absence of any predicted RF families.
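The header convention described above can be read with a few lines of Python. The following is a minimal sketch, not code distributed with the dataset: the function names `parse_header` and `read_fasta` are hypothetical, and it assumes the files are ordinary gzip-compressed FASTA in which a sequence may span several lines.

```python
import gzip
from typing import Iterator, List, Tuple


def parse_header(header: str) -> Tuple[str, List[str]]:
    """Split '>ACCESSION|ANNOTATION' into the accession and a list of labels.

    Hypothetical helper: labels are KEGG groups ("K...") or Rfam families ("RF...").
    """
    accession, _, annotation = header.lstrip(">").strip().partition("|")
    # Multiple Rfam families are separated by commas.
    labels = [label for label in annotation.split(",") if label]
    # "RF00000" is the placeholder for "no predicted Rfam family".
    if labels == ["RF00000"]:
        labels = []
    return accession, labels


def read_fasta(path: str) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (accession, labels, sequence) for each record in a gzipped FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield (*parse_header(header), "".join(chunks))
                header, chunks = line, []
            elif line:
                chunks.append(line)
    # Emit the final record after the file ends.
    if header is not None:
        yield (*parse_header(header), "".join(chunks))
```

For instance, `parse_header(">GCA_022121725.1|RF00000")` returns `("GCA_022121725.1", [])`, and iterating `read_fasta("Bacteria_Intergenic_subset.fas.gz")` would stream records one at a time without loading the whole file into memory.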

Created:

2023-05-26
