Annotated sequences extracted from bacterial genomes

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7970589
Description:
Three files containing sequences extracted from 1,049,210 bacterial genomes available from GenBank (release 252).

Protein coding sequences were annotated with IDTAXA (PMID: 34541527) using taxon-specific KEGG groups (Bacteria_Protein_subset.fas.gz). These annotations were transferred to their corresponding (nucleotide) coding sequences (Bacteria_Nucleotide_subset.fas.gz). Intergenic regions were extracted from each genome and annotated by FindNonCoding (PMID: 34636849) for their overlap with any of 25 common bacterial non-coding RNAs in Rfam (v14). Intergenic regions were required to be at least 100 nucleotides long and contain no ambiguities (Bacteria_Intergenic_subset.fas.gz). Each subset contains only distinct sequences, randomly ordered.

Headers

Sequence headers contain the assembly accession followed by the annotation, separated by a "|" character. For example:

Bacteria_Intergenic_subset.fas.gz
>GCA_022121725.1|RF00000
ATGTTACCTTCTTGAGTGATACGGGATGAA[...]

Bacteria_Protein_subset.fas.gz
>GCA_014764685.1|K02049
MPRDLIRISGLEKTYADGSVHALSNIDLSIKD[...]

Bacteria_Nucleotide_subset.fas.gz
>GCA_015948525.1|K02197
GTGAACCTGCGACGTAAAAACCGGCTAYG[...]

Annotations

Protein and protein-coding sequences are labeled with their KEGG group, starting with "K". Intergenic sequences are named by any overlapping Rfam families, each starting with "RF" and separated by commas when multiple are predicted. "RF00000" is a placeholder for the absence of any predicted Rfam families.
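The header convention described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset or its official tooling. The function names parse_header and read_fasta are hypothetical. The sketch assumes the files are ordinary gzip-compressed FASTA, possibly with sequences wrapped over several lines.

```python
import gzip
from typing import Iterator, List, Tuple

# Placeholder label the description uses for "no Rfam family predicted".
NO_RFAM_HIT = "RF00000"


def parse_header(header: str) -> Tuple[str, List[str]]:
    """Split '>ACCESSION|ANNOTATION' into the accession and a list of labels.

    Comma-separated annotations become separate labels, and the RF00000
    placeholder is dropped, so an unannotated intergenic region yields [].
    """
    accession, _, annotation = header.lstrip(">").strip().partition("|")
    labels = [a for a in annotation.split(",") if a and a != NO_RFAM_HIT]
    return accession, labels


def read_fasta(path: str) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (accession, labels, sequence) for each record in a .fas.gz file."""
    header = None
    chunks: List[str] = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    accession, labels = parse_header(header)
                    yield accession, labels, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        accession, labels = parse_header(header)
        yield accession, labels, "".join(chunks)
```

For example, iterating over read_fasta("Bacteria_Intergenic_subset.fas.gz") streams records one at a time, which avoids loading a whole subset into memory. Dropping RF00000 is a design choice of this sketch; keep it in the list if the explicit "no hit" label is needed downstream.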
Created:
2023-05-26