RefSeq bacterial protein coding (nucleotide) sequences

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10031800

Description:

Bacteria_Nucleotide.fas.gz

151,835,459 protein coding (nucleotide) sequences extracted from 44,831 randomly selected bacterial genomes from NCBI's RefSeq (release 220). Sequences are named by their accession number, followed by "|" and their PGAP predicted function ("protein" tag). For example, the first sequence is named:

WP_125174066.1|iron ABC transporter permease

The process of creating the file involved the following steps.

Step 1. Download 318,613 faa and fna files associated with a bacterial assembly in RefSeq. The following query was used:

esearch -db assembly -query '"Bacteria"[Organism] AND "latest refseq"[properties] AND "refseq has annotation"[properties]' | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq

Step 2. Verify that all protein coding sequences match the expected protein sequence lengths to within three codons; otherwise skip the assembly.

Step 3. Remove all redundant protein coding or protein sequences in a genome. Only exact duplicates were removed, but they were removed from both nucleotides and proteins. Hence, a duplicated amino acid sequence would be discarded along with its coding sequence even if the coding sequence was unique. This was done to keep the two sets of sequences consistent.

Step 4. Name sequences by their accession and PGAP predicted function, separated by a "|" character. The PGAP predicted function is generally uniform, although there are subtle differences between some taxon-specific predictions. The predicted function is reasonably dependable but certainly not perfect.

Step 5. Discard any sequences without a predicted function ("hypothetical protein"). These were discarded under the assumption that the protein's function would be required for downstream uses of the sequences.

Step 6. Append protein and protein coding (nucleotide) sequences from randomly ordered assemblies to separate gzipped FASTA formatted files until the Zenodo file size limit was met for either file.
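The per-genome filtering in Steps 2 to 5 can be sketched in Python as follows. This is an illustrative sketch, not the author's pipeline: the Record type and all function names are hypothetical, the length check assumes that "within three codons" compares the codon count of the coding sequence with the protein length, and the deduplication assumes that the first occurrence of a duplicated sequence is kept.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One matched pair of sequences from a single genome (hypothetical type)."""
    accession: str
    function: str  # PGAP predicted function ("protein" tag)
    cds: str       # protein coding (nucleotide) sequence
    protein: str   # amino acid sequence


def length_consistent(rec: Record, max_codon_diff: int = 3) -> bool:
    # Step 2 (assumed interpretation): the number of whole codons may differ
    # from the protein length by at most three.
    return abs(len(rec.cds) // 3 - len(rec.protein)) <= max_codon_diff


def assembly_passes(records: List[Record]) -> bool:
    # Step 2: a single failing sequence causes the whole assembly to be skipped.
    return all(length_consistent(rec) for rec in records)


def filter_genome(records: List[Record]) -> List[Record]:
    # Step 3: drop a record if either its nucleotide or its amino acid
    # sequence was already seen in this genome (assumption: keep the first).
    # Step 5: afterwards, drop records labelled "hypothetical protein".
    seen_cds, seen_protein = set(), set()
    kept = []
    for rec in records:
        if rec.cds in seen_cds or rec.protein in seen_protein:
            continue
        seen_cds.add(rec.cds)
        seen_protein.add(rec.protein)
        if rec.function == "hypothetical protein":
            continue
        kept.append(rec)
    return kept


def fasta_name(rec: Record) -> str:
    # Step 4: accession and predicted function separated by "|".
    return f"{rec.accession}|{rec.function}"
```

Because a record is dropped when either of its two sequences repeats, a unique coding sequence is still discarded if its translation duplicates an earlier protein, which keeps the nucleotide and protein sets matched one to one.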
Because assemblies were appended independently, the set contains many exact duplicate sequences across genomes, but none within any single genome. The final sets of sequences are intended to provide large sets of matched protein coding (nucleotide) and protein (amino acid) sequences with consistent labels. The FASTA descriptions in both files are identical. Note that the protein coding sequences do not translate exactly into the protein sequences, because of slight differences in length (typically inclusion or exclusion of the first or last codon) and because different translation tables are used depending on the organism. See Related works for the companion file of protein (amino acid) sequences (DOI: 10.5281/zenodo.10030000).
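For readers who want to consume the file, a minimal sketch of streaming the gzipped FASTA and splitting each name into accession and predicted function is shown below. It uses only the Python standard library; the function names are hypothetical, and splitting on the first "|" is an assumption based on the naming scheme described above.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def parse_name(header: str) -> Tuple[str, str]:
    # Split "accession|predicted function" on the first "|" only, so a "|"
    # inside the function text (if any) is preserved.
    accession, _, function = header.lstrip(">").strip().partition("|")
    return accession, function


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    # Yield (accession, function, sequence) one record at a time, so the
    # whole file never has to be held in memory.
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield (*parse_name(name), "".join(chunks))
            name, chunks = line, []
        else:
            chunks.append(line)
    if name is not None:
        yield (*parse_name(name), "".join(chunks))


def read_gz_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    # Open a gzipped FASTA file in text mode and stream its records.
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

Calling read_gz_fasta("Bacteria_Nucleotide.fas.gz") and iterating over the result walks the records in file order. Since the FASTA descriptions in the two files are identical, the same parser should apply to the companion protein file.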

Created:

2023-10-24
