RefSeq bacterial protein coding (nucleotide) sequences

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/10031800
Description:
Bacteria_Nucleotide.fas.gz

151,835,459 protein coding (nucleotide) sequences extracted from 44,831 randomly selected bacterial genomes from NCBI's RefSeq (release 220). Sequences are named by their accession number, followed by "|" and their PGAP predicted function ("protein" tag). For example, the first sequence is named:

WP_125174066.1|iron ABC transporter permease

The process of creating the file involved the following steps.

Step 1. Download 318,613 faa and fna files associated with a bacterial assembly in RefSeq. The following query was used:

esearch -db assembly -query '"Bacteria"[Organism] AND "latest refseq"[properties] AND "refseq has annotation"[properties]' | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq

Step 2. Verify that all protein coding sequences match the expected protein sequence lengths within three codons; otherwise skip the assembly.

Step 3. Remove all redundant protein coding or protein sequences in a genome. Only exact duplicates were removed, but they were removed from both nucleotides and proteins. Hence, a duplicated amino acid sequence would be discarded along with its coding sequence even if the coding sequence was unique. This was done to keep the two sets of sequences consistent.

Step 4. Name sequences by their accession and PGAP predicted function, separated by a "|" character. The PGAP predicted function is generally uniform, although there are subtle differences between some taxon-specific predictions. The predicted function is reasonably dependable but certainly not perfect.

Step 5. Discard any sequences without a predicted function ("hypothetical protein"). These were discarded under the assumption that the protein's function would be required for downstream uses of the sequences.

Step 6. Append protein and protein coding (nucleotide) sequences from randomly ordered assemblies to separate gzipped FASTA formatted files until the Zenodo file size limit was met for either file.
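Steps 2 through 5 can be illustrated with a minimal Python sketch. This is not the code used to build the file; it is one reading of the description under stated assumptions. The Record type and the function names are hypothetical, the length check compares whole codons in the coding sequence against the number of amino acids, and "removing duplicates" is taken to mean keeping the first occurrence within a genome.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One matched pair of sequences (hypothetical container for this sketch)."""
    accession: str
    function: str   # PGAP predicted function ("protein" tag)
    cds: str        # protein coding (nucleotide) sequence
    protein: str    # amino acid sequence


def lengths_consistent(rec: Record, tolerance_codons: int = 3) -> bool:
    # Step 2 (assumed interpretation): the number of whole codons in the
    # coding sequence may differ from the protein length by at most three.
    return abs(len(rec.cds) // 3 - len(rec.protein)) <= tolerance_codons


def filter_genome(records: List[Record]) -> List[Record]:
    # Step 2: skip the whole assembly if any pair fails the length check.
    if not all(lengths_consistent(r) for r in records):
        return []

    seen_cds, seen_protein = set(), set()
    kept = []
    for r in records:
        # Step 3: drop the pair if EITHER sequence is an exact duplicate
        # within this genome (assumption: the first occurrence is kept).
        if r.cds in seen_cds or r.protein in seen_protein:
            continue
        seen_cds.add(r.cds)
        seen_protein.add(r.protein)
        # Step 5: discard sequences without a predicted function.
        if r.function == "hypothetical protein":
            continue
        kept.append(r)
    return kept


def fasta_name(rec: Record) -> str:
    # Step 4: accession and predicted function separated by "|".
    return f"{rec.accession}|{rec.function}"
```

Because the duplicate check looks at both sequences, a record whose coding sequence is unique is still dropped when its amino acid sequence has already been seen in the same genome, which keeps the nucleotide and protein sets the same size.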
Hence, the set contains many exact duplicate sequences across genomes, but none among the sequences from any single genome. The final sets of sequences are intended to provide large sets of matched protein coding (nucleotide) and protein (amino acid) sequences with consistent labels. The FASTA descriptions in both files are identical. Note that the protein coding sequences do not translate exactly into the protein sequences, because of slight differences in length (typically inclusion or exclusion of the first or last codon) and because different translation tables are used depending on the organism. See Related works for the companion file of protein (amino acid) sequences (DOI: 10.5281/zenodo.10030000).
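For readers who want to consume the file, the following is a minimal sketch of reading the gzipped FASTA with the Python standard library and splitting each name into accession and predicted function. The function names iter_fasta and read_gz are hypothetical. The name is split on the first "|" only, on the assumption that the accession never contains that character while a function description might.

```python
import gzip
from typing import Iterable, Iterator, List, Optional, Tuple


def _finish(name: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Split on the first "|" only; anything after it is the function text.
    accession, _, function = name.partition("|")
    return accession, function, "".join(chunks)


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, function, sequence) for each FASTA record."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield _finish(name, chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield _finish(name, chunks)


def read_gz(path: str) -> Iterator[Tuple[str, str, str]]:
    # Stream records from a gzipped FASTA file without loading it whole,
    # e.g. read_gz("Bacteria_Nucleotide.fas.gz").
    with gzip.open(path, "rt") as handle:
        yield from iter_fasta(handle)
```

Streaming matters here because the file holds over 150 million sequences. Since the FASTA descriptions in both files are identical, the same parser should apply to the companion protein file.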
Created:
2023-10-24