
Bacillus pseudomycoides CHAES I 2_2 Prokka genome annotation

Source: Figshare. Updated 2026-02-20. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/_i_Bacillus_pseudomycoides_i_CHAES_I_2_2_Prokka_genome_annotation/31382134
Description:
A taxonomically targeted reference dataset was constructed for members of the family Bacillaceae. Taxonomic identifiers (taxIDs) corresponding to Bacillaceae species were obtained using the ETE3 NCBI Taxonomy toolkit, which locally mirrors the NCBI taxonomy database. The taxonomy database was updated using NCBITaxa().update_taxonomy_database(), after which all descendant taxIDs belonging to the Bacillaceae family were retrieved programmatically and saved as a plain-text list.

Based on this taxID list, publicly available genome assemblies were downloaded from the NCBI GenBank database using the ncbi-genome-download utility. Genome assemblies classified as complete genome, chromosome, or scaffold level were retained. Downloaded data were organized in the standard GenBank directory structure, where each assembly directory contained genomic FASTA files (*_genomic.fna.gz), annotated protein sequences (*_protein.faa.gz), and corresponding GenBank annotation files (*_genomic.gbff.gz).

For downstream functional annotation, annotated protein sequences from the downloaded Bacillaceae genomes were used as a custom reference dataset. Protein FASTA files (*_protein.faa.gz) were extracted from all GenBank assembly directories and combined into a single reference protein collection. This dataset served as the taxonomically restricted protein database for homology-based annotation.

Genome annotation was performed using Prokka v1.15.6, executed within a Docker container (staphb/prokka:latest) to ensure software reproducibility and to avoid dependency conflicts associated with local installations.

The target genome assembly (CHAES_2_2.fna) was annotated using Prokka in bacterial annotation mode. The annotation pipeline included:

- Prodigal for coding sequence prediction
- BLASTP searches against the custom Bacillaceae protein dataset
- Integration of genus-level annotation heuristics where applicable

Annotation was executed with multi-threading enabled to improve performance. Output files included annotated GFF3, GenBank, protein FASTA, and nucleotide FASTA files, generated in a dedicated output directory.
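The workflow described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' actual scripts: the function names, the file names bacillaceae_proteins.faa and prokka_out, the /data mount point, and the thread count are all hypothetical, and the exact Prokka options used for this dataset are not given in the description. The ncbi-genome-download step is omitted here.

```python
import gzip
from pathlib import Path


def write_descendant_taxids(out_path, family="Bacillaceae"):
    """Save every descendant taxID of `family` as a plain-text list.

    Requires the third-party ete3 package plus network access, so the
    import is kept local and the other helpers work without it.
    """
    from ete3 import NCBITaxa

    ncbi = NCBITaxa()
    ncbi.update_taxonomy_database()  # refresh the local NCBI taxonomy mirror
    family_taxid = ncbi.get_name_translator([family])[family][0]
    taxids = ncbi.get_descendant_taxa(family_taxid, intermediate_nodes=True)
    Path(out_path).write_text("\n".join(str(t) for t in taxids) + "\n")
    return taxids


def combine_protein_fastas(genbank_root, out_fasta):
    """Concatenate every *_protein.faa.gz under genbank_root into one FASTA.

    Returns the number of source files that were merged.
    """
    sources = sorted(Path(genbank_root).rglob("*_protein.faa.gz"))
    with open(out_fasta, "w") as out:
        for src in sources:
            with gzip.open(src, "rt") as handle:
                for line in handle:
                    # Guard against a missing final newline so that records
                    # from adjacent files never run together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(sources)


def build_prokka_command(workdir, assembly="CHAES_2_2.fna",
                         proteins="bacillaceae_proteins.faa",
                         outdir="prokka_out", prefix="CHAES_2_2", cpus=8):
    """Build (but do not run) a Docker command line for Prokka.

    Assumption: the flag set below is a plausible reading of the
    description (bacterial mode, custom protein database, multi-threading).
    """
    mount = f"{Path(workdir).resolve()}:/data"  # hypothetical mount point
    return [
        "docker", "run", "--rm", "-v", mount, "-w", "/data",
        "staphb/prokka:latest",
        "prokka",
        "--kingdom", "Bacteria",   # bacterial annotation mode
        "--proteins", proteins,    # custom Bacillaceae protein collection
        "--cpus", str(cpus),       # multi-threading
        "--outdir", outdir,
        "--prefix", prefix,
        assembly,
    ]
```

To run the annotation, the returned list could be passed to subprocess.run(build_prokka_command("."), check=True) from a directory that holds both the assembly and the combined protein FASTA. Building the command as a list rather than a shell string avoids quoting problems with paths.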
Created:
2026-02-20