Bacillus pseudomycoides CHAES I 2_2 Prokka genome annotation
Source: Figshare. Updated: 2026-02-20. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/_i_Bacillus_pseudomycoides_i_CHAES_I_2_2_Prokka_genome_annotation/31382134
Description:
A taxonomically targeted reference dataset was constructed for members of the family Bacillaceae. Taxonomic identifiers (taxIDs) corresponding to Bacillaceae species were obtained using the ETE3 NCBI Taxonomy toolkit, which locally mirrors the NCBI taxonomy database. The taxonomy database was updated using NCBITaxa().update_taxonomy_database(), after which all descendant taxIDs belonging to the Bacillaceae family were retrieved programmatically and saved as a plain-text list.

Based on this taxID list, publicly available genome assemblies were downloaded from the NCBI GenBank database using the ncbi-genome-download utility. Genome assemblies classified as complete genome, chromosome, or scaffold level were retained. Downloaded data were organized in the standard GenBank directory structure, where each assembly directory contained genomic FASTA files (*_genomic.fna.gz), annotated protein sequences (*_protein.faa.gz), and corresponding GenBank annotation files (*_genomic.gbff.gz).

For downstream functional annotation, annotated protein sequences from the downloaded Bacillaceae genomes were used as a custom reference dataset. Protein FASTA files (*_protein.faa.gz) were extracted from all GenBank assembly directories and combined into a single reference protein collection. This dataset served as the taxonomically restricted protein database for homology-based annotation.

Genome annotation was performed using Prokka v1.15.6, executed within a Docker container (staphb/prokka:latest) to ensure software reproducibility and to avoid dependency conflicts associated with local installations.

The target genome assembly (CHAES_2_2.fna) was annotated using Prokka in bacterial annotation mode. The annotation pipeline included:
- Prodigal for coding sequence prediction
- BLASTP searches against the custom Bacillaceae protein dataset
- Integration of genus-level annotation heuristics where applicable

Annotation was executed with multi-threading enabled to improve performance.
Output files included annotated GFF3, GenBank, protein FASTA, and nucleotide FASTA files, generated in a dedicated output directory.
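
The reference-building and annotation steps described above could be sketched in Python roughly as follows. This is an illustrative sketch, not the exact commands used for this dataset: the function names, the container mount points (/data/in, /data/ref, /data/out), the default thread count, and the optional genus argument are all assumptions. The Prokka options shown (--kingdom, --proteins, --cpus, --outdir, --prefix, --genus, --usegenus) exist in Prokka, but the description does not state which flags were actually passed.

```python
import gzip
from pathlib import Path


def combine_protein_fastas(genbank_root, combined_path):
    """Merge every *_protein.faa.gz found under genbank_root into one
    uncompressed FASTA file. Returns the number of files merged."""
    sources = sorted(Path(genbank_root).rglob("*_protein.faa.gz"))
    with open(combined_path, "w") as out:
        for src in sources:
            with gzip.open(src, "rt") as handle:
                for line in handle:
                    # Guarantee a newline so the next file's first header
                    # never gets glued onto a sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(sources)


def build_prokka_command(assembly, proteins, outdir, cpus=8, genus=None,
                         image="staphb/prokka:latest"):
    """Build (but do not run) a `docker run` argument list for Prokka.

    The mount points inside the container are hypothetical; any paths
    would work as long as the Prokka arguments refer to them consistently.
    """
    assembly = Path(assembly).resolve()
    proteins = Path(proteins).resolve()
    outdir = Path(outdir).resolve()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{assembly.parent}:/data/in:ro",
        "-v", f"{proteins.parent}:/data/ref:ro",
        "-v", f"{outdir.parent}:/data/out",
        image,
        "prokka",
        "--kingdom", "Bacteria",                      # bacterial annotation mode
        "--proteins", f"/data/ref/{proteins.name}",   # custom Bacillaceae proteins
        "--cpus", str(cpus),                          # multi-threading
        "--outdir", f"/data/out/{outdir.name}",
        "--prefix", assembly.stem,
    ]
    if genus is not None:
        # Assumed way of enabling genus-level heuristics.
        cmd += ["--genus", genus, "--usegenus"]
    cmd.append(f"/data/in/{assembly.name}")
    return cmd
```

A possible usage would be to call combine_protein_fastas("genbank", "bacillaceae_proteins.faa"), then pass the list returned by build_prokka_command("CHAES_2_2.fna", "bacillaceae_proteins.faa", "prokka_out") to subprocess.run(..., check=True). The directory and file names in that call are placeholders. Keeping command construction separate from execution makes it easy to inspect or log the exact invocation before anything is run.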
Created:
2026-02-20



