Microeukaryotic Protein Database (VDB_Microeukaryotic_v1)

Name: Microeukaryotic Protein Database (VDB_Microeukaryotic_v1)
Creator: Espinoza, Josh
Published: 2022-12-21 00:00:00
License: No description available

Source: Figshare. Updated: 2022-12-21. Indexed: 2026-04-08.

Download link:

https://figshare.com/articles/dataset/Microeukaryotic_Protein_Database/19668855/2

Description:

Version: VDB_Microeukaryotic_v1

Contains 4 files:

-rw-r--r-- 1 jespinoz staff 10G Apr 18 19:46 reference.rmdup.iupac.relabeled.no_deprecated.complete_lineage.faa.gz
-rw-r--r-- 1 jespinoz staff 167M Apr 18 19:40 target_to_source.dict.pkl.gz
-rw-r--r-- 1 jespinoz staff 605K Apr 18 19:40 source_to_lineage.dict.pkl.gz
-rw-r--r-- 1 jespinoz staff 542K Apr 18 19:42 source_taxonomy.tsv.gz

* The main FASTA protein file is the dereplicated combination of NR (protists and fungi only), MMETSP, EukZoo, and EukProt. Only entries with complete lineages are included because the database is partially used for classification.
* Files ending in .pkl.gz are gzipped, pickled Python dictionaries.
* target_to_source.dict.pkl.gz maps identifiers in the FASTA file to the original source.
* source_to_lineage.dict.pkl.gz maps source identifiers to lineage strings (e.g., c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;g__Haemoproteus;s__Haemoproteus sp. hCWT4).
* source_taxonomy.tsv.gz has the taxonomy for each source identifier.

Citation:

* Espinoza, J.L., Dupont, C.L. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics 23, 419 (2022). https://doi.org/10.1186/s12859-022-04973-8
* Espinoza, Josh (2022): Microeukaryotic Protein Database. figshare. Dataset. https://doi.org/10.6084/m9.figshare.19668855.v1
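The two .pkl.gz mappings described above can be chained to go from a FASTA identifier to a lineage. The following is a minimal sketch, not the author's code: the function names are hypothetical, and it assumes both dictionaries map strings to strings and that every lineage string uses `;`-separated `rank__name` fields as in the example given in the description.

```python
import gzip
import pickle


def load_pickled_dict(path):
    """Load a gzip-compressed pickled dictionary (a .dict.pkl.gz file)."""
    # Only unpickle files from a source you trust: pickle can run code.
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def parse_lineage(lineage):
    """Split 'c__X;o__Y;...' into {'c': 'X', 'o': 'Y', ...}.

    Assumes ';'-separated 'rank__name' fields, as in the example
    lineage string from the dataset description.
    """
    ranks = {}
    for field in lineage.split(";"):
        prefix, sep, name = field.partition("__")
        if sep:  # skip fields that lack the 'rank__' prefix
            ranks[prefix] = name
    return ranks


def lineage_for_target(target_id, target_to_source, source_to_lineage):
    """Resolve a FASTA identifier to its lineage string, or None if unmapped."""
    source_id = target_to_source.get(target_id)
    if source_id is None:
        return None
    return source_to_lineage.get(source_id)


# Parse the example lineage string quoted in the description.
example = (
    "c__Aconoidasida;o__Haemosporida;f__Haemoproteidae;"
    "g__Haemoproteus;s__Haemoproteus sp. hCWT4"
)
print(parse_lineage(example)["g"])  # prints "Haemoproteus"
```

With the files downloaded, `load_pickled_dict("target_to_source.dict.pkl.gz")` and `load_pickled_dict("source_to_lineage.dict.pkl.gz")` would supply the two dictionaries passed to `lineage_for_target`. Whether the stored keys and values are plain strings should be checked against the actual files.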

Provider:

Espinoza, Josh

Created:

2022-12-21
