Multifaceted quality assessment of gene repertoire annotation with OMArk

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/6462026
Description:
Dataset associated with the OMArk paper. It contains eight archives:

1. Supplementary_Tables: The supplementary table files referred to in the paper.
2. OMAmerDB: The OMAmer database constructed from the whole dataset of the OMA database (November 2022 release) and used in the paper. An OMAmer database is required to run OMArk.
3. Simulation: Proteomes with artificially introduced errors, contaminants, or depleted completeness, used to assess OMArk's performance. The archive contains the generated proteomes (Simulated_Data) and their OMArk quality assessments (omark). It also contains the OMAmer results (OMAmerResults) that were used to run OMArk, and BUSCO completeness assessments (BUSCO). Note that, for storage efficiency, only the non-redundant part of the data (added errors, added contamination, random fraction of proteomes) is stored there. The full modified proteomes can be regenerated from these data and the source proteomes.
4. Reference Proteomes: The UniProt Reference Proteomes (Proteomes) (2021_04) and their proteome quality assessment results according to OMArk. The archive contains the source proteome FASTA files (Source folder), OMAmer results for these proteomes (omamer folder), OMArk results (omark folder), and BUSCO completeness assessments (BUSCO folder). It also contains a subfolder holding part of the contamination detection experiment (Contamination folder).
5. Ensembl_Metazoa_AssemblyChange: Ensembl Metazoa proteomes whose version changed between version 52 and version 54, together with their quality assessment results for both versions. The archive contains the source proteome FASTA files (Source folder), a splice file that groups together all proteins encoded by the same gene (Splice folder), OMAmer results for the proteomes (omamer folder), and the OMArk results (omark folder).
6. MissingGenesBLAST: Sequences of HOGs considered missing in the human proteome, which were used to search for the corresponding sequences in the human genome.
7. Ensembl_NCBI_Results: OMArk and BUSCO results for Ensembl and NCBI proteomes. These results were used to evaluate OMArk bias due to the source of the proteomes in the OMA database.
8. Notebooks: Jupyter Notebooks that were used to perform the analyses described in the paper.

Created:
2023-10-23