
CHAMOIS datasets: Pfam domains and ChemOnt-classified metabolites for experimentally-verified BGCs.

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/15009031
Description:
Overview

CHAMOIS is a fast method for predicting chemical features of natural products produced by Biosynthetic Gene Clusters (BGCs) using only their genomic sequence. It can be used to get chemical features from BGCs predicted in silico with tools such as GECCO or antiSMASH. It is developed by the Zeller Lab at LUMC and EMBL.

This record contains the training data for CHAMOIS version 1.0, as well as the data and scripts supporting the analyses presented in the paper. The code for the CHAMOIS tool can be found in the zellerlab/CHAMOIS repository on GitHub.

The various HDF5 files are intended to be opened with the anndata library (Virshup 2024) using the anndata.read_h5ad function. Alternatively, the observations metadata, variables metadata, and data tables are given as tab-separated-values (TSV) files in the same folder under the names obs.tsv, var.tsv and X.tsv respectively.

Contents

Each dataset is an archive containing the following files:

- features.hdf5 and features folder: The Pfam v36.0 domain vectors for each BGC of the dataset.
- classes.hdf5 and classes folder: The predicted ChemOnt classes for a selected compound of each BGC of the dataset.
- compound.json: A JSON file listing all compounds per BGC (as the BGCs in classes.hdf5 only have the classification for a single compound).
- types.tsv: The MIBiG types (Polyketide, NRP, RiPP, etc.) for each BGC of the dataset (if any).
- taxonomy.tsv: The taxonomy for the host of each BGC of the dataset (if any).

CHAMOIS can be trained and evaluated on these domains directly using the chamois train and chamois cv commands:

    chamois train -f features.hdf5 -c classes.hdf5 -o model.json
    chamois cv -f features.hdf5 -c classes.hdf5 -o report.tsv

Datasets

MIBiG 2.0

This dataset contains 1,517 annotated BGCs released in MIBiG 2.0 (Kautsar 2019), excluding some records from a manually curated list, and with manual corrections in BGC coordinates and compound assignment. It also excludes the BGCs that were deprecated or removed in MIBiG 3.1 to avoid low-quality entries.

MIBiG 3.1

This dataset contains 2,068 annotated BGCs released in MIBiG 3.1 (Terlouw 2023), excluding some records from a manually curated list, and with manual corrections in BGC coordinates and compound assignment.

Benchmark

This dataset contains 52 annotated BGCs, found in literature in their native context (the complete host genome) and used in the BGC screening benchmark of the CHAMOIS paper. The BGCs are distinct from those in the MIBiG 2.0 and 3.1 datasets, so this dataset can be used as an external validation set if needed, although some clusters still exhibit moderate similarity. The dataset also contains the complete sequences of the 50 genomes containing the BGCs.

PRISM 4

This dataset contains 1,267 annotated BGCs from the "Gold Standard BGCs" published in PRISM 4 (Skinnider 2020). It overlaps with the MIBiG datasets.
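
The two ways of opening the data tables mentioned in the overview (HDF5 through anndata, or the TSV files in the matching folder) could be sketched in Python as below. This is a minimal sketch, not part of the record: the helper names load_h5ad and load_tsv_matrix are hypothetical, and the TSV reader assumes that X.tsv carries column names in its first row and BGC identifiers in its first column, which the description does not state.

```python
import csv
from pathlib import Path


def load_h5ad(path):
    """Open one of the HDF5 files (e.g. features.hdf5) with anndata.

    Requires the third-party anndata package; it is imported lazily so
    that the TSV fallback below works without it.
    """
    import anndata
    return anndata.read_h5ad(path)


def load_tsv_matrix(folder):
    """Read X.tsv from a `features` or `classes` folder using only the stdlib.

    ASSUMPTION: the first row holds column names and the first column
    holds BGC identifiers. Values are returned as strings, untouched.
    """
    with open(Path(folder) / "X.tsv", newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    columns = header[1:]  # skip the (assumed) identifier column
    matrix = {row[0]: dict(zip(columns, row[1:])) for row in body}
    return columns, matrix
```

With anndata installed, load_h5ad("features.hdf5") returns an AnnData object whose obs, var and X attributes correspond to the obs.tsv, var.tsv and X.tsv tables. Without it, load_tsv_matrix("features") gives a plain dictionary keyed by BGC identifier, provided the layout assumption above holds for the files in the archive.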

Created: 2025-03-18