MassSpecGym

Source: arXiv. Updated 2024-10-30; indexed 2024-11-02.

Download link:

https://github.com/pluskal-lab/MassSpecGym


Overview:

MassSpecGym is a comprehensive benchmark dataset created by the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences and other institutions. It addresses the problem of discovering and identifying molecules from MS/MS data. The dataset contains 231,000 high-quality MS/MS spectra representing 29,000 unique molecular structures, which makes it the largest publicly available dataset of its kind to date. Its construction includes strict quality assessment and a data-splitting procedure designed to keep the data high-quality and free of leakage between splits. Application areas include biomedicine, the chemical sciences, drug development, and environmental analysis. By standardizing the dataset and the evaluation protocols, MassSpecGym aims to advance methods for MS/MS spectrum annotation.

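Contributing institutions:

Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences; Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University; Bioinformatics Group, Wageningen University & Research; Department of Computer Science, University of Toronto; Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena; Department of Computer Science, University of Antwerp; Department of Computer Science, University of Alberta; Alberta Machine Intelligence Institute; Department of Molecular Genetics, University of Toronto; Bright Giant GmbH; Department of Computer Science, Tufts University; Department of Biological Sciences, University of Alberta; Department of Computer Science, Aalto University; Mass Spectrometry Data Center, National Institute of Standards and Technology; Department of Chemical and Biological Engineering, Tufts University; Centre for Digitalisation and Digitality, Düsseldorf University of Applied Sciences; Department of Biochemistry, University of Johannesburg; Swiss Federal Institute of Aquatic Science and Technology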

Created:

2024-10-30

Summary of original information

MassSpecGym: A benchmark for the discovery and identification of molecules

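Dataset overview

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

De novo molecule generation (MS/MS spectrum → molecular structure)
- Chemical formula challenge (MS/MS spectrum + chemical formula → molecular structure)
Molecule retrieval (MS/MS spectrum → ranked list of candidate molecular structures)
- Chemical formula challenge (MS/MS spectrum + chemical formula → ranked list of candidate molecular structures)
Spectrum simulation (molecular structure → MS/MS spectrum)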

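Dataset components

MassSpecGym dataset: available as a Hugging Face dataset and downloadable in code into a pandas DataFrame.
Data transforms: transformation tools for spectra and molecules that preprocess the data for machine learning models.
MassSpecDataModule: a PyTorch Lightning LightningDataModule that automatically handles data splitting and batch loading.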
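
To make the idea of a spectrum transform concrete, here is a minimal pure-Python sketch of one plausible tokenization: keep the most intense peaks, normalize intensities, and pad to a fixed length so spectra can be batched. The function `tokenize_spectrum` and its exact behavior are assumptions for illustration only; this is not the library's `SpecTokenizer` and may differ from it in ordering, normalization, and padding.

```python
def tokenize_spectrum(mzs, intensities, n_peaks):
    """Return a fixed-length list of [m/z, relative intensity] pairs.

    Hypothetical helper: keeps the n_peaks most intense peaks, orders them
    by m/z, scales intensities to [0, 1], and zero-pads short spectra.
    """
    # Select the n_peaks most intense peaks.
    peaks = sorted(zip(mzs, intensities), key=lambda p: p[1], reverse=True)[:n_peaks]
    # Restore ascending m/z order.
    peaks.sort(key=lambda p: p[0])
    # Normalize by the base peak; fall back to 1.0 for empty or all-zero input.
    max_intensity = max((i for _, i in peaks), default=1.0) or 1.0
    tokens = [[mz, i / max_intensity] for mz, i in peaks]
    # Zero-pad so every spectrum has exactly n_peaks rows.
    tokens += [[0.0, 0.0] for _ in range(n_peaks - len(tokens))]
    return tokens


# Made-up spectrum with four peaks, truncated to the three most intense.
tokens = tokenize_spectrum(
    mzs=[50.0, 77.0, 105.0, 122.0],
    intensities=[10.0, 200.0, 50.0, 100.0],
    n_peaks=3,
)
print(tokens)  # [[77.0, 1.0], [105.0, 0.25], [122.0, 0.5]]
```

Each row has two features, which is the per-peak input shape a model with a first layer such as `nn.Linear(2, hidden_channels)` would expect.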

Model implementations

DeNovoMassSpecGymModel
RetrievalMassSpecGymModel
SimulationMassSpecGymModel

Usage examples

Loading the data

```python
from massspecgym.utils import load_massspecgym

# Download the dataset and load it into a pandas DataFrame
df = load_massspecgym()
```

Dataset and transforms

```python
from massspecgym.data import MassSpecDataset
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter

# Spectra are tokenized to at most 60 peaks; molecules become fingerprints
dataset = MassSpecDataset(
    spec_transform=SpecTokenizer(n_peaks=60),
    mol_transform=MolFingerprinter(),
)
```

Data module

```python
from massspecgym.data import MassSpecDataModule

# Wraps the dataset and handles splitting and batching
data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=32
)
```

Model training and evaluation

```python
import torch
import torch.nn as nn
from pytorch_lightning import Trainer

from massspecgym.data import RetrievalDataset, MassSpecDataModule
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter
from massspecgym.models.base import Stage
from massspecgym.models.retrieval.base import RetrievalMassSpecGymModel


class MyDeepSetsRetrievalModel(RetrievalMassSpecGymModel):
    def __init__(
        self,
        hidden_channels: int = 128,
        out_channels: int = 4096,  # fingerprint size
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)

        # phi embeds each peak (m/z, intensity) independently
        self.phi = nn.Sequential(
            nn.Linear(2, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, hidden_channels),
            nn.ReLU(),
        )
        # rho maps the pooled spectrum embedding to a fingerprint
        self.rho = nn.Sequential(
            nn.Linear(hidden_channels, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, out_channels),
            nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.phi(x)
        x = x.sum(dim=-2)  # sum over peaks
        x = self.rho(x)
        return x

    def step(self, batch: dict, stage: Stage) -> dict:
        x = batch["spec"]  # input spectra
        fp_true = batch["mol"]  # true fingerprints
        cands = batch["candidates"]  # candidate fingerprints concatenated for a batch
        batch_ptr = batch["batch_ptr"]  # number of candidates per sample in a batch

        # Predict a fingerprint and regress it onto the true one
        fp_pred = self.forward(x)
        loss = nn.functional.mse_loss(fp_pred, fp_true)

        # Repeat each prediction once per candidate, then score by cosine similarity
        fp_pred_repeated = fp_pred.repeat_interleave(batch_ptr, dim=0)
        scores = nn.functional.cosine_similarity(fp_pred_repeated, cands)

        return dict(loss=loss, scores=scores)


# Init hyperparameters
n_peaks = 60
fp_size = 4096
batch_size = 32

# Load dataset
dataset = RetrievalDataset(
    spec_transform=SpecTokenizer(n_peaks=n_peaks),
    mol_transform=MolFingerprinter(fp_size=fp_size),
)

# Init data module
data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=batch_size,
    num_workers=4
)

# Init model
model = MyDeepSetsRetrievalModel(out_channels=fp_size)

# Init trainer
trainer = Trainer(accelerator="cpu", devices=1, max_epochs=5)

# Train
trainer.fit(model, datamodule=data_module)

# Test
trainer.test(model, datamodule=data_module)
```
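
The candidate-scoring step in the model above relies on a flat layout: all candidates for a batch are concatenated, and `batch_ptr` records how many belong to each sample. The following pure-Python sketch reproduces that logic without PyTorch so the indexing is easier to follow. The helper names `cosine` and `score_candidates` are hypothetical, and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_candidates(fp_pred, cands, batch_ptr):
    """Score a flat candidate list against per-sample predictions.

    fp_pred:   one predicted fingerprint per sample
    cands:     candidate fingerprints for all samples, concatenated
    batch_ptr: number of candidates belonging to each sample
    """
    assert sum(batch_ptr) == len(cands), "batch_ptr must cover every candidate"
    scores, start = [], 0
    for pred, n_cands in zip(fp_pred, batch_ptr):
        # Equivalent to repeating `pred` n_cands times and comparing row by row.
        for cand in cands[start:start + n_cands]:
            scores.append(cosine(pred, cand))
        start += n_cands
    return scores


# Two samples: the first has two candidates, the second has one.
fp_pred = [[1.0, 0.0], [0.0, 1.0]]
cands = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
batch_ptr = [2, 1]
scores = score_candidates(fp_pred, cands, batch_ptr)
print(scores)  # [1.0, 0.0, 1.0]
```

The output has one score per candidate, in the same flat order as `cands`, which is why `batch_ptr` is needed again later to split the scores back into per-sample rankings.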

Citation

```bibtex
@article{bushuiev2024massspecgym,
  title={MassSpecGym: A benchmark for the discovery and identification of molecules},
  author={Roman Bushuiev and Anton Bushuiev and Niek F. de Jonge and Adamo Young and Fleming Kretschmer and Raman Samusevich and Janne Heirman and Fei Wang and Luke Zhang and Kai Dührkop and Marcus Ludwig and Nils A. Haupt and Apurva Kalia and Corinna Brungs and Robin Schmid and Russell Greiner and Bo Wang and David S. Wishart and Li-Ping Liu and Juho Rousu and Wout Bittremieux and Hannes Rost and Tytus D. Mak and Soha Hassoun and Florian Huber and Justin J. J. van der Hooft and Michael A. Stravs and Sebastian Böcker and Josef Sivic and Tomáš Pluskal},
  year={2024},
  eprint={2410.23326},
  url={https://arxiv.org/abs/2410.23326},
  doi={10.48550/arXiv.2410.23326}
}
```

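Collected summary

Dataset introduction

Construction

MassSpecGym was built by consolidating high-quality labeled MS/MS spectra from multiple sources. The team first collected MS/MS spectra from public spectral libraries, including MoNA, MassBank, and GNPS. The data then went through a series of strict cleaning and standardization steps to ensure reliability and consistency, including removal of noise signals, correction of erroneous metadata, and standardization of molecular structure representations. The resulting dataset contains 231,000 high-quality MS/MS spectra representing 29,000 unique molecular structures, making it the largest publicly available dataset of its kind to date.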

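Features

MassSpecGym stands out for its scale and quality. As the first comprehensive benchmark of its kind, it provides a large collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It also introduces new evaluation metrics and a challenging data split, which standardizes MS/MS annotation tasks and makes them accessible to the broader machine learning community.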
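
As an illustration of how a retrieval ranking can be evaluated, the sketch below computes a top-k hit rate: the fraction of query spectra whose true molecule appears among the k highest-scoring candidates. This is a common retrieval metric chosen here as an example; treating it as one of the benchmark's own metrics is an assumption, and the function name `hit_rate_at_k`, the flat score layout, and the toy values are all hypothetical.

```python
def hit_rate_at_k(scores, labels, batch_ptr, k):
    """Fraction of samples whose true candidate ranks within the top k.

    scores:    candidate scores for all samples, concatenated
    labels:    True where the candidate is the correct molecule
    batch_ptr: number of candidates belonging to each sample
    """
    hits, start = 0, 0
    for n_cands in batch_ptr:
        sample_scores = scores[start:start + n_cands]
        sample_labels = labels[start:start + n_cands]
        # Candidate indices ordered from highest to lowest score.
        order = sorted(range(n_cands), key=lambda i: sample_scores[i], reverse=True)
        hits += any(sample_labels[i] for i in order[:k])
        start += n_cands
    return hits / len(batch_ptr)


# Two made-up queries with three candidates each.
scores = [0.9, 0.2, 0.5, 0.1, 0.8, 0.3]
labels = [False, True, False, False, False, True]
batch_ptr = [3, 3]

print(hit_rate_at_k(scores, labels, batch_ptr, k=1))  # 0.0
print(hit_rate_at_k(scores, labels, batch_ptr, k=2))  # 0.5
print(hit_rate_at_k(scores, labels, batch_ptr, k=3))  # 1.0
```

In the first query the true molecule has the lowest score (rank 3), and in the second it ranks second, so only k = 2 and above register a hit for the second query.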

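Usage

MassSpecGym is designed to simplify the development and evaluation of machine learning models. Users access it through a user-friendly interface built on PyTorch Lightning and Hugging Face, build new models on top of the prepared components, and submit their results to the Papers With Code leaderboard. In this way, MassSpecGym promotes reproducible research and speeds up the development of new MS/MS spectrum annotation methods.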

Background and challenges

Background overview

MassSpecGym, introduced in 2024, is a benchmark dataset designed to facilitate the discovery and identification of molecules from tandem mass spectrometry (MS/MS) data. It was developed by an interdisciplinary team of researchers from multiple institutions, including the Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences and the Czech Institute of Informatics, Robotics and Cybernetics. MassSpecGym addresses the need for standardized datasets and evaluation protocols in metabolomics and the chemical sciences. The dataset is the largest publicly available collection of high-quality labeled MS/MS spectra, with 231,000 spectra representing 29,000 unique molecular structures. MassSpecGym defines three core challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation, each aimed at advancing the state of the art in MS/MS annotation. The benchmark standardizes these tasks and makes them accessible to the broader machine learning community, which supports innovation and reproducibility in the field.

Current challenges

The primary challenge addressed by MassSpecGym is decoding molecular structures from their mass spectra, a task that is exceptionally difficult even for human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, which limits our understanding of the underlying biochemical processes. Work in this area faces several significant challenges: 1) The heterogeneity of data acquired under different mass spectrometry settings complicates effective learning. 2) The scarcity of high-quality annotated spectra necessitates rigorous data cleaning and standardization. 3) Variations in data pre-processing techniques and inconsistencies in data splitting methods can lead to data leakage. 4) Differences in approaches to MS/MS annotation and varying evaluation metrics further hinder the development of robust machine learning algorithms. 5) The proprietary nature of many datasets restricts access and reproducibility. Addressing these challenges requires not only advanced machine learning techniques but also a comprehensive and standardized benchmark to ensure the reliability and generalizability of MS/MS annotation methods.
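
To illustrate the data-leakage concern in point 3, here is a minimal sketch of a molecule-disjoint split: every spectrum of a given molecule is assigned to the same fold, so no structure seen in training reappears at test time. This is a deliberately simple technique shown for illustration and is not the benchmark's own splitting procedure; it does not stop near-identical structures from landing in different folds. The function `assign_fold`, the fold fractions, and the InChIKey-like strings are all made up.

```python
import hashlib
from collections import defaultdict


def assign_fold(inchikey, val_frac=0.1, test_frac=0.1):
    """Deterministically map a molecule to 'train', 'val', or 'test'.

    Hypothetical helper: hashes the first InChIKey block (2D connectivity),
    so all spectra of the same structure receive the same fold.
    """
    skeleton = inchikey.split("-")[0]
    digest = hashlib.sha256(skeleton.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


# Made-up records: two spectra of one molecule, one spectrum of another.
spectra = [
    {"id": "s1", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"id": "s2", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},
    {"id": "s3", "inchikey": "BBBBBBBBBBBBBB-UHFFFAOYSA-N"},
]

# Collect the set of structures that ended up in each fold.
folds = defaultdict(set)
for spec in spectra:
    folds[assign_fold(spec["inchikey"])].add(spec["inchikey"].split("-")[0])

# Leakage check: no structure may appear in more than one fold.
fold_sets = list(folds.values())
leaks = [a & b for i, a in enumerate(fold_sets) for b in fold_sets[i + 1:]]
assert not any(leaks)
```

Splitting at the level of spectra instead of molecules would let the same structure appear in both training and test data, which inflates reported performance.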

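Common scenarios

Typical use cases

The typical use of MassSpecGym is benchmarking methods for molecule discovery and identification. The dataset contains a large number of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. These challenges abstract the process of scientific discovery in biological and environmental samples into well-defined machine learning problems, so that a broad range of researchers can take part in advancing MS/MS spectrum annotation.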

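Academic problems addressed

MassSpecGym addresses a key research problem: discovering and identifying molecules in biological and environmental samples. By providing a standardized dataset and evaluation protocols, it substantially lowers the barrier to developing new methods and promotes the use of machine learning for predicting molecular structures. This matters for understanding biochemical processes, for drug development, and for environmental analysis, and it supports progress in interdisciplinary research.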

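Derived related work

The release of MassSpecGym has given rise to a series of related works, particularly in machine learning and computational metabolomics. For example, for the molecular structure generation and spectrum simulation challenges, researchers have developed a variety of deep learning models, such as SMILES Transformer and SELFIES Transformer, which perform well in molecular structure prediction and spectrum generation. The dataset's standardized data split and evaluation methods also serve as a reference for benchmarks in other fields.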

The content above was collected and summarized by 遇见数据集.
