
~500k MS2 spectra with 70k SMILES, InChIKey, NPC & ClassyFire annotations

Source: DataCite Commons. Updated 2026-05-05. Indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20039648
Resource description:
# Harmonized GNPS + MassSpecGym MS/MS spectra with NPClassifier and ChemOnt labels

This dataset contains zstd-compressed MGF spectra from GNPS public libraries and MassSpecGym, harmonized with canonical SMILES, InChIKey, molecular formula, NPClassifier labels, and ChemOnt/ClassyFire labels. It is intended for benchmarking and modelling structure-aware MS/MS classification. Labels are computational annotations, not experimentally curated ground truth.

## Annotations

NPClassifier labels were predicted with NPClassifier.rs `Faithful`, CUDA f32, batch size `2048`.

Coverage:

- NPClassifier: `7` pathways, `75` superclasses, `503` classes
- ChemOnt: `2` kingdoms, `20` superclasses, `303` classes, `734` subclasses, `1,580` direct parents

Each MGF record includes source provenance, `SMILES`, `INCHIKEY`, `FORMULA`, `SPLASH`, NPC labels/scores, and ChemOnt labels.

## Processing

Spectra were parsed and serialized with `mascot-rs`. SMILES were canonicalized with `smiles-parser`; RDKit `2024.03.3` was used for tautomer-canonical SMILES and InChIKeys. Filters retained valid organic structures with non-zero precursor m/z, non-zero charge, `3-60` peaks, no post-top60 duplicate SPLASH collision, complete NPC labels, and no `(SPLASH, PEPMASS)` NPC/ChemOnt disagreement group.

## Fixed Compared To The Previous Version

- Updated to latest `mascot-rs`: `e9641034f75d019279a63f99d767d83a3bed55d1`.
- Regenerated from GNPS and MassSpecGym sources instead of reusing the older corrupted MGF.
- Did not patch contradictory metadata; records rejected by current `mascot-rs` were dropped.
- Removed charge/ionmode conflicts: final mismatch count is `0`.
- Added a final strict `mascot-rs` accept/reject pass; `282` harmonized records failed validation and were excluded.
- Final file fully loads with current `mascot-rs`: `439,403` validated records.
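## Example: re-checking the basic filters

The precursor m/z, charge, and peak-count filters described under Processing can be sketched in plain Python. This is a minimal illustration, not the `mascot-rs` implementation: `parse_mgf`, `parse_charge`, and `passes_basic_filters` are hypothetical helper names, the sample records are toy data rather than spectra from the dataset, and reading `3-60` as an inclusive peak-count range is an assumption. The structure, SPLASH, and label-based filters are not reproduced here.

```python
from typing import Iterable, Iterator


def parse_mgf(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per BEGIN IONS / END IONS block (hypothetical helper)."""
    record = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            record = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if record is not None:
                yield record
            record = None
        elif record is None or not line:
            continue  # skip text outside a block and blank lines
        elif "=" in line and not line[0].isdigit():
            # KEY=VALUE header line; split once so SMILES with '=' survive
            key, value = line.split("=", 1)
            record["params"][key.upper()] = value
        else:
            # peak line: "<m/z> <intensity>"
            mz, intensity = line.split()[:2]
            record["peaks"].append((float(mz), float(intensity)))


def parse_charge(text: str) -> int:
    """Parse MGF-style charges such as '1+', '2-', '+1'; return 0 if unknown."""
    text = text.strip()
    sign = -1 if "-" in text else 1
    digits = text.strip("+-")
    return sign * int(digits) if digits.isdigit() else 0


def passes_basic_filters(record: dict, min_peaks: int = 3, max_peaks: int = 60) -> bool:
    """Assumed reading of the README: precursor m/z > 0, charge != 0, 3-60 peaks."""
    params = record["params"]
    pepmass = params.get("PEPMASS", "").split()
    precursor_mz = float(pepmass[0]) if pepmass else 0.0
    charge = parse_charge(params.get("CHARGE", ""))
    return (
        precursor_mz > 0
        and charge != 0
        and min_peaks <= len(record["peaks"]) <= max_peaks
    )


# Toy records (made-up values), only to exercise the helpers above.
SAMPLE = """\
BEGIN IONS
PEPMASS=47.05
CHARGE=1+
SMILES=CCO
20.0 10.0
30.0 50.0
40.0 100.0
END IONS
BEGIN IONS
PEPMASS=0.0
CHARGE=1+
50.0 1.0
60.0 2.0
70.0 3.0
END IONS
BEGIN IONS
PEPMASS=300.1
CHARGE=0
50.0 1.0
END IONS
"""

records = list(parse_mgf(SAMPLE.splitlines()))
kept = [r for r in records if passes_basic_filters(r)]
print(len(records), len(kept))
```

To run this against the published file, decompress it first (for example with the `zstd` command-line tool) and pass the opened text file to `parse_mgf` in place of `SAMPLE.splitlines()`.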
Provider:
Zenodo
Created:
2026-05-05