Harmonized GNPS + MassSpecGym MS/MS spectra with NPClassifier and ChemOnt labels, top128 release
Source: DataCite Commons. Updated 2026-05-05; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20042904
Description:
This dataset contains zstd-compressed MGF spectra from GNPS public libraries and MassSpecGym, harmonized with canonical SMILES, InChIKey, molecular formula, NPClassifier labels, and ChemOnt/ClassyFire labels. It is intended for benchmarking and modelling structure-aware MS/MS classification. Labels are computational annotations, not experimentally curated ground truth.
File
`combined-gnps-mass-spec-gym-npc-faithful.harmonized-subset.top128.mgf.zst`
Spectra: `443,905`
Unique canonical SMILES: `45,007`
Unique SPLASH identifiers: `443,905`
Peaks per spectrum: `3-128`
Size: `182M`
SHA-256: `aee5a595960ef1284fe9977b6560a1ff771c07bb28c4a8e2e2ef47020b35724e`
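A downloaded copy can be checked against the SHA-256 listed above. The following is a minimal sketch using only the Python standard library; the function names `sha256_of_file` and `verify_download` are illustrative and not part of the release.

```python
import hashlib

# Digest copied from the "SHA-256" line of this record.
EXPECTED_SHA256 = "aee5a595960ef1284fe9977b6560a1ff771c07bb28c4a8e2e2ef47020b35724e"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True when the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_download("combined-gnps-mass-spec-gym-npc-faithful.harmonized-subset.top128.mgf.zst")` on an intact download should return `True`; the check applies to the compressed `.zst` file, not to the decompressed MGF.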
Sources
| Source | Spectra | Unique SMILES |
|---|---:|---:|
| GNPS `ALL_GNPS.mgf` | `389,560` | `42,583` |
| MassSpecGym MGF | `54,345` | `9,120` |
Source URLs:
`https://external.gnps2.org/gnpslibrary/ALL_GNPS.mgf`
`https://huggingface.co/datasets/roman-bushuiev/MassSpecGym/resolve/main/data/auxiliary/MassSpecGym.mgf?download=true`
Annotations
NPClassifier labels were predicted with NPClassifier.rs `Faithful`, CUDA f32, batch size `2048`. Coverage:
NPClassifier: `7` pathways, `75` superclasses, `503` classes
ChemOnt: `2` kingdoms, `20` superclasses, `303` classes, `734` subclasses, `1,580` direct parents
Each MGF record includes source provenance, `SMILES`, `INCHIKEY`, `FORMULA`, `SPLASH`, NPC labels/scores, and ChemOnt labels.
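To illustrate how such records might be read, here is a minimal sketch of a generic MGF reader in plain Python. It assumes the conventional `BEGIN IONS` / `END IONS` layout with `KEY=value` header lines followed by `m/z intensity` pairs; it is not the `mascot-rs` parser. The sample record and its values are made up, and the exact key names used for the NPC and ChemOnt labels are not listed in this description, so they are not shown.

```python
def parse_mgf(lines):
    """Yield (metadata, peaks) for each BEGIN IONS / END IONS block."""
    metadata, peaks, inside = {}, [], False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            metadata, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield metadata, peaks
            inside = False
        elif inside:
            if "=" in line and not line[0].isdigit():
                # Split on the first "=" only, since SMILES may contain "=".
                key, _, value = line.partition("=")
                metadata[key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


# Illustrative record only; the values are not taken from the dataset.
EXAMPLE = """\
BEGIN IONS
PEPMASS=61.028
CHARGE=1+
SMILES=CC(=O)O
FORMULA=C2H4O2
43.018 100.0
61.028 35.5
END IONS
"""

for meta, peak_list in parse_mgf(EXAMPLE.splitlines()):
    print(meta["SMILES"], meta["FORMULA"], len(peak_list))
```

The release file is zstd-compressed, so it has to be decompressed first (for example with `zstd -d`) before its lines are passed to a reader like this one.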
Processing
Spectra were parsed and serialized with `mascot-rs` at commit `e9641034f75d019279a63f99d767d83a3bed55d1`. SMILES were canonicalized with `smiles-parser`; RDKit `2024.03.3` was used for tautomer-canonical SMILES and InChIKeys. Filters retained valid organic structures with non-zero precursor m/z, non-zero charge, `3-128` peaks, no post-top128 duplicate SPLASH collision, complete NPC labels, and no `(SPLASH, PEPMASS)` NPC/ChemOnt disagreement group.
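The peak-count part of these filters can be sketched as follows. This assumes that "top128" means keeping the 128 most intense peaks of each spectrum, which the description does not state explicitly; tie-breaking and peak ordering in `mascot-rs` may differ. The function names `keep_top_peaks` and `passes_peak_count` are hypothetical.

```python
def keep_top_peaks(peaks, limit=128):
    """Keep the `limit` most intense peaks, returned in ascending m/z order.

    `peaks` is a list of (mz, intensity) tuples. Retention by intensity is
    an assumption about what "top128" means in this release.
    """
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:limit]
    return sorted(strongest, key=lambda peak: peak[0])


def passes_peak_count(peaks, minimum=3, maximum=128):
    """Mirror the stated `3-128` peaks-per-spectrum bound."""
    return minimum <= len(peaks) <= maximum


# A synthetic spectrum with 200 peaks of increasing intensity.
spectrum = [(100.0 + i, float(i)) for i in range(200)]
trimmed = keep_top_peaks(spectrum)
print(len(trimmed), passes_peak_count(trimmed))
```

Because trimming changes the peak list, any spectrum hash computed from the peaks changes with it, which is consistent with SPLASH values being recomputed after top128 retention.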
Difference From The Top60 Release
Peak retention was increased from top `60` to top `128`.
SPLASH values and collision groups were recomputed after top128 peak retention.
Top128 base MGF: `567,068` mascot-valid spectra.
Top128 harmonized MGF before final mascot filtering: `444,187` spectra.
Final strict `mascot-rs` accept/reject pass dropped `282` records.
Final file fully loads with current `mascot-rs`: `443,905` validated records.
Validation
Missing `CHARGE`: `0`
Charge/ionmode mismatches: `0`
Missing `IONMODE`: `361`
Total retained peaks: `17,098,616`
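The counts reported in this record can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the shared-SMILES figure holds only under the assumption that the per-source and overall unique counts use the same canonical SMILES form.

```python
# Spectra per source, from the Sources table.
by_source = {"GNPS": 389_560, "MassSpecGym": 54_345}

before_final_pass = 444_187   # harmonized MGF before final mascot filtering
dropped = 282                 # records rejected by the final strict pass
final_spectra = 443_905       # validated records in the released file
total_peaks = 17_098_616      # total retained peaks
missing_ionmode = 361         # records without IONMODE

# The per-source spectra add up to the released total.
assert sum(by_source.values()) == final_spectra

# The final strict pass accounts for the last reduction.
assert before_final_pass - dropped == final_spectra

# Average number of retained peaks per spectrum (about 38.5).
mean_peaks = total_peaks / final_spectra

# Fraction of records lacking IONMODE.
missing_ionmode_share = missing_ionmode / final_spectra

# Implied number of SMILES present in both sources (inclusion-exclusion),
# assuming all three unique counts refer to the same canonical form.
shared_smiles = 42_583 + 9_120 - 45_007

print(round(mean_peaks, 1), shared_smiles)
```

The mean of roughly 38.5 peaks per spectrum sits well below the cap of 128, which suggests that most spectra had fewer than 128 peaks to begin with.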
Provider:
Zenodo
Created:
2026-05-05



