
Harmonized GNPS + MassSpecGym MS/MS spectra with NPClassifier and ChemOnt labels, top128 release

DataCite Commons | Updated: 2026-05-05 | Indexed: 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20042904
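
The description lists a SHA-256 digest for the release file, so a download can be checked before use. The following is a minimal Python sketch, assuming the archive was saved locally under its published file name. The local path and the helper names `sha256_of` and `verify_download` are illustrative and not part of the release.

```python
import hashlib
from pathlib import Path

# Digest and file name copied from the dataset description.
EXPECTED_SHA256 = "aee5a595960ef1284fe9977b6560a1ff771c07bb28c4a8e2e2ef47020b35724e"
FILE_NAME = "combined-gnps-mass-spec-gym-npc-faithful.harmonized-subset.top128.mgf.zst"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Calling `verify_download(FILE_NAME)` should return `True` for an intact copy; a `False` result suggests a truncated or altered download.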
Resource description:
This dataset contains zstd-compressed MGF spectra from GNPS public libraries and MassSpecGym, harmonized with canonical SMILES, InChIKey, molecular formula, NPClassifier labels, and ChemOnt/ClassyFire labels. It is intended for benchmarking and modelling structure-aware MS/MS classification. Labels are computational annotations, not experimentally curated ground truth.

File

`combined-gnps-mass-spec-gym-npc-faithful.harmonized-subset.top128.mgf.zst`

- Spectra: `443,905`
- Unique canonical SMILES: `45,007`
- Unique SPLASH identifiers: `443,905`
- Peaks per spectrum: `3-128`
- Size: `182M`
- SHA-256: `aee5a595960ef1284fe9977b6560a1ff771c07bb28c4a8e2e2ef47020b35724e`

Sources

| Source | Spectra | Unique SMILES |
|---|---:|---:|
| GNPS `ALL_GNPS.mgf` | `389,560` | `42,583` |
| MassSpecGym MGF | `54,345` | `9,120` |

Source URLs:

- `https://external.gnps2.org/gnpslibrary/ALL_GNPS.mgf`
- `https://huggingface.co/datasets/roman-bushuiev/MassSpecGym/resolve/main/data/auxiliary/MassSpecGym.mgf?download=true`

Annotations

NPClassifier labels were predicted with NPClassifier.rs `Faithful`, CUDA f32, batch size `2048`.

Coverage:

- NPClassifier: `7` pathways, `75` superclasses, `503` classes
- ChemOnt: `2` kingdoms, `20` superclasses, `303` classes, `734` subclasses, `1,580` direct parents

Each MGF record includes source provenance, `SMILES`, `INCHIKEY`, `FORMULA`, `SPLASH`, NPC labels/scores, and ChemOnt labels.

Processing

Spectra were parsed and serialized with `mascot-rs` at commit `e9641034f75d019279a63f99d767d83a3bed55d1`. SMILES were canonicalized with `smiles-parser`; RDKit `2024.03.3` was used for tautomer-canonical SMILES and InChIKeys. Filters retained valid organic structures with non-zero precursor m/z, non-zero charge, `3-128` peaks, no post-top128 duplicate SPLASH collision, complete NPC labels, and no `(SPLASH, PEPMASS)` NPC/ChemOnt disagreement group.

Difference From The Top60 Release

- Peak retention was increased from top `60` to top `128`.
- SPLASH values and collision groups were recomputed after top128 peak retention.
- Top128 base MGF: `567,068` mascot-valid spectra.
- Top128 harmonized MGF before final mascot filtering: `444,187` spectra.
- Final strict `mascot-rs` accept/reject pass dropped `282` records.
- Final file fully loads with current `mascot-rs`: `443,905` validated records.

Validation

- Missing `CHARGE`: `0`
- Charge/ionmode mismatches: `0`
- Missing `IONMODE`: `361`
- Total retained peaks: `17,098,616`
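
The validation figures in the description (spectrum count, peaks per spectrum, missing `IONMODE`) can in principle be recounted by streaming the file. The sketch below is one way to do that in Python, under several assumptions: records use the usual `BEGIN IONS` / `END IONS` MGF layout with `KEY=value` header lines and whitespace-separated `m/z intensity` peak lines; a missing `IONMODE` appears as an absent or empty field; and decompression uses the third-party `zstandard` package. The helper names `open_zst_text`, `iter_mgf`, and `summarize` are hypothetical, and this is not the `mascot-rs` validation logic.

```python
import io


def open_zst_text(path):
    """Open a .zst file as a text stream (assumes `pip install zstandard`)."""
    import zstandard  # third-party; imported lazily so the parser works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


def iter_mgf(lines):
    """Yield (fields, peaks) for each BEGIN IONS / END IONS block."""
    fields, peaks, inside = {}, [], False
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            fields, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield fields, peaks
            inside = False
        elif inside:
            if "=" in line and not line[0].isdigit():
                # Split on the first '=' only, since SMILES values may contain '='.
                key, _, value = line.partition("=")
                fields[key.strip().upper()] = value.strip()
            else:
                parts = line.split()
                if len(parts) >= 2:
                    peaks.append((float(parts[0]), float(parts[1])))


def summarize(records, min_peaks=3, max_peaks=128):
    """Count spectra, peaks, missing IONMODE, and out-of-range peak counts."""
    stats = {"spectra": 0, "peaks": 0, "missing_ionmode": 0, "out_of_range": 0}
    for fields, peaks in records:
        stats["spectra"] += 1
        stats["peaks"] += len(peaks)
        if not fields.get("IONMODE"):
            stats["missing_ionmode"] += 1
        if not min_peaks <= len(peaks) <= max_peaks:
            stats["out_of_range"] += 1
    return stats
```

For the real file, `summarize(iter_mgf(open_zst_text(path)))` streams the archive without loading it into memory. If the assumptions above hold, the resulting counts can be compared against the figures listed under Validation; any difference may simply reflect a parsing assumption that does not match how `mascot-rs` serializes records.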
Provider: Zenodo
Created: 2026-05-05