oxai4science/sagan-mc

Name: oxai4science/sagan-mc
Creator: oxai4science
Published: 2025-05-27 20:10:58
License: No description available

Source: Hugging Face | Updated: 2025-05-27 | Indexed: 2025-11-01

Download link:

https://hf-mirror.com/datasets/oxai4science/sagan-mc

Description:

SaganMC is a machine-learning-ready dataset designed for molecular complexity prediction, spectral analysis, and chemical discovery. Molecular complexity metrics quantify how structurally intricate a molecule is, reflecting how difficult it is to construct or synthesize. The dataset includes 406,446 molecules, of which a subset of 16,653 includes experimental mass spectra. We provide standard representations (SMILES, InChI, SELFIES), RDKit-derived molecular descriptors, Morgan fingerprints, and three complementary complexity scores: Bertz, Böttcher, and the Molecular Assembly Index (MA). MA scores, computed using code from the Cronin Group, are especially relevant to astrobiology research as potential agnostic biosignatures. Assigning MA indices to molecules is compute-intensive, and generating this dataset required over 100,000 CPU hours on Google Cloud. SaganMC is named in honor of Carl Sagan, the astronomer and science communicator whose work inspired generations to explore life beyond Earth. The initial version of this dataset was produced during a NASA Frontier Development Lab (FDL) astrobiology sprint.
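To put the quoted figures in proportion, the short sketch below derives two numbers from the description: the share of molecules that carry experimental mass spectra, and a rough lower bound on the average compute per molecule. The second figure rests on an assumption not stated in the description, namely that the 100,000+ CPU hours were spread over all 406,446 molecules. The function and constant names are illustrative only.

```python
# Figures quoted in the dataset description.
TOTAL_MOLECULES = 406_446
MOLECULES_WITH_MASS_SPECTRA = 16_653
MIN_CPU_HOURS = 100_000  # "over 100,000 CPU hours", so this is a lower bound


def spectra_fraction(with_spectra: int, total: int) -> float:
    """Share of molecules that include experimental mass spectra."""
    return with_spectra / total


def min_cpu_minutes_per_molecule(cpu_hours: float, molecules: int) -> float:
    """Average CPU minutes per molecule.

    Assumption (not stated in the description): the compute budget was
    spread over every molecule in the dataset.
    """
    return cpu_hours * 60 / molecules


fraction = spectra_fraction(MOLECULES_WITH_MASS_SPECTRA, TOTAL_MOLECULES)
minutes = min_cpu_minutes_per_molecule(MIN_CPU_HOURS, TOTAL_MOLECULES)

print(f"Molecules with mass spectra: {fraction:.2%}")
print(f"Average compute: at least {minutes:.1f} CPU minutes per molecule")
```

Under these assumptions, roughly 4% of the molecules have spectra, and the average cost works out to about a quarter of a CPU hour per molecule. Actual per-molecule cost likely varies widely with molecular size.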

Provider:

oxai4science
