Leffingwell Odor Dataset
Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-29.
Download link:
https://zenodo.org/record/4085098
Description:
NOTE: It's easier to download this dataset from pyrfume. Here's how:
# First install pyrfume in your Python environment. This can be done easily with pip.
# pip install pyrfume
import pyrfume
molecules = pyrfume.load_data('leffingwell/molecules.csv', remote=True)
behavior = pyrfume.load_data('leffingwell/behavior.csv', remote=True)
# e.g. to count the number of molecules with each descriptor
behavior.sum().sort_values(ascending=False).astype(int)
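As a hedged illustration of the descriptor count above, the sketch below runs the same expression on a small made-up table instead of the real data. It assumes (the source does not confirm this) that `behavior` has one row per molecule and one 0/1 column per odor descriptor; the descriptor names and molecule labels here are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `behavior` table: one row per molecule,
# one 0/1 column per odor descriptor (layout assumed, values made up).
toy_behavior = pd.DataFrame(
    {"fruity": [1, 0, 1, 1], "green": [0, 1, 1, 0], "woody": [0, 0, 0, 1]},
    index=["mol_a", "mol_b", "mol_c", "mol_d"],
)

# Number of molecules carrying each descriptor, most frequent first.
descriptor_counts = toy_behavior.sum().sort_values(ascending=False).astype(int)

# Number of descriptors attached to each molecule (a molecule can have several).
labels_per_molecule = toy_behavior.sum(axis=1)

print(descriptor_counts)
print(labels_per_molecule)
```

If the real table contains non-numeric columns, selecting numeric columns first (for example with `select_dtypes("number")`) may be needed before summing.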
Predicting properties of molecules is an area of growing research in machine learning, particularly as models for learning from graph-valued inputs improve in sophistication and robustness. A molecular property prediction problem that has received comparatively little attention during this surge in research activity is building Structure-Odor Relationship (SOR) models (as opposed to Quantitative Structure-Activity Relationships, a term from medicinal chemistry). This is a 70+ year-old problem straddling chemistry, physics, neuroscience, and machine learning.

To spur development on the SOR problem, we curated and cleaned a dataset of 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database. We provide featurizations of all molecules in the dataset using bit-based and count-based fingerprints, Mordred molecular descriptors, and the embeddings from our trained GNN model (Sanchez-Lengeling et al., 2019).

The dataset comprises two data files, plus a readme and a license:
leffingwell_data.csv: molecular structures and what they smell like, along with train, test, and cross-validation splits. More detail on the file structure is found in leffingwell_readme.pdf.
leffingwell_embeddings.npz: several featurizations of the molecules in the dataset.
leffingwell_readme.pdf: a more detailed description of the data and its provenance, including expected performance metrics.
LICENSE: a copy of the CC-BY-NC license language.

The dataset, and all associated features, is freely available for research use under the CC-BY-NC license. If you use the data in a publication, please cite:
@article{sanchez2019machine,
title={Machine learning for scent: Learning generalizable perceptual representations of small molecules},
author={Sanchez-Lengeling, Benjamin and Wei, Jennifer N and Lee, Brian K and Gerkin, Richard C and Aspuru-Guzik, Al{\'a}n and Wiltschko, Alexander B},
journal={arXiv preprint arXiv:1910.10685},
year={2019}
}
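For readers who download the files from the Zenodo record instead of using pyrfume, the following is a minimal sketch of how leffingwell_data.csv and leffingwell_embeddings.npz might be inspected. The column names and array names inside the files are not given in this description, so the code only reports what it finds; the helper names `summarize_csv` and `summarize_npz` are hypothetical, and the paths assume the files sit in the current directory.

```python
import os

import numpy as np
import pandas as pd


def summarize_csv(path):
    """Return (number of rows, list of column names) for a CSV file."""
    frame = pd.read_csv(path)
    return len(frame), list(frame.columns)


def summarize_npz(path):
    """Return {array name: shape} for every array stored in an .npz archive."""
    # np.load defaults to allow_pickle=False; object arrays would need it enabled.
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Assumed local file names, taken from the file list in the description.
for path, summarize in [
    ("leffingwell_data.csv", summarize_csv),
    ("leffingwell_embeddings.npz", summarize_npz),
]:
    if os.path.exists(path):
        print(path, summarize(path))
    else:
        print(path, "not found; download it from the Zenodo record first")
```

The readme (leffingwell_readme.pdf) remains the authoritative description of the columns and splits; this sketch only lists what is present.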
Created:
2023-06-28
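Background overview
The Leffingwell Odor Dataset is an odor-molecule dataset for machine learning research. It contains 3523 molecules with expert-labeled odor descriptors, drawn from the Leffingwell PMP 2001 database. The dataset provides several molecular featurizations, such as fingerprints, Mordred descriptors, and GNN embeddings, to support the development of Structure-Odor Relationship (SOR) models. It is released under the CC-BY-NC license for non-commercial research use only.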



