---
license: mit
task_categories:
- graph-ml
---
# Dataset Card for alchemy
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [External Use](#external-use)
- [PyGeometric](#pygeometric)
- [Dataset Structure](#dataset-structure)
- [Data Properties](#data-properties)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **[Homepage](https://alchemy.tencent.com/)**
- **Paper:** (see citation)
- **Leaderboard:** [Leaderboard](https://alchemy.tencent.com/)
### Dataset Summary
The `alchemy` dataset (Alchemy) is a molecular dataset that lists 12 quantum mechanical properties of more than 130,000 organic molecules. Each molecule comprises up to 12 heavy atoms (C, N, O, S, F and Cl) and is sampled from the GDBMedChem database.
### Supported Tasks and Leaderboards
`alchemy` should be used for quantum property prediction on organic molecules, a regression task on 12 properties. The score used is the mean absolute error (MAE).
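As an illustration of the metric, here is a minimal sketch that computes the MAE separately for each target column. The function name `mae_per_property` and the toy values are hypothetical. The card does not say how the 12 per-property errors are combined into a single score, so the plain average at the end is an assumption.

```python
def mae_per_property(y_true, y_pred):
    """Return one mean absolute error per target column."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n_graphs = len(y_true)
    n_props = len(y_true[0])
    totals = [0.0] * n_props
    # Accumulate |error| per property over all graphs.
    for true_row, pred_row in zip(y_true, y_pred):
        for j in range(n_props):
            totals[j] += abs(true_row[j] - pred_row[j])
    return [total / n_graphs for total in totals]


# Toy example: 2 graphs and 3 properties (the real task has 12).
y_true = [[1.0, 2.0, 3.0], [2.0, 2.0, 5.0]]
y_pred = [[1.5, 2.0, 2.0], [2.5, 1.0, 5.0]]
per_property = mae_per_property(y_true, y_pred)
# Assumption: combine the per-property errors with an unweighted mean.
overall = sum(per_property) / len(per_property)
```

Reporting the per-property values alongside any aggregate is useful because the properties may live on very different scales.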
## External Use
### PyGeometric
To load the dataset in PyGeometric (PyTorch Geometric), do the following:
```python
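import torch
from datasets import load_dataset
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

dataset_hf = load_dataset("graphs-datasets/<mydataset>")
# For the train set (replace by valid or test as needed).
# Data expects tensors passed as keyword arguments, so convert each field.
tensor_keys = ("edge_index", "edge_attr", "y")
dataset_pg_list = [
    Data(x=torch.tensor(g["node_feat"]), num_nodes=g["num_nodes"],
         **{k: torch.tensor(g[k]) for k in tensor_keys})
    for g in dataset_hf["train"]
]
dataset_pg = DataLoader(dataset_pg_list)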
```
## Dataset Structure
### Data Properties
| property | value |
|---|---|
| scale | big |
| #graphs | 202578 |
| average #nodes | 10.101387606810183 |
| average #edges | 20.877326870011206 |
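The statistics in this table can be recomputed from the rows themselves. The sketch below assumes each row is a dictionary with `num_nodes` and `edge_index` keys, and that an edge is counted once per column of `edge_index` (so an undirected bond stored in both directions counts twice). The helper name `summarize` and the toy graphs are hypothetical.

```python
def summarize(graphs):
    """Compute graph count and average node/edge counts for a list of rows."""
    n_graphs = len(graphs)
    total_nodes = sum(g["num_nodes"] for g in graphs)
    # edge_index has shape 2 x #edges, so its first row has one entry per edge.
    total_edges = sum(len(g["edge_index"][0]) for g in graphs)
    return {
        "#graphs": n_graphs,
        "average #nodes": total_nodes / n_graphs,
        "average #edges": total_edges / n_graphs,
    }


# Two toy graphs: a 3-node path and a single 2-node bond, both directions stored.
toy_graphs = [
    {"num_nodes": 3, "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]},
    {"num_nodes": 2, "edge_index": [[0, 1], [1, 0]]},
]
stats = summarize(toy_graphs)
```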
### Data Fields
Each row of a given file is a graph, with:
- `node_feat` (list: #nodes x #node-features): node features
- `edge_index` (list: 2 x #edges): pairs of nodes constituting edges
- `edge_attr` (list: #edges x #edge-features): for the aforementioned edges, contains their features
- `y` (list: 1 x #labels): contains the labels to predict (here 12 continuous values, one per quantum mechanical property)
- `num_nodes` (int): number of nodes of the graph
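To make the field shapes concrete, the following sketch checks that one row is internally consistent. The function name `find_shape_problems` and every value in `toy_row` are hypothetical; real rows have their own feature widths and label values.

```python
def find_shape_problems(row):
    """Return a list of human-readable shape inconsistencies in one graph row."""
    problems = []
    n_nodes = row["num_nodes"]
    if len(row["node_feat"]) != n_nodes:
        problems.append("node_feat must have one entry per node")
    if len(row["edge_index"]) != 2:
        problems.append("edge_index must have exactly 2 rows")
        return problems
    sources, targets = row["edge_index"]
    if len(sources) != len(targets):
        problems.append("edge_index rows must have equal length")
    if len(row["edge_attr"]) != len(sources):
        problems.append("edge_attr must have one entry per edge")
    # Every endpoint must be a valid node position.
    if any(not 0 <= i < n_nodes for i in sources + targets):
        problems.append("edge_index refers to a node outside the graph")
    if len(row["y"]) != 1:
        problems.append("y must have shape 1 x #labels")
    return problems


# Hypothetical toy row: 3 nodes, 2 bonds stored in both directions, 2 labels.
toy_row = {
    "node_feat": [[1, 0], [0, 1], [0, 1]],
    "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]],
    "edge_attr": [[1], [1], [2], [2]],
    "y": [[0.1, -0.2]],
    "num_nodes": 3,
}
problems = find_shape_problems(toy_row)
```

An empty list means the row matches the layout described above.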
### Data Splits
This dataset is not split and should be used with cross-validation. It comes from the PyGeometric version of the dataset.
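Since no official split exists, one possible way to set up cross-validation is a shuffled k-fold partition of graph indices, sketched below with the standard library only. The number of folds, the seed, and the helper name `k_fold_indices` are assumptions for illustration; the card does not prescribe a protocol.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slices give folds whose sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


# Toy usage with 10 graphs; in practice n_samples is the number of rows loaded.
splits = list(k_fold_indices(10, n_folds=5, seed=0))
```

Each pair of index lists can then be used to select rows for training and evaluation, and the reported score is typically averaged over the folds.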
## Additional Information
### Licensing Information
The dataset has been released under the MIT license.
### Citation Information
```
@inproceedings{Morris+2020,
title={TUDataset: A collection of benchmark datasets for learning with graphs},
author={Christopher Morris and Nils M. Kriege and Franka Bause and Kristian Kersting and Petra Mutzel and Marion Neumann},
booktitle={ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020)},
archivePrefix={arXiv},
eprint={2007.08663},
url={www.graphlearning.io},
year={2020}
}
```
```
@article{DBLP:journals/corr/abs-1906-09427,
author = {Guangyong Chen and
Pengfei Chen and
Chang{-}Yu Hsieh and
Chee{-}Kong Lee and
Benben Liao and
Renjie Liao and
Weiwen Liu and
Jiezhong Qiu and
Qiming Sun and
Jie Tang and
Richard S. Zemel and
Shengyu Zhang},
title = {Alchemy: {A} Quantum Chemistry Dataset for Benchmarking {AI} Models},
journal = {CoRR},
volume = {abs/1906.09427},
year = {2019},
url = {http://arxiv.org/abs/1906.09427},
eprinttype = {arXiv},
eprint = {1906.09427},
timestamp = {Mon, 11 Nov 2019 12:55:11 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1906-09427.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```