SMolInstruct
Download link:
https://modelscope.cn/datasets/osunlp/SMolInstruct
<h1 align="center"> ⚛️ SMolInstruct </h1>
SMolInstruct is a **large-scale**, **comprehensive**, and **high-quality instruction tuning dataset** crafted for **chemistry**. It centers on small molecules and contains 14 carefully selected tasks and over 3M samples.
This dataset has both **SMILES** and **SELFIES** versions; you can switch to SELFIES by passing `use_selfies=True` when loading.
**Version History**
- v1.3.0 (2024.09.17): Added a unique `sample_id` to each sample. Also added documentation for `insert_core_tags`, which controls whether core information is wrapped in core tags (e.g., `<SMILES> ... </SMILES>`).
- v1.2.0 (2024.04.21): Added a small test subset with at most 200 samples for each task. You can use it by setting `use_test_subset=True`. Also added `use_first` to load only the first given number of samples for each task. See below for details.
- v1.1.1 (2024.04.18): Fixed the double tag problem (`<SMILES> <SMILES> ... </SMILES> </SMILES>`) for retrosynthesis. We recommend that all users use this or a newer version.
- v1.1.0 (2024.03.05): Deleted a small number of samples with invalid molecules, and added SELFIES.
- v1.0.0 (2024.02.13): Uploaded the first version.
**Paper**: [LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset](https://arxiv.org/abs/2402.09391)
**Page**: [https://osu-nlp-group.github.io/LlaSMol](https://osu-nlp-group.github.io/LlaSMol)
**Code**: [https://github.com/OSU-NLP-Group/LlaSMol](https://github.com/OSU-NLP-Group/LlaSMol)
**Models**: [https://huggingface.co/osunlp/LlaSMol](https://huggingface.co/osunlp/LlaSMol)
## 🔭 Overview
The following figure illustrates the tasks and corresponding examples.

The following table shows the tasks and statistics over the SMolInstruct dataset, where “Qry.” and “Resp.” are average lengths of queries and responses, respectively.

An example is shown below:
```python
{
'sample_id': 'forward_synthesis.train.1',
'input': 'Based on the given reactants and reagents: <SMILES> CCCCCCCC/C=C\\CCCCCCCC(=O)OCCNCCOC(=O)CCCCCCC/C=C\\CCCCCCCC.CCN=C=NCCCN(C)C.CN(C)C1=CC=NC=C1.CN(C)CCSCC(=O)O.CO.Cl.ClCCl.O.O=C(O)C(F)(F)F.O=C([O-])[O-].[K+] </SMILES>, what product could potentially be produced?',
'output': 'The product can be <SMILES> CCCCCCCC/C=C\\CCCCCCCC(=O)OCCN(CCOC(=O)CCCCCCC/C=C\\CCCCCCCC)C(=O)CSCCN(C)C </SMILES> .',
'raw_input': 'CCCCCCCC/C=C\\CCCCCCCC(=O)OCCNCCOC(=O)CCCCCCC/C=C\\CCCCCCCC.CCN=C=NCCCN(C)C.CN(C)C1=CC=NC=C1.CN(C)CCSCC(=O)O.CO.Cl.ClCCl.O.O=C(O)C(F)(F)F.O=C([O-])[O-].[K+]',
'raw_output': 'CCCCCCCC/C=C\\CCCCCCCC(=O)OCCN(CCOC(=O)CCCCCCC/C=C\\CCCCCCCC)C(=O)CSCCN(C)C',
'split': 'train',
'task': 'forward_synthesis',
'input_core_tag_left': '<SMILES>',
'input_core_tag_right': '</SMILES>',
'output_core_tag_left': '<SMILES>',
'output_core_tag_right': '</SMILES>',
'target': None
}
```
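The core tags stored in each sample can be used to pull the answer out of a tagged string. The snippet below is a minimal sketch of that idea, not the project's own extraction code: `extract_core` is a hypothetical helper, and the sample is a shortened, made-up record that only mirrors the field names shown above. It assumes the tag fields are non-empty strings.

```python
import re

def extract_core(text, tag_left, tag_right):
    """Return every segment of `text` enclosed by the given core tags."""
    # Escape the tags so characters such as '<' and '/' are matched literally.
    pattern = re.escape(tag_left) + r"\s*(.*?)\s*" + re.escape(tag_right)
    return re.findall(pattern, text, flags=re.DOTALL)

# Shortened, hypothetical sample using the same field names as the dataset.
sample = {
    "output": "The product can be <SMILES> CCO </SMILES> .",
    "raw_output": "CCO",
    "output_core_tag_left": "<SMILES>",
    "output_core_tag_right": "</SMILES>",
}

segments = extract_core(
    sample["output"],
    sample["output_core_tag_left"],
    sample["output_core_tag_right"],
)
print(segments)  # prints ['CCO']
```

The same helper could be applied to a model's generated text at inference time, comparing the extracted segment against `raw_output`.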
## ⚔️ Usage
You can use the following lines to load the dataset:
```python
from datasets import load_dataset
dataset = load_dataset('osunlp/SMolInstruct')
train_set = dataset['train']
validation_set = dataset['validation']
test_set = dataset['test']
```
The SELFIES version can be loaded by simply adding an argument:
```python
dataset = load_dataset('osunlp/SMolInstruct', use_selfies=True)
```
You can also specify what tasks to load:
```python
ALL_TASKS = (
'forward_synthesis',
'retrosynthesis',
'molecule_captioning',
'molecule_generation',
'name_conversion-i2f',
'name_conversion-i2s',
'name_conversion-s2f',
'name_conversion-s2i',
'property_prediction-esol',
'property_prediction-lipo',
'property_prediction-bbbp',
'property_prediction-clintox',
'property_prediction-hiv',
'property_prediction-sider',
)
# Without a `split` argument, all splits are returned together.
dataset = load_dataset('osunlp/SMolInstruct', tasks=ALL_TASKS)
```
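Because the task names share family prefixes (for example `property_prediction-`), a subset can be built programmatically instead of typed out. The following is a small sketch; `tasks_in_family` is a hypothetical helper and not part of the dataset's loading script.

```python
ALL_TASKS = (
    'forward_synthesis',
    'retrosynthesis',
    'molecule_captioning',
    'molecule_generation',
    'name_conversion-i2f',
    'name_conversion-i2s',
    'name_conversion-s2f',
    'name_conversion-s2i',
    'property_prediction-esol',
    'property_prediction-lipo',
    'property_prediction-bbbp',
    'property_prediction-clintox',
    'property_prediction-hiv',
    'property_prediction-sider',
)

def tasks_in_family(family, all_tasks=ALL_TASKS):
    """Return the task names equal to `family` or prefixed with `family-`."""
    return tuple(t for t in all_tasks if t == family or t.startswith(family + '-'))

property_tasks = tasks_in_family('property_prediction')
print(property_tasks)

# The resulting tuple can then be passed through the `tasks` argument:
# dataset = load_dataset('osunlp/SMolInstruct', tasks=property_tasks)
```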
To evaluate your models quickly, you can pass `use_test_subset=True` to load a subset of the test set. In this subset, each task has at most 200 samples.
```python
test_set = load_dataset('osunlp/SMolInstruct', split='test', use_test_subset=True)
```
You can also pass `use_first=INTEGER` to load at most the first `INTEGER` samples for each task.
```python
# load first 500 samples for each task
test_set = load_dataset('osunlp/SMolInstruct', split='test', use_first=500)
```
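If the data is already in memory, the "at most the first N samples per task" behavior described above can be reproduced locally. This is only an illustration of that behavior, not the loader's implementation; `first_n_per_task` and the toy records are hypothetical.

```python
from collections import defaultdict

def first_n_per_task(samples, n):
    """Keep at most the first `n` samples of each task, preserving order."""
    counts = defaultdict(int)
    kept = []
    for sample in samples:
        if counts[sample['task']] < n:
            kept.append(sample)
            counts[sample['task']] += 1
    return kept

# Toy records carrying only the fields needed for the illustration.
toy_samples = [
    {'sample_id': 0, 'task': 'forward_synthesis'},
    {'sample_id': 1, 'task': 'forward_synthesis'},
    {'sample_id': 2, 'task': 'retrosynthesis'},
    {'sample_id': 3, 'task': 'forward_synthesis'},
    {'sample_id': 4, 'task': 'retrosynthesis'},
]

subset = first_n_per_task(toy_samples, 2)
print([s['sample_id'] for s in subset])  # prints [0, 1, 2, 4]
```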
The `insert_core_tags` argument controls whether the core tags are added. It defaults to `True`.
```python
test_set = load_dataset('osunlp/SMolInstruct', split='test', insert_core_tags=False)
```
## 🛠️ Evaluation
The evaluation code will be at [https://github.com/OSU-NLP-Group/LlaSMol](https://github.com/OSU-NLP-Group/LlaSMol).
## 🛠️ Data Construction
The construction of SMolInstruct goes through a four-step pipeline:
- **data collection**: Collect data from various sources and organize it for the tasks.
- **quality control**: Rigorous scrutiny is applied to remove samples with chemically invalid SMILES and wrong or inaccurate information, as well as duplicated samples.
- **data splitting**: Samples are carefully split into train/validation/test sets to avoid data leakage across tasks. The splitting is also compatible with previous work to facilitate fair comparison.
- **instruction construction**: We create natural and diverse templates for creating instructions. Molecular SMILES representations are canonicalized to provide a standardized data format. In addition, we use special tags to encapsulate the corresponding segments (e.g., `<SMILES>...</SMILES>` for SMILES) to promote model learning during training and facilitate answer extraction during inference.
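
The instruction construction step can be pictured roughly as follows. This is a hedged sketch rather than the authors' pipeline: the first template is taken from the sample shown earlier, the second template and the helper names `wrap_with_tags` and `build_instruction` are hypothetical, and canonicalization of the SMILES string is assumed to have happened beforehand.

```python
import random

# The first template mirrors the example sample; the second is made up.
TEMPLATES = [
    "Based on the given reactants and reagents: {core}, what product could potentially be produced?",
    "Predict a possible product of the following reactants and reagents: {core}",
]

def wrap_with_tags(core, tag_left="<SMILES>", tag_right="</SMILES>"):
    """Enclose the core segment in its tags, separated by single spaces."""
    return f"{tag_left} {core} {tag_right}"

def build_instruction(raw_input, rng, insert_core_tags=True):
    """Fill a randomly chosen template with the (optionally tagged) input."""
    core = wrap_with_tags(raw_input) if insert_core_tags else raw_input
    return rng.choice(TEMPLATES).format(core=core)

rng = random.Random(0)  # seeded so the template choice is reproducible
tagged = build_instruction("CCO.CC(=O)O", rng)
untagged = build_instruction("CCO.CC(=O)O", rng, insert_core_tags=False)
print(tagged)
print(untagged)
```

Sampling among several templates is what gives the instructions their diversity, while the fixed tags keep the core segment easy to locate.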
## 🚨 License
The **SMolInstruct** dataset is licensed under CC BY 4.0.
We emphatically urge all users to adhere to the highest ethical standards when using our dataset, including maintaining fairness, transparency, and responsibility in their research. Any usage of the dataset that may lead to harm or pose a detriment to society is strictly **forbidden**.
## 🔍 Citation
If our paper or related resources prove valuable to your research, we kindly ask that you cite our work. Please feel free to contact us with any inquiries.
```
@inproceedings{
yu2024llasmol,
title={Lla{SM}ol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
author={Botao Yu and Frazier N. Baker and Ziqi Chen and Xia Ning and Huan Sun},
booktitle={First Conference on Language Modeling},
year={2024},
url={https://openreview.net/forum?id=lY6XTF9tPv}
}
```
Thank you for your interest in our work.
Provider: maas
Created: 2025-07-04



