phenixace/OpenMolIns-xlarge
Source: Hugging Face · Updated 2026-02-03 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/phenixace/OpenMolIns-xlarge
Resource description:
---
license: apache-2.0
language:
- en
dataset_info:
  config_name: OpenMolIns-xlarge
  size: 1200000
---
# OpenMolIns Instruction Tuning Dataset (XLarge)
Instruction tuning dataset for **Open-domain Natural Language-Driven Molecule Generation**, aligned with [S²-Bench (TOMG)](https://phenixace.github.io/tomgbench/).
This is the **xlarge** variant with **1,200,000** instruction–molecule pairs.
## Task Types
The dataset covers 9 molecular generation and optimization subtasks (aligned with S²-Bench configurations):
- **MolCustom_AtomNum**: Molecular customized generation by atom number
- **MolCustom_BondNum**: Molecular customized generation by bond number
- **MolCustom_FunctionalGroup**: Molecular customized generation by functional group
- **MolEdit_AddComponent**: Molecular editing – adding components
- **MolEdit_SubComponent**: Molecular editing – substituting components
- **MolEdit_DelComponent**: Molecular editing – deleting components
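- **MolOpt_LogP**: Molecular optimization for LogP (octanol-water partition coefficient)
- **MolOpt_MR**: Molecular optimization for MR (molar refractivity)
- **MolOpt_QED**: Molecular optimization for QED (quantitative estimate of drug-likeness)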
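
The task names above follow a `Category_SubTask` pattern, and the `SubTask` column of the dataset appears to store only the suffix. The following is a minimal sketch of splitting the names into those two parts, assuming that convention holds for every row. The helper `split_task_name` is hypothetical and not part of any official tooling.

```python
# Full task names as listed in this dataset card.
TASK_NAMES = [
    "MolCustom_AtomNum",
    "MolCustom_BondNum",
    "MolCustom_FunctionalGroup",
    "MolEdit_AddComponent",
    "MolEdit_SubComponent",
    "MolEdit_DelComponent",
    "MolOpt_LogP",
    "MolOpt_MR",
    "MolOpt_QED",
]


def split_task_name(name: str) -> tuple[str, str]:
    """Split a name such as 'MolOpt_LogP' into ('MolOpt', 'LogP')."""
    category, _, subtask = name.partition("_")
    return category, subtask


# Group the SubTask suffixes by their category prefix.
SUBTASKS_BY_CATEGORY: dict[str, list[str]] = {}
for name in TASK_NAMES:
    category, subtask = split_task_name(name)
    SUBTASKS_BY_CATEGORY.setdefault(category, []).append(subtask)

for category, subtasks in SUBTASKS_BY_CATEGORY.items():
    print(category, subtasks)
```

Grouping by category can be handy when reporting results separately for customized generation, editing, and optimization.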
## Dataset Structure
| Column | Description |
|-----------|--------------------------------------------|
| SubTask | One of: AtomNum, BondNum, FunctionalGroup, AddComponent, SubComponent, DelComponent, LogP, MR, QED |
| Instruction | Natural language instruction |
| molecule | Target molecule as a SMILES (Simplified Molecular Input Line Entry System) string |
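
For supervised fine-tuning, each row usually has to be mapped to an input/output pair. The sketch below shows one possible mapping, assuming `Instruction` is used verbatim as the prompt and `molecule` as the completion. The function name `to_prompt_completion` and the example row are hypothetical; the row is made up for illustration and is not copied from the dataset.

```python
def to_prompt_completion(row: dict) -> dict:
    """Map one dataset row to a prompt/completion pair (assumed format)."""
    return {
        "prompt": row["Instruction"].strip(),
        "completion": row["molecule"].strip(),
    }


# Hypothetical row for illustration only; not taken from the dataset.
example_row = {
    "SubTask": "AtomNum",
    "Instruction": "Please generate a molecule with 2 carbon atoms and 1 oxygen atom.",
    "molecule": "CCO",
}

pair = to_prompt_completion(example_row)
print(pair["prompt"])
print(pair["completion"])
```

A real training pipeline would typically wrap the prompt in the chat template of the target model; that step is omitted here because it depends on the model.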
## Usage
```python
from datasets import load_dataset
# Load the xlarge training set
dataset = load_dataset("phenixace/OpenMolIns-xlarge")
# dataset["train"]: SubTask, Instruction, molecule
print(dataset["train"].num_rows) # 1200000
```
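
After loading, a per-subtask breakdown is a quick sanity check. The sketch below works on any iterable of row dictionaries, which should include `dataset["train"]`. The helpers `count_subtasks` and `select_subtask` are hypothetical, and the sample rows are made up so that the snippet runs offline.

```python
from collections import Counter
from typing import Iterable, List, Mapping


def count_subtasks(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many rows belong to each SubTask value."""
    return Counter(row["SubTask"] for row in rows)


def select_subtask(rows: Iterable[Mapping[str, str]], subtask: str) -> List[Mapping[str, str]]:
    """Return only the rows whose SubTask equals the given value."""
    return [row for row in rows if row["SubTask"] == subtask]


# Made-up rows standing in for dataset["train"]; not real dataset content.
sample_rows = [
    {"SubTask": "AtomNum", "Instruction": "instruction A", "molecule": "CCO"},
    {"SubTask": "QED", "Instruction": "instruction B", "molecule": "c1ccccc1"},
    {"SubTask": "QED", "Instruction": "instruction C", "molecule": "CC(=O)O"},
]

counts = count_subtasks(sample_rows)
qed_rows = select_subtask(sample_rows, "QED")
print(counts)
print(len(qed_rows))
```

On the full dataset, the `filter` method of the `datasets` library, for example `dataset["train"].filter(lambda r: r["SubTask"] == "QED")`, is likely a better fit than a list comprehension because it avoids building a Python list of every matching row.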
## OpenMolIns Variants
| Variant | # Instructions |
|---------|----------------|
| light | 4,500 |
| small | 18,000 |
| medium | 45,000 |
| large | 90,000 |
| xlarge | 1,200,000 |
## Evaluation
Models trained on OpenMolIns can be evaluated on [S²-Bench (TOMG)](https://huggingface.co/datasets/phenixace/S2-TOMG-Bench). See the [benchmark leaderboard](https://phenixace.github.io/tomgbench/) for results. The top-performing model on the S²-Bench leaderboard, Llama3.1-8B, was trained on **OpenMolIns-xlarge**.
## Citation
If you use this dataset, please cite:
```bibtex
@article{li2024speak,
title={Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation},
author={Li, Jiatong and Li, Junxian and Liu, Yunqing and Zheng, Changmeng and Wei, Xiaoyong and Zhou, Dongzhan and Li, Qing},
journal={arXiv preprint arXiv:2412.14642v3},
year={2024}
}
```
## Links
- [S²-Bench / TOMG Benchmark](https://phenixace.github.io/tomgbench/)
- [S2-TOMG-Bench GitHub](https://github.com/phenixace/S2-TOMG-Bench)
- [S²-Bench Dataset on Hugging Face](https://huggingface.co/datasets/phenixace/S2-TOMG-Bench)
Provider:
phenixace



