
phenixace/OpenMolIns-xlarge

Source: Hugging Face. Updated 2026-02-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/phenixace/OpenMolIns-xlarge
Resource description:
---
license: apache-2.0
language:
- en
dataset_info:
  config_name: OpenMolIns-xlarge
  size: 1200000
---

# OpenMolIns Instruction Tuning Dataset (XLarge)

Instruction tuning dataset for **Open-domain Natural Language-Driven Molecule Generation**, aligned with [S²-Bench (TOMG)](https://phenixace.github.io/tomgbench/). This is the **xlarge** variant with **1,200,000** instruction-molecule pairs.

## Task Types

The dataset covers 9 molecular generation and optimization subtasks (aligned with S²-Bench configurations):

- **MolCustom_AtomNum**: Molecular customized generation by atom number
- **MolCustom_BondNum**: Molecular customized generation by bond number
- **MolCustom_FunctionalGroup**: Molecular customized generation by functional group
- **MolEdit_AddComponent**: Molecular editing by adding components
- **MolEdit_SubComponent**: Molecular editing by substituting components
- **MolEdit_DelComponent**: Molecular editing by deleting components
- **MolOpt_LogP**: Molecular optimization for LogP
- **MolOpt_MR**: Molecular optimization for MR (molar refractivity)
- **MolOpt_QED**: Molecular optimization for QED (quantitative estimation of drug-likeness)

## Dataset Structure

| Column      | Description |
|-------------|-------------|
| SubTask     | One of: AtomNum, BondNum, FunctionalGroup, AddComponent, SubComponent, DelComponent, LogP, MR, QED |
| Instruction | Natural language instruction |
| molecule    | Target molecule as a SMILES (Simplified Molecular Input Line Entry System) string |

## Usage

```python
from datasets import load_dataset

# Load the xlarge training set
dataset = load_dataset("phenixace/OpenMolIns-xlarge")

# dataset["train"]: SubTask, Instruction, molecule
print(dataset["train"].num_rows)  # 1200000
```

## OpenMolIns Variants

| Variant | # Instructions |
|---------|----------------|
| light   | 4,500          |
| small   | 18,000         |
| medium  | 45,000         |
| large   | 90,000         |
| xlarge  | 1,200,000      |

## Evaluation

Models trained on OpenMolIns can be evaluated on [S²-Bench (TOMG)](https://huggingface.co/datasets/phenixace/S2-TOMG-Bench). See the [benchmark leaderboard](https://phenixace.github.io/tomgbench/) for results. The top-performing model on the S²-Bench leaderboard, Llama3.1-8B, is trained on the **OpenMolIns-xlarge** variant.

## Citation

If you use this dataset, please cite:

```bibtex
@article{li2024speak,
  title={Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation},
  author={Li, Jiatong and Li, Junxian and Liu, Yunqing and Zheng, Changmeng and Wei, Xiaoyong and Zhou, Dongzhan and Li, Qing},
  journal={arXiv preprint arXiv:2412.14642v3},
  year={2024}
}
```

## Links

- [S²-Bench / TOMG Benchmark](https://phenixace.github.io/tomgbench/)
- [S2-TOMG-Bench GitHub](https://github.com/phenixace/S2-TOMG-Bench)
- [S²-Bench Dataset on Hugging Face](https://huggingface.co/datasets/phenixace/S2-TOMG-Bench)
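The `SubTask` column stores only the short suffix (for example `LogP`), while the task list names each subtask with a family prefix (for example `MolOpt_LogP`). The sketch below shows one way to map between the two and to group rows by family. The helper names (`full_task_name`, `count_by_family`, `to_prompt_completion`) and the prompt/completion layout are hypothetical and not part of the dataset or of the `datasets` library, and the sample rows are placeholders rather than real records.

```python
from collections import Counter

# Task families and subtasks as listed in the dataset card.
TASK_FAMILIES = {
    "MolCustom": ["AtomNum", "BondNum", "FunctionalGroup"],
    "MolEdit": ["AddComponent", "SubComponent", "DelComponent"],
    "MolOpt": ["LogP", "MR", "QED"],
}

# Invert the table: SubTask value -> prefixed task name.
FULL_TASK_NAME = {
    sub: f"{family}_{sub}"
    for family, subs in TASK_FAMILIES.items()
    for sub in subs
}


def full_task_name(subtask):
    """Return the prefixed task name for a SubTask value (hypothetical helper)."""
    return FULL_TASK_NAME[subtask]


def count_by_family(rows):
    """Count rows per task family; each row is a dict with a 'SubTask' key."""
    counts = Counter()
    for row in rows:
        family = full_task_name(row["SubTask"]).split("_", 1)[0]
        counts[family] += 1
    return dict(counts)


def to_prompt_completion(row):
    """Assumed layout for fine-tuning: instruction as prompt, SMILES as completion."""
    return {"prompt": row["Instruction"], "completion": row["molecule"]}


# Placeholder rows with the card's three columns (not taken from the dataset).
sample_rows = [
    {"SubTask": "AtomNum", "Instruction": "placeholder instruction A", "molecule": "CCO"},
    {"SubTask": "LogP", "Instruction": "placeholder instruction B", "molecule": "CCC"},
    {"SubTask": "QED", "Instruction": "placeholder instruction C", "molecule": "CCN"},
]

family_counts = count_by_family(sample_rows)
first_pair = to_prompt_completion(sample_rows[0])
```

Because iterating over a `datasets` split yields one dict per row, the same functions should also work when passed `dataset["train"]` in place of `sample_rows`, assuming the column names match the card.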

Provided by:
phenixace