OpenMol/RCR_SP_70K_SMILES-MMChat
收藏Hugging Face2024-11-04 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/OpenMol/RCR_SP_70K_SMILES-MMChat
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: id
dtype: int64
- name: molecules
struct:
- name: selfies
sequence: string
- name: smiles
sequence: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: ground_truth
dtype: string
splits:
- name: train
num_bytes: 72083085
num_examples: 70988
- name: dev
num_bytes: 7822847
num_examples: 7694
- name: test
num_bytes: 7964807
num_examples: 7793
download_size: 13186858
dataset_size: 87870739
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: dev
path: data/dev-*
- split: test
path: data/test-*
---
Reaction Condition Prediction Dataset (Solvent Prediction)
- molecule representation format: 1D SMILES
- will further encode into 2D graph features
For Detail, refer to *PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes*: https://arxiv.org/pdf/2406.13193
数据集信息:
特征字段:
- 标识符(id):数据类型为64位整数(int64)
- 分子(molecules):结构体类型,包含两个子字段:
- SELFIES字符串序列(selfies):字符串序列
- SMILES(Simplified Molecular-Input Line-Entry System)字符串序列(smiles):字符串序列
- 对话消息(messages):列表类型,列表元素为结构体,包含:
- 内容(content):字符串类型
- 角色(role):字符串类型
- 真实标签(ground_truth):字符串类型
数据集划分:
- 训练集(train):字节数72083085,样本量70988
- 验证集(dev):字节数7822847,样本量7694
- 测试集(test):字节数7964807,样本量7793
下载大小:13186858字节
数据集总大小:87870739字节
配置项:
- 默认配置(default):数据文件映射如下:
- 训练集对应路径:data/train-*
- 验证集对应路径:data/dev-*
- 测试集对应路径:data/test-*
反应条件预测数据集(溶剂预测方向)
- 分子表示格式:一维SMILES(Simplified Molecular-Input Line-Entry System)
后续将进一步编码为二维图特征
详细信息请参阅论文《PRESTO:渐进式预训练提升合成化学任务效果》:https://arxiv.org/pdf/2406.13193
提供机构:
OpenMol



