OpenMol/PubChemSFT
收藏数据集概述
数据集文件
- 文件名:
all_clean.json-
移除了与ChEBI-20测试集重叠的部分。
-
移除了无描述的SMILES。
-
数据格式: python { SMILES <str>: [ ["Please describe the molecule", DESCRIPTION], ..., ] }
-
统计信息:
- 最大令牌长度: 6113
- 最小令牌长度: 20
- 平均令牌长度: 191
- 中位数令牌长度: 149
- 总单轮对话数: 326,689
- 总SMILES示例数: 293,302
-
数据集大小
- 训练集: 264,391
- 验证集: 33,072
- 测试集: 32,987
对话模板
python conversation:{ [ "from": "human", "value": <QUERY>, # 随机从查询模板中抽样 ], [ "from": "gpt", "value": <TEXT>, # 关于给定分子的描述 ], ]
数据集条目内容
"graph": [ "edge_index":, # 数组形式 (int64) "edge_feat":, # 数组形式 (int64) "node_feat":, # 数组形式 (int64) "num_nodes":, # 整数 ], "conversation": # 如上所述
查询模板
python { <image> Could you give me a brief overview of this molecule?, <image> Could you provide a description of this molecule?, <image> Describe this molecule., <image> Please give me some details about this molecule., <image> Provide a brief overview of this molecule., <image> Provide a description of this molecule., <image> What can you tell me about this molecule?, Could you give me a brief overview of this molecule? <image>, Could you provide a description of this molecule? <image>, Describe this molecule. <image>, Please give me some details about this molecule. <image>, Provide a brief overview of this molecule. <image>, Provide a description of this molecule. <image>, What can you tell me about this molecule? <image> }




