L+M-24
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/language-plus-molecules/lpm-24-dataset
Description:
The L+M-24 dataset is designed to highlight three key benefits of natural language in molecule design: compositionality, functionality, and abstraction. It consists of molecule-description pairs drawn from several sources, including properties extracted from PubChem and from patent literature. The evaluation set is divided into two tasks, and a special validation split drawn from the training data was released alongside it. The training set contains 160,492 pairs and the evaluation set contains 21,839 pairs. The two tasks are molecule captioning and molecule generation.
Provided by:
Authors of the paper



