MolLangData: A Large-Scale Dataset of Paired Molecular Structures and Natural Language Descriptions

Source: DataONE. Updated 2026-05-04; indexed 2026-05-27.

Download link:

https://search.dataone.org/view/sha256:cd0a6f16e5479c500158e32b77023112920627967a208ad059035f3c03d707cc

Description:

MolLangData is a large-scale dataset containing 163,111 paired samples of molecular structures and natural language descriptions, generated via a rule-regularized method using large language models. The dataset comprises two subsets: generated_data (161,111 rows) containing AI-generated descriptions across easy, medium, and hard difficulty levels; and validated_data (2,000 rows) containing curated and human-validated examples with 98.6% overall precision. Each sample includes the compound's CID, SMILES notation, IUPAC name, and a natural language description (375–8,010 characters). The dataset is intended for training and evaluating molecular language models, supporting tasks such as molecular structure recognition, description generation, and molecular property reasoning.
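The record layout described above can be sketched in Python as a minimal, non-authoritative illustration. The listing names the four fields but not the file format or column headers, so the class name `MolLangRecord`, its attribute names, and the helper `is_well_formed` are assumptions made for this example. The example record uses ethanol (PubChem CID 702) with a synthetic placeholder description, not text taken from the dataset.

```python
from dataclasses import dataclass

# Hypothetical record type: the listing names these four fields,
# but the attribute names below are assumptions, not the real column headers.
@dataclass(frozen=True)
class MolLangRecord:
    cid: int
    smiles: str
    iupac_name: str
    description: str

# Description length bounds quoted in the listing (in characters).
MIN_DESC_CHARS = 375
MAX_DESC_CHARS = 8010

# Subset sizes quoted in the listing.
SUBSET_ROWS = {"generated_data": 161_111, "validated_data": 2_000}
TOTAL_ROWS = sum(SUBSET_ROWS.values())


def is_well_formed(record: MolLangRecord) -> bool:
    """Check a record against the constraints stated in the listing."""
    return (
        record.cid > 0
        and bool(record.smiles.strip())
        and bool(record.iupac_name.strip())
        and MIN_DESC_CHARS <= len(record.description) <= MAX_DESC_CHARS
    )


# Synthetic example: real identifiers for ethanol, placeholder description.
example = MolLangRecord(
    cid=702,
    smiles="CCO",
    iupac_name="ethanol",
    description="Placeholder text, not taken from the dataset. " * 10,
)

print(is_well_formed(example), TOTAL_ROWS)
```

A check like this is only a sanity filter on field presence and description length; it does not verify that the SMILES string parses or that the description matches the structure.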

Created:

2026-05-07
