
MolLangData: A Large-Scale Dataset of Paired Molecular Structures and Natural Language Descriptions

Source: DataONE. Updated 2026-05-04; indexed 2026-05-27.
Download link:
https://search.dataone.org/view/sha256:cd0a6f16e5479c500158e32b77023112920627967a208ad059035f3c03d707cc
Description:
MolLangData is a large-scale dataset containing 163,111 paired samples of molecular structures and natural language descriptions, generated via a rule-regularized method using large language models. The dataset comprises two subsets: generated_data (161,111 rows) containing AI-generated descriptions across easy, medium, and hard difficulty levels; and validated_data (2,000 rows) containing curated and human-validated examples with 98.6% overall precision. Each sample includes the compound's CID, SMILES notation, IUPAC name, and a natural language description (375–8,010 characters). The dataset is intended for training and evaluating molecular language models, supporting tasks such as molecular structure recognition, description generation, and molecular property reasoning.
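The record layout described above can be sanity-checked with a short script. The following is a minimal sketch, not the dataset's official loader: the field names (`cid`, `smiles`, `iupac_name`, `description`) are assumed for illustration and may not match the actual column headers, and the sample record uses a synthetic placeholder description rather than real dataset text.

```python
# Description length bounds (in characters) quoted in the dataset summary.
MIN_DESC_LEN = 375
MAX_DESC_LEN = 8010

# Hypothetical field names; the real column headers may differ.
REQUIRED_FIELDS = ("cid", "smiles", "iupac_name", "description")

# Subset sizes stated in the summary; their sum should match the reported total.
SUBSET_ROWS = {"generated_data": 161_111, "validated_data": 2_000}
TOTAL_ROWS = sum(SUBSET_ROWS.values())


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing field: {field}")

    description = record.get("description") or ""
    if description and not (MIN_DESC_LEN <= len(description) <= MAX_DESC_LEN):
        problems.append(
            f"description length {len(description)} outside "
            f"{MIN_DESC_LEN}-{MAX_DESC_LEN}"
        )
    return problems


# Synthetic sample: ethanol identifiers with a placeholder description.
sample = {
    "cid": 702,
    "smiles": "CCO",
    "iupac_name": "ethanol",
    "description": "x" * 400,  # placeholder text, not real dataset content
}

print(TOTAL_ROWS)
print(check_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize data-quality issues across many rows in a single pass.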
Created:
2026-05-07