MolLangData: A Large-Scale Dataset of Paired Molecular Structures and Natural Language Descriptions
Source: DataONE. Updated 2026-05-04. Indexed 2026-05-27.
Download link:
https://search.dataone.org/view/sha256:cd0a6f16e5479c500158e32b77023112920627967a208ad059035f3c03d707cc
Resource description:
MolLangData is a large-scale dataset containing 163,111 paired samples of molecular structures and natural language descriptions, generated via a rule-regularized method using large language models. The dataset comprises two subsets: generated_data (161,111 rows) containing AI-generated descriptions across easy, medium, and hard difficulty levels; and validated_data (2,000 rows) containing curated and human-validated examples with 98.6% overall precision. Each sample includes the compound's CID, SMILES notation, IUPAC name, and a natural language description (375–8,010 characters). The dataset is intended for training and evaluating molecular language models, supporting tasks such as molecular structure recognition, description generation, and molecular property reasoning.
Created:
2026-05-07
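
The record layout given in the resource description can be sketched in Python as follows. This is a minimal illustration, not the dataset's actual schema: the field names (cid, smiles, iupac_name, description), the class name MolLangRecord, and the helper description_length_ok are hypothetical, and the example record uses placeholder values that are not taken from the dataset. Only the row counts and the 375 to 8,010 character range come from the description.

```python
from dataclasses import dataclass

# Figures stated in the resource description.
GENERATED_ROWS = 161_111   # generated_data subset
VALIDATED_ROWS = 2_000     # validated_data subset
TOTAL_ROWS = 163_111       # total paired samples
DESC_MIN_CHARS = 375       # shortest description length reported
DESC_MAX_CHARS = 8_010     # longest description length reported


@dataclass(frozen=True)
class MolLangRecord:
    """One paired sample. Field names are assumptions, not the real column names."""
    cid: int            # compound identifier
    smiles: str         # SMILES notation
    iupac_name: str     # IUPAC name
    description: str    # natural language description


def description_length_ok(record: MolLangRecord) -> bool:
    """Check that the description length falls within the reported range."""
    return DESC_MIN_CHARS <= len(record.description) <= DESC_MAX_CHARS


# The two subset sizes should add up to the stated total.
assert GENERATED_ROWS + VALIDATED_ROWS == TOTAL_ROWS

# Placeholder record for illustration only (not an actual dataset row).
example = MolLangRecord(
    cid=1,
    smiles="CCO",
    iupac_name="ethanol",
    description="x" * 400,
)
print(description_length_ok(example))
```

A check like this could serve as a quick sanity filter after loading either subset, once the real column names are known.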



