IDEA-AI4S/MoleculeQA

Name: IDEA-AI4S/MoleculeQA
Creator: IDEA-AI4S
Published: 2024-11-26 15:40:29
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2024-11-26; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/IDEA-AI4S/MoleculeQA

Description:

---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- chemistry
- molecule
---

# Dataset Card for MoleculeQA

## Dataset Details

### Dataset Description

[MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension (EMNLP 2024)](https://aclanthology.org/2024.findings-emnlp.216)

- **Curated by:** [IDEA-XL](https://github.com/IDEA-XL)
- **Language(s) (NLP):** en
- **License:** mit

### Dataset Sources

- **Repository:** https://github.com/IDEA-XL/MoleculeQA
- **Paper:** https://arxiv.org/abs/2403.08192

## Dataset Structure

```
- JSON
  - All
    - train.json # 49,993
    - valid.json # 5,795
    - test.json # 5,786
- TXT
  - All
    - train.txt
    - valid.txt
    - test.txt
  - Property
  - Source
  - Structure
  - Usage
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63458f173cc8a5caf9b84e48/gr1PDjhOXP-6c7Z8KaAMb.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63458f173cc8a5caf9b84e48/QELSG-259d4o1ByD-hi4H.png)

## Dataset Creation

### Curation Rationale

Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information. Traditional evaluations fail to assess a model’s factual correctness. To fill this gap, we present MoleculeQA, a novel question answering (QA) dataset containing 62K QA pairs over 23K molecules. Each QA pair consists of a manually written question, one positive option, and three negative options, and is semantically consistent with a molecular description from an authoritative corpus. MoleculeQA is not only the first benchmark to evaluate molecular factual correctness but also the largest molecular QA dataset. A comprehensive evaluation of existing molecular LLMs on MoleculeQA exposes their deficiencies in specific aspects and pinpoints crucial factors for molecular modeling. Furthermore, we employ MoleculeQA in reinforcement learning to mitigate model hallucinations, thereby enhancing the factual correctness of generated information.

### Source Data

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63458f173cc8a5caf9b84e48/qbOw0mIWTztzZhbkWn0Tk.png)

#### Data Collection and Processing

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63458f173cc8a5caf9b84e48/FqkfVhXeMJ6vaoY6Utqdp.png)

## Citation

**BibTeX:**

```
@inproceedings{lu-etal-2024-moleculeqa,
    title = "{M}olecule{QA}: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension",
    author = "Lu, Xingyu and Cao, He and Liu, Zijing and Bai, Shengyuan and Chen, Leqing and Yao, Yuan and Zheng, Hai-Tao and Li, Yu",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.216",
    pages = "3769--3789",
    abstract = "Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information. Traditional evaluations fail to assess a model{'}s factual correctness. To rectify this absence, we present MoleculeQA, a novel question answering (QA) dataset which possesses 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, has consistent semantics with a molecular description from authoritative corpus. MoleculeQA is not only the first benchmark to evaluate molecular factual correctness but also the largest molecular QA dataset. A comprehensive evaluation on MoleculeQA for existing molecular LLMs exposes their deficiencies in specific aspects and pinpoints crucial factors for molecular modeling. Furthermore, we employ MoleculeQA in reinforcement learning to mitigate model hallucinations, thereby enhancing the factual correctness of generated information.",
}
```

## Dataset Card Authors

[He CAO (CiaoHe)](https://github.com/CiaoHe)
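As a rough illustration of the four-option format described under Curation Rationale (one positive option plus three negative options per question), the sketch below shuffles the options for a single question and scores predictions by accuracy. It is not the authors' evaluation code: the function names `build_options` and `accuracy` and the example question text are hypothetical, and the actual field names inside `train.json` are not documented in this card, so the sketch works on plain strings rather than on the real records. Only the split sizes are taken from the Dataset Structure listing.

```python
import random

# Split sizes as listed in the Dataset Structure tree of this card.
SPLIT_SIZES = {"train": 49_993, "valid": 5_795, "test": 5_786}


def build_options(positive, negatives, rng):
    """Shuffle one positive and three negative options.

    Returns the shuffled options and the index of the positive one.
    """
    if len(negatives) != 3:
        raise ValueError("expected exactly three negative options")
    candidates = [positive, *negatives]
    order = list(range(4))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    # candidates[0] is the positive option, so find where index 0 ended up.
    return options, order.index(0)


def accuracy(predicted, gold):
    """Fraction of questions where the predicted index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical example; real questions and options come from the split files.
    options, gold_index = build_options(
        "It is a carboxylic acid.",
        ["It is an alkane.", "It is an ether.", "It is an amine."],
        rng,
    )
    print(options, gold_index)
    print(sum(SPLIT_SIZES.values()))  # 61574, consistent with the "62K QA pairs" figure
```

With four options per question, uniform random guessing scores about 25% accuracy, which is a useful floor when reading reported results.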


Provider: IDEA-AI4S
