IDEA-AI4S/MoleculeQA
Hugging Face | Updated 2024-11-26 | Indexed 2026-04-05
Download link: https://hf-mirror.com/datasets/IDEA-AI4S/MoleculeQA
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- chemistry
- molecule
---
# Dataset Card for MoleculeQA
<!-- Provide a quick summary of the dataset. -->
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
[MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension (EMNLP 2024)](https://aclanthology.org/2024.findings-emnlp.216)
- **Curated by:** [IDEA-XL](https://github.com/IDEA-XL)
- **Language(s) (NLP):** en
- **License:** mit
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Repository:** https://github.com/IDEA-XL/MoleculeQA
- **Paper:** https://arxiv.org/abs/2403.08192
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
- JSON
  - All
    - train.json  # 49,993
    - valid.json  # 5,795
    - test.json   # 5,786
- TXT
  - All
    - train.txt
    - valid.txt
    - test.txt
  - Property
  - Source
  - Structure
  - Usage
```
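As a quick sanity check after downloading, the split sizes listed above can be compared against local copies of the JSON files. The sketch below is only illustrative: it assumes each file holds a single top-level JSON array (or object) whose entries correspond to QA records, which the card does not state, and the helper names `count_records` and `check_splits` are hypothetical.

```python
import json
from pathlib import Path

# Split sizes as stated in the dataset card.
EXPECTED_SIZES = {"train.json": 49_993, "valid.json": 5_795, "test.json": 5_786}


def count_records(path):
    """Return the number of top-level entries in a JSON file.

    Assumption: each file holds one JSON array (or object) whose
    top-level entries correspond to QA records.
    """
    with open(path, encoding="utf-8") as handle:
        return len(json.load(handle))


def check_splits(split_dir, expected=EXPECTED_SIZES):
    """Compare record counts under split_dir with the expected sizes.

    Returns {filename: (found, expected)} for every mismatch; an empty
    dict means all splits match.
    """
    mismatches = {}
    for name, want in expected.items():
        found = count_records(Path(split_dir) / name)
        if found != want:
            mismatches[name] = (found, want)
    return mismatches


# The three splits sum to 61,574 pairs, which the card rounds to 62K.
TOTAL_PAIRS = sum(EXPECTED_SIZES.values())
```

Calling `check_splits` on the directory that contains the three JSON split files should return an empty dict if the local copies match the card; a non-empty result may simply mean the files are organized differently from what this sketch assumes.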


## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
Large language models play an increasingly significant role in molecular research, yet existing models often generate erroneous information, and traditional evaluations fail to assess a model's factual correctness. To fill this gap, we present MoleculeQA, a novel question answering (QA) dataset with 62K QA pairs over 23K molecules. Each QA pair consists of a manually written question, one positive option, and three negative options, and is semantically consistent with a molecular description from an authoritative corpus. MoleculeQA is not only the first benchmark to evaluate molecular factual correctness but also the largest molecular QA dataset. A comprehensive evaluation of existing molecular LLMs on MoleculeQA exposes their deficiencies in specific aspects and pinpoints crucial factors for molecular modeling. Furthermore, we employ MoleculeQA in reinforcement learning to mitigate model hallucinations, thereby enhancing the factual correctness of generated information.
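Because each QA pair has one positive and three negative options, a natural way to score a model is multiple-choice accuracy. The sketch below shows one way to represent an item and compute that score. The `QAItem` class and its field names are hypothetical and do not reflect the dataset's actual schema, and accuracy is assumed here as the metric rather than taken from the card.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice item (hypothetical field names)."""
    question: str
    options: Sequence[str]  # four options: one positive, three negative
    answer_index: int       # position of the positive option in `options`


def accuracy(items, predictions):
    """Fraction of items whose predicted option index is the positive one."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(item.answer_index == pred
                  for item, pred in zip(items, predictions))
    return correct / len(items)


# Placeholder items for illustration only; not taken from the dataset.
demo_items = [
    QAItem("placeholder question 1", ("A", "B", "C", "D"), answer_index=0),
    QAItem("placeholder question 2", ("A", "B", "C", "D"), answer_index=2),
]
demo_score = accuracy(demo_items, [0, 1])  # first prediction right, second wrong
```

With four options per question, uniform random guessing would be expected to score about 25%, which gives a baseline for reading any reported accuracy.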
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->

#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

## Citation
**BibTeX:**
```
@inproceedings{lu-etal-2024-moleculeqa,
title = "{M}olecule{QA}: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension",
author = "Lu, Xingyu and
Cao, He and
Liu, Zijing and
Bai, Shengyuan and
Chen, Leqing and
Yao, Yuan and
Zheng, Hai-Tao and
Li, Yu",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.216",
pages = "3769--3789",
abstract = "Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information. Traditional evaluations fail to assess a model{'}s factual correctness. To rectify this absence, we present MoleculeQA, a novel question answering (QA) dataset which possesses 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, has consistent semantics with a molecular description from authoritative corpus. MoleculeQA is not only the first benchmark to evaluate molecular factual correctness but also the largest molecular QA dataset. A comprehensive evaluation on MoleculeQA for existing molecular LLMs exposes their deficiencies in specific aspects and pinpoints crucial factors for molecular modeling. Furthermore, we employ MoleculeQA in reinforcement learning to mitigate model hallucinations, thereby enhancing the factual correctness of generated information.",
}
```
## Dataset Card Authors
[He CAO (CiaoHe)](https://github.com/CiaoHe)
Provider: IDEA-AI4S



