M3SciQA
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-02-01
Download link:
https://modelscope.cn/datasets/yale-nlp/M3SciQA
Description:
# 🧑🔬 M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
**EMNLP 2024 Findings**
🖥️ [Code](https://github.com/yale-nlp/M3SciQA)
## Introduction

Current benchmarks for foundation models in scientific research focus predominantly on single-document, text-only tasks and fail to represent the complex workflow of such research. They lack the *multi-modal*, *multi-document* nature of scientific work, where comprehension also arises from interpreting non-textual data, such as figures and tables, and from gathering information across multiple documents. To address this issue, we introduce M3SciQA, a Multi-Modal, Multi-Document Scientific Question Answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing (NLP) paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 prominent foundation models. Our results indicate that current foundation models still significantly underperform human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the development of future foundation models.
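To make the notion of a paper cluster concrete, the following is a minimal sketch of how one cluster (a primary paper plus the documents it cites) might be represented in memory. The class names, field names, and example values are hypothetical and chosen for illustration only; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    # Hypothetical fields for illustration; the real dataset schema may differ.
    paper_id: str
    title: str


@dataclass
class PaperCluster:
    """One primary paper together with the documents it cites."""
    primary: Paper
    cited: list[Paper] = field(default_factory=list)

    def documents(self) -> list[Paper]:
        # A question over the cluster may require any of these documents.
        return [self.primary, *self.cited]


# Placeholder data, not taken from the benchmark.
cluster = PaperCluster(
    primary=Paper("p0", "Primary paper"),
    cited=[Paper("c1", "Cited paper A"), Paper("c2", "Cited paper B")],
)
print(len(cluster.documents()))
```

Under this assumed layout, a multi-document question would be answered by searching over `cluster.documents()` rather than over the primary paper alone.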
## Main Results
### Locality-Specific Evaluation

### Detail-Specific Evaluation

## Cite
Provided by: maas
Created: 2025-01-29



