M3SciQA

Name: M3SciQA
Creator: maas
Published: 2025-12-05 16:21:59
License: No description provided

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-02-01

Download link:

https://modelscope.cn/datasets/yale-nlp/M3SciQA

Resource overview:

# 🧑‍🔬 M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark For Evaluating Foundation Models

**EMNLP 2024 Findings**

🖥️ [Code](https://github.com/yale-nlp/M3SciQA)

## Introduction

![image/png](./figures/overview.png)

In the realm of foundation models for scientific research, current benchmarks predominantly focus on single-document, text-only tasks and fail to adequately represent the complex workflow of such research. These benchmarks lack the $\textit{multi-modal}$, $\textit{multi-document}$ nature of scientific research, where comprehension also arises from interpreting non-textual data, such as figures and tables, and gathering information across multiple documents.

To address this issue, we introduce M3SciQA, a Multi-Modal, Multi-Document Scientific Question Answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing (NLP) paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data.

With M3SciQA, we conduct a comprehensive evaluation of 18 prominent foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the development of future foundation models.

## Main Results

### Locality-Specific Evaluation

![image/png](./figures/MRR.png)

### Detail-Specific Evaluation

![image/png](./figures/detail.png)

## Cite

Provider:

maas

Created:

2025-01-29
