
M3SciQA

Source: ModelScope Community, updated 2025-12-05, indexed 2025-02-01
Download link:
https://modelscope.cn/datasets/yale-nlp/M3SciQA
Description:
# 🧑‍🔬 M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark For Evaluating Foundation Models

**EMNLP 2024 Findings**

🖥️ [Code](https://github.com/yale-nlp/M3SciQA)

## Introduction

![image/png](./figures/overview.png)

Current benchmarks for foundation models in scientific research focus predominantly on single-document, text-only tasks and fail to represent the complex workflow of such research. They lack the *multi-modal*, *multi-document* nature of scientific work, where comprehension also depends on interpreting non-textual data, such as figures and tables, and on gathering information across multiple documents.

To address this issue, we introduce M3SciQA, a Multi-Modal, Multi-document Scientific Question Answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing (NLP) paper clusters. Each cluster comprises a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring both multi-modal and multi-document data.

With M3SciQA, we conduct a comprehensive evaluation of 18 prominent foundation models. Our results indicate that current foundation models still significantly underperform human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the development of future foundation models.

## Main Results

### Locality-Specific Evaluation

![image/png](./figures/MRR.png)

### Detail-Specific Evaluation

![image/png](./figures/detail.png)

## Cite

Provider: maas
Created: 2025-01-29