SciMMIR

Name: SciMMIR
Creator: maas
Published: 2025-11-25 18:10:35
License: not specified

ModelScope community. Updated 2025-11-25; indexed 2024-05-15.

Download link:

https://modelscope.cn/datasets/m-a-p/SciMMIR


Resource description:

# Dataset Card for "SciMMIR_dataset"

## SciMMIR

This is the repo for the paper [SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval](https://arxiv.org/abs/2401.13478).

![main_result](./imgs/Framework.png)

In this paper, we propose a novel SciMMIR benchmark and a corresponding dataset designed to address the gap in evaluating multi-modal information retrieval (MMIR) models in the scientific domain.

We define a hierarchical data architecture of "Two subsets, Five subcategories" and use human-created keywords to classify the data (as shown in the table below).

![main_result](./imgs/data_architecture.png)

As shown in the table below, we ran extensive baselines (both fine-tuned and zero-shot) across the subsets and subcategories.

![main_result](./imgs/main_result.png)

For more detailed experimental results and analysis, please refer to our paper [SciMMIR](https://arxiv.org/abs/2401.13478).

## Dataset

The SciMMIR benchmark dataset used in this paper contains 537K scientific image-text pairs extracted from six months of arXiv papers (2023.05 to 2023.10). We will continue to expand it by extracting data from more arXiv papers and will provide larger versions of the dataset.

The dataset can be obtained from Hugging Face Datasets at [m-a-p/SciMMIR](https://huggingface.co/datasets/m-a-p/SciMMIR), and the following code shows how to use it:

```python
import datasets

# Download (or load from the local cache) the dataset from the Hugging Face Hub
ds_remote = datasets.load_dataset("m-a-p/SciMMIR")
test_data = ds_remote['test']

# Each example has a caption ('text'), an image type ('class'), and the image itself
caption = test_data[0]['text']
image_type = test_data[0]['class']
image = test_data[0]['image']
```

## Code

The code for this paper can be found on our [GitHub](https://github.com/Wusiwei0410/SciMMIR).

## Potential TODOs before ACL

**TODO**: case study table

**TODO**: statistics of the paper fields (perhaps in appendix)

**TODO**: See if it's possible to further divide the "Figure Results" subsets.

## Citation

```
@misc{wu2024scimmir,
  title={SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval},
  author={Siwei Wu and Yizhi Li and Kang Zhu and Ge Zhang and Yiming Liang and Kaijing Ma and Chenghao Xiao and Haoran Zhang and Bohao Yang and Wenhu Chen and Wenhao Huang and Noura Al Moubayed and Jie Fu and Chenghua Lin},
  year={2024},
  eprint={2401.13478},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
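Because the benchmark reports results per subset and subcategory, it can be handy to bucket examples by their `class` field before evaluating. The following is a minimal sketch of that grouping, not the authors' code: the helper `group_by_class` is a hypothetical name, and the toy records (including their class labels and captions) are made-up placeholders that only mimic the `text` and `class` fields shown in the loading snippet.

```python
from collections import defaultdict


def group_by_class(records):
    """Group caption texts by the value of each record's 'class' field."""
    groups = defaultdict(list)
    for record in records:
        groups[record["class"]].append(record["text"])
    return dict(groups)


# Toy stand-ins for rows of ds_remote['test']. The class labels and captions
# are hypothetical placeholders, not values taken from the dataset.
toy_records = [
    {"text": "Accuracy of each model on the test set.", "class": "table_result"},
    {"text": "Overview of the proposed pipeline.", "class": "fig_architecture"},
    {"text": "Ablation over the number of layers.", "class": "table_result"},
]

groups = group_by_class(toy_records)
for label, captions in sorted(groups.items()):
    print(label, len(captions))
```

With the real data, the same helper should accept `test_data` directly, since iterating over a split yields one dictionary per example. Note that iterating the full split also decodes every image, so it may be slow.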


Provider: maas

Created: 2024-04-14
