REAL-MM-RAG_TechReport_BEIR
Source: ModelScope community. Updated 2025-12-05; indexed 2025-10-11.
Download link:
https://modelscope.cn/datasets/ibm-research/REAL-MM-RAG_TechReport_BEIR
Description:
# BEIR Version of REAL-MM-RAG_TechReport
## Summary
This dataset is the **BEIR-compatible version** of the following Hugging Face dataset:
- [`ibm-research/REAL-MM-RAG_TechReport`](https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_TechReport)
It has been reformatted into the **BEIR structure** for evaluation in retrieval settings.
The original dataset is QA-style: each row pairs a query with a document page image.
Here, the queries, qrels, docs, and corpus are separated into BEIR-style subsets.
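The reshaping described above can be illustrated with a minimal sketch. It assumes each original row carries `query`, `answer`, and `image_filename` fields and assigns IDs by order of first appearance. Both the input column names and the ID scheme are illustrative assumptions, not the procedure actually used to build this dataset.

```python
def qa_rows_to_beir(rows):
    """Split QA-style rows into BEIR-style queries, corpus, and qrels tables.

    Assumption: each row is a dict with `query`, `answer`, and `image_filename`.
    IDs are assigned by order of first appearance (hypothetical scheme).
    """
    queries, corpus, qrels = [], [], []
    corpus_ids = {}  # image_filename -> corpus-id

    for row in rows:
        filename = row["image_filename"]
        # Each distinct page image becomes exactly one corpus entry.
        if filename not in corpus_ids:
            corpus_ids[filename] = f"c{len(corpus_ids)}"
            corpus.append(
                {"corpus-id": corpus_ids[filename], "image_filename": filename}
            )

        # Each row contributes one query and one relevance judgment.
        query_id = f"q{len(queries)}"
        queries.append({"query-id": query_id, "query": row["query"]})
        qrels.append(
            {
                "query-id": query_id,
                "corpus-id": corpus_ids[filename],
                "answer": row["answer"],
                "score": 1,
            }
        )
    return queries, corpus, qrels


# Toy rows, made up for illustration (not taken from the dataset).
toy_rows = [
    {"query": "What is the maximum capacity?", "answer": "A", "image_filename": "doc1_page_3"},
    {"query": "Which ports are supported?", "answer": "B", "image_filename": "doc1_page_3"},
    {"query": "How is encryption enabled?", "answer": "C", "image_filename": "doc2_page_7"},
]
queries, corpus, qrels = qa_rows_to_beir(toy_rows)
```

Two queries that point at the same page share one corpus entry, which is why the corpus is kept separate from the relevance judgments.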
## **REAL-MM-RAG_TechReport**
- **Content**: 17 technical documents on IBM FlashSystem.
- **Size**: 1,674 pages.
- **Composition**: Text-heavy with visual elements and structured tables.
- **Purpose**: Assesses model performance in retrieving structured technical content.
## Format
The dataset is provided under the `"test"` split and contains the following subsets:
- **queries**:
  - `query-id` (string): unique query identifier
  - `query` (string): query text
  - `rephrase_level_1/2/3` (string): the query rephrased at levels 1, 2, and 3
  - `language` (string): language of the query
- **qrels**:
  - `query-id` (string): unique query identifier
  - `corpus-id` (string): unique corpus identifier
  - `answer` (string): answer text
  - `score` (int, relevance = 1)
- **docs**:
  - `doc-id` (string): unique document identifier
- **corpus**:
  - `corpus-id` (string, unique per image)
  - `image` (stored as PIL.Image)
  - `image_filename` (string, filename without extension)
  - `doc-id` (string, extracted from filename)
- **default**: alias of `queries` (for convenience).
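As a hedged sketch of how these subsets might be consumed, the snippet below loads each subset with the Hugging Face `datasets` library, turns the qrels rows into the nested `{query-id: {corpus-id: score}}` mapping commonly used by BEIR-style evaluators, and computes recall@k. The Hub repo id is an assumption (it mirrors the ModelScope path), and `load_subsets`, `build_qrels`, and `recall_at_k` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict


def load_subsets(repo_id="ibm-research/REAL-MM-RAG_TechReport_BEIR"):
    """Load each subset of the `test` split (requires the `datasets` package).

    Assumption: the Hugging Face Hub repo id matches the ModelScope path.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    names = ("queries", "qrels", "docs", "corpus")
    return {name: load_dataset(repo_id, name, split="test") for name in names}


def build_qrels(qrels_rows):
    """Turn qrels rows into a nested {query-id: {corpus-id: score}} mapping."""
    qrels = defaultdict(dict)
    for row in qrels_rows:
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)


def recall_at_k(qrels, rankings, k):
    """Mean fraction of relevant corpus ids found in each query's top-k list."""
    per_query = []
    for query_id, relevant in qrels.items():
        top_k = set(rankings.get(query_id, [])[:k])
        hits = sum(1 for corpus_id in relevant if corpus_id in top_k)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query)


# Toy qrels rows and rankings, made up for illustration only.
toy_qrels_rows = [
    {"query-id": "q0", "corpus-id": "c5", "answer": "...", "score": 1},
    {"query-id": "q1", "corpus-id": "c2", "answer": "...", "score": 1},
]
toy_rankings = {"q0": ["c5", "c1", "c2"], "q1": ["c9", "c2", "c5"]}

qrels = build_qrels(toy_qrels_rows)
recall_1 = recall_at_k(qrels, toy_rankings, k=1)
recall_2 = recall_at_k(qrels, toy_rankings, k=2)
```

With the real data, `build_qrels(load_subsets()["qrels"])` would produce the same mapping, and `rankings` would come from whatever retriever is being evaluated against the `corpus` images.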
## Source Paper
```bibtex
@misc{wasserman2025realmmragrealworldmultimodalretrieval,
title={REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark},
author={Navve Wasserman and Roi Pony and Oshri Naparstek and Adi Raz Goldfarb and Eli Schwartz and Udi Barzelay and Leonid Karlinsky},
year={2025},
eprint={2502.12342},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2502.12342},
}
```
Provider: maas
Created: 2025-10-03



