REAL-MM-RAG_FinReport_BEIR
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/ibm-research/REAL-MM-RAG_FinReport_BEIR
# BEIR Version of REAL-MM-RAG_FinReport
## Summary
This dataset is the **BEIR-compatible version** of the following Hugging Face dataset:
- [`ibm-research/REAL-MM-RAG_FinReport`](https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_FinReport)
It has been reformatted into the **BEIR structure** for evaluation in retrieval settings.
The original dataset is QA-style (each row is a query tied to a document image).
Here, queries, qrels, docs, and corpus are separated into BEIR-style subsets.
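As a rough illustration of what this kind of reformatting involves, the sketch below splits QA-style rows into BEIR-style `queries`, `qrels`, and `corpus` tables. The input field names (`query`, `answer`, `image_filename`), the `q0`/`c0` ID scheme, and the toy rows are assumptions made for the example. They are not taken from the original dataset's schema or from the authors' conversion script.

```python
def to_beir(rows):
    """Split QA-style rows into BEIR-style queries, qrels, and corpus tables."""
    queries, qrels, corpus = [], [], []
    corpus_ids = {}  # image_filename -> corpus-id, so each page is stored once

    for i, row in enumerate(rows):
        filename = row["image_filename"]
        if filename not in corpus_ids:
            # First time this page is seen: assign a new (hypothetical) corpus-id.
            corpus_ids[filename] = f"c{len(corpus_ids)}"
            corpus.append({"corpus-id": corpus_ids[filename],
                           "image_filename": filename})

        query_id = f"q{i}"  # hypothetical ID scheme
        queries.append({"query-id": query_id, "query": row["query"]})
        # One relevance judgment per row: the query's own page, with score 1.
        qrels.append({"query-id": query_id,
                      "corpus-id": corpus_ids[filename],
                      "answer": row["answer"],
                      "score": 1})
    return queries, qrels, corpus


# Toy rows with placeholder text (not real dataset content).
rows = [
    {"query": "What was total revenue?", "answer": "placeholder A",
     "image_filename": "report_page_1"},
    {"query": "What was net income?", "answer": "placeholder B",
     "image_filename": "report_page_1"},
    {"query": "How many employees?", "answer": "placeholder C",
     "image_filename": "report_page_2"},
]

queries, qrels, corpus = to_beir(rows)
print(len(queries), len(qrels), len(corpus))  # 3 3 2
```

Two queries that point at the same page share one corpus entry, which is why the corpus is smaller than the query set in this toy case.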
## **REAL-MM-RAG_FinReport**
- **Content**: 19 financial reports from 2005–2023.
- **Size**: 2,687 pages.
- **Composition**: Includes both textual data and structured tables.
- **Purpose**: Designed to test model performance on table-heavy financial data retrieval.
## Format
The dataset is provided under the `"test"` split and contains the following subsets:
- **queries**:
  - `query-id` (string)
  - `query` (string)
  - `rephrase_level_1/2/3` (string)
  - `language` (string)
- **qrels**:
  - `query-id` (string)
  - `corpus-id` (string)
  - `answer` (string)
  - `score` (int, relevance = 1)
- **docs**:
  - `doc-id` (string)
- **corpus**:
  - `corpus-id` (string, unique per image)
  - `image` (stored as PIL.Image)
  - `image_filename` (string, filename without extension)
  - `doc-id` (string, extracted from filename)
- **default**: alias of `queries` (for convenience).
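The subsets listed above can be combined for evaluation roughly as follows. This is a minimal sketch: `load_subset` assumes the Hugging Face `datasets` package and that the dataset is reachable under the repo id `ibm-research/REAL-MM-RAG_FinReport_BEIR` (the ModelScope mirror may need a different loader), and `qrels_to_dict`, `recall_at_k`, and the toy rankings are hypothetical helpers and data for illustration only.

```python
from collections import defaultdict


def load_subset(name):
    """Load one subset ("queries", "qrels", "docs", "corpus") from the "test" split.

    Assumes the Hugging Face `datasets` package and repo id; not called below.
    """
    from datasets import load_dataset  # lazy import, only needed for real data
    return load_dataset("ibm-research/REAL-MM-RAG_FinReport_BEIR",
                        name=name, split="test")


def qrels_to_dict(qrels_rows):
    """Turn flat qrels rows into the nested {query-id: {corpus-id: score}} form."""
    nested = defaultdict(dict)
    for row in qrels_rows:
        nested[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(nested)


def recall_at_k(qrels, rankings, k):
    """Mean fraction of relevant corpus-ids found in each query's top-k results."""
    per_query = []
    for query_id, judged in qrels.items():
        relevant = {cid for cid, score in judged.items() if score > 0}
        if not relevant:
            continue  # skip queries with no positive judgment
        top_k = rankings.get(query_id, [])[:k]
        hits = sum(1 for cid in top_k if cid in relevant)
        per_query.append(hits / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy qrels rows in the documented schema, plus made-up retriever rankings.
qrels_rows = [
    {"query-id": "q0", "corpus-id": "c0", "answer": "placeholder", "score": 1},
    {"query-id": "q1", "corpus-id": "c3", "answer": "placeholder", "score": 1},
]
rankings = {"q0": ["c0", "c1", "c2"], "q1": ["c1", "c2", "c3"]}

qrels = qrels_to_dict(qrels_rows)
print(recall_at_k(qrels, rankings, k=1))  # 0.5
print(recall_at_k(qrels, rankings, k=3))  # 1.0
```

The same metric could be computed separately for rankings produced from `query` and from each `rephrase_level_1/2/3` field, to compare retrieval quality across rephrasing levels.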
## Source Paper
```bibtex
@misc{wasserman2025realmmragrealworldmultimodalretrieval,
title={REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark},
author={Navve Wasserman and Roi Pony and Oshri Naparstek and Adi Raz Goldfarb and Eli Schwartz and Udi Barzelay and Leonid Karlinsky},
year={2025},
eprint={2502.12342},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2502.12342},
}
```
Provider: maas
Created: 2025-10-03



