
REAL-MM-RAG_FinReport_BEIR

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/ibm-research/REAL-MM-RAG_FinReport_BEIR
Resource description:
# BEIR Version of REAL-MM-RAG_FinReport

## Summary

This dataset is the **BEIR-compatible version** of the following Hugging Face dataset:

- [`ibm-research/REAL-MM-RAG_FinReport`](https://huggingface.co/datasets/ibm-research/REAL-MM-RAG_FinReport)

It has been reformatted into the **BEIR structure** for evaluation in retrieval settings. The original dataset is QA-style (each row is a query tied to a document image). Here, queries, qrels, docs, and corpus are separated into BEIR-standard splits.

## **REAL-MM-RAG_FinReport**

- **Content**: 19 financial reports from 2005–2023.
- **Size**: 2,687 pages.
- **Composition**: Includes both textual data and structured tables.
- **Purpose**: Designed to test model performance on table-heavy financial data retrieval.

## Format

The dataset is provided under the `"test"` split and contains the following subsets:

- **queries**:
  - `query-id` (string)
  - `query` (string)
  - `rephrase_level_1/2/3` (string)
  - `language` (string)
- **qrels**:
  - `query-id` (string)
  - `corpus-id` (string)
  - `answer` (string)
  - `score` (int, relevance = 1)
- **docs**:
  - `doc-id` (string)
- **corpus**:
  - `corpus-id` (string, unique per image)
  - `image` (stored as PIL.Image)
  - `image_filename` (string, filename without extension)
  - `doc-id` (string, extracted from filename)
- **default**: alias of `queries` (for convenience).

## Source Paper

```bibtex
@misc{wasserman2025realmmragrealworldmultimodalretrieval,
  title={REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark},
  author={Navve Wasserman and Roi Pony and Oshri Naparstek and Adi Raz Goldfarb and Eli Schwartz and Udi Barzelay and Leonid Karlinsky},
  year={2025},
  eprint={2502.12342},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2502.12342},
}
```
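As an illustration of how the `qrels` subset described under Format might be used, the sketch below groups rows with the listed fields (`query-id`, `corpus-id`, `answer`, `score`) into the nested `{query-id: {corpus-id: score}}` mapping commonly used by BEIR-style evaluators, then computes recall@k against a ranking. The sample rows, the ranking, and the helper names `build_qrels` and `recall_at_k` are hypothetical and only stand in for real data and real retriever output; nothing here is taken from the dataset itself.

```python
from collections import defaultdict


def build_qrels(rows):
    """Group qrels rows into {query-id: {corpus-id: score}}."""
    qrels = defaultdict(dict)
    for row in rows:
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return dict(qrels)


def recall_at_k(qrels, ranked, k):
    """Mean fraction of relevant corpus ids that appear in each query's top-k list."""
    per_query = []
    for query_id, relevant in qrels.items():
        positives = {cid for cid, score in relevant.items() if score > 0}
        if not positives:
            continue  # skip queries without any relevant page
        top_k = ranked.get(query_id, [])[:k]
        per_query.append(len(positives.intersection(top_k)) / len(positives))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical rows shaped like the `qrels` subset (ids and answers are made up).
sample_qrels_rows = [
    {"query-id": "q1", "corpus-id": "page_7", "answer": "placeholder", "score": 1},
    {"query-id": "q2", "corpus-id": "page_3", "answer": "placeholder", "score": 1},
]

# Hypothetical retriever output: corpus ids ordered from best to worst per query.
sample_ranking = {
    "q1": ["page_7", "page_2"],
    "q2": ["page_9", "page_3"],
}

qrels = build_qrels(sample_qrels_rows)
print(recall_at_k(qrels, sample_ranking, k=1))
print(recall_at_k(qrels, sample_ranking, k=2))
```

In practice the rows would come from the dataset rather than from literals, for example through `datasets.load_dataset(<repo id>, name="qrels", split="test")`, assuming each subset is exposed as a configuration of that name; that loading step is an assumption and is left out so the sketch runs offline.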

Provider: maas
Created: 2025-10-03