LongRAG
Source: ModelScope community. Updated 2026-01-06; indexed 2025-02-08.
Download link:
https://modelscope.cn/datasets/TIGER-Lab/LongRAG
[📃Paper](https://arxiv.org/abs/2406.15319) | [🌐Website](https://tiger-ai-lab.github.io/LongRAG/) | [💻Github](https://github.com/TIGER-AI-Lab/LongRAG) | [🛢️Datasets](https://huggingface.co/datasets/TIGER-Lab/LongRAG)
## Overview
In the traditional RAG framework, the basic retrieval units are normally short. Such a design forces the retriever to search over a large corpus to find the "needle" unit.
In contrast, the reader only needs to extract answers from the short retrieved units. Such an imbalanced design, with a heavy retriever and a light reader, can lead to sub-optimal
performance. We propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader". Our framework uses a 4K-token retrieval unit, which is 30x longer
than before. By increasing the unit size, we significantly reduce the total number of units. This lowers the burden on the retriever and leads to a remarkable retrieval
score. The long reader then extracts answers from the concatenation of the retrieved units. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3%
on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
## Dataset details
| Subset Name | Brief Description |
|:-----------:|:-----------------:|
| nq | The retrieval output and the reader input for the NQ dataset. |
| nq_corpus | The grouped retrieval corpus we used for NQ in our paper. |
| hotpot_qa | The retrieval output and the reader input for the HotpotQA dataset. |
| hotpot_qa_corpus | The grouped retrieval corpus we used for HotpotQA in our paper. |
| answer_extract_example | The in-context examples we use to extract the short (final) answer from a long answer. |
The following subsets contain the raw data from which the subsets above were processed.
| Subset Name | Brief Description |
|:--------------:|:--------------------------------------------:|
| nq_wiki | The processed Wiki for the NQ dataset. |
| hotpot_qa_wiki | The processed Wiki for the HotpotQA dataset. |
Please see more details below.
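As a starting point, the subsets listed above can be pulled from the Hugging Face Hub with the `datasets` library. The sketch below assumes the repository ID `TIGER-Lab/LongRAG` (taken from the Datasets link at the top of this card) and that each subset name in the tables is a dataset configuration name. The helper names `load_subset` and `missing_fields` are hypothetical, and the field sets are copied from the per-subset descriptions in this card rather than checked against the hosted data.

```python
from typing import Any, Mapping, Set

# Field names per subset, copied from the descriptions in this card
# (an assumption about the hosted schema, not a verified one).
EXPECTED_FIELDS = {
    "nq": {"query_id", "query", "answer", "context_titles", "context"},
    "hotpot_qa": {"query_id", "query", "answer", "sp", "type",
                  "context_titles", "context"},
    "nq_corpus": {"corpus_id", "titles", "text"},
    "hotpot_qa_corpus": {"corpus_id", "titles", "text"},
    "answer_extract_example": {"question", "answers", "long_answer"},
}


def load_subset(subset: str, split: str):
    """Load one subset from the Hub (needs network access and `pip install datasets`)."""
    # Imported lazily so the offline helper below works without the library.
    from datasets import load_dataset
    return load_dataset("TIGER-Lab/LongRAG", subset, split=split)


def missing_fields(subset: str, record: Mapping[str, Any]) -> Set[str]:
    """Return the documented fields that are absent from `record`."""
    return EXPECTED_FIELDS[subset] - set(record)


# Example (requires network), using the smallest documented split of `nq`:
# ds = load_subset("nq", "subset_100")
# print(missing_fields("nq", ds[0]))
```

The split names are only documented for `nq` and `hotpot_qa`, so `split` is left as a required argument instead of guessing a default for the other subsets.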
### nq_corpus
This is our retrieval corpus for NQ. We use the Wikipedia dumps from December 20, 2018, which contain approximately 3 million documents. Each retrieval unit in
our corpus is a group of related documents, organized by the embedded hyperlinks.
There are three fields in this dataset:
+ corpus_id: A unique ID for each retrieval unit.
+ titles: A list of titles, representing the titles of the documents in this unit.
+ text: The concatenated text of all the documents within each unit.
### hotpot_qa_corpus
This is our retrieval corpus for HotpotQA. We use the abstract paragraphs from the October 1, 2017 dump, which contains around 5 million documents. Each retrieval unit in
our corpus is a group of related documents, organized by the embedded hyperlinks.
There are three fields in this dataset:
+ corpus_id: A unique ID for each retrieval unit.
+ titles: A list of titles, representing the titles of the documents in this unit.
+ text: The concatenated text of all the documents within each unit.
### nq
This is the retrieval output and the reader input for the NQ dataset.
+ query_id: A unique ID for each test case.
+ query: The question.
+ answer: The golden label, which is a list of answers.
+ context_titles: A list of titles representing the titles of the documents in the context (concatenation of top-k retrieval units).
+ context: The input into the reader, with a length of approximately 20,000 to 30,000 tokens.
There are three splits: "full", "subset_1000", "subset_100". We suggest starting with "subset_100" for a quick start or debugging and using "subset_1000" and "full" to
obtain relatively stable results. For more details, please refer to our [codebase](https://github.com/TIGER-AI-Lab/LongRAG/).
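Because `answer` is a list of acceptable answers, an exact-match (EM) check has to compare a prediction against every entry. The sketch below uses the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). This is an assumption for illustration; the evaluation code in the codebase may normalize differently, and the function names are hypothetical.

```python
import re
import string
from typing import Iterable, List

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the prediction matches any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: List[str], gold: List[List[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(predictions)
```

Here `predictions[i]` would be the short answer produced by the reader for record `i`, and `gold[i]` would be that record's `answer` field.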
### hotpot_qa
This is the retrieval output and the reader input for the HotpotQA dataset.
+ query_id: A unique ID for each test case.
+ query: The question.
+ answer: The golden label, which is a list of answers.
+ sp: The titles of the two supporting documents.
+ type: The question type, comparison or bridge.
+ context_titles: A list of titles representing the titles of the documents in the context (concatenation of top-k retrieval units).
+ context: The input into the reader, with a length of approximately 20,000 to 30,000 tokens.
There are three splits: "full", "subset_1000", "subset_100". We suggest starting with "subset_100" for a quick start or debugging and using "subset_1000" and "full" to
obtain relatively stable results. For more details, please refer to our [codebase](https://github.com/TIGER-AI-Lab/LongRAG/).
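The `sp` and `context_titles` fields make it possible to check whether retrieval surfaced both supporting documents for a question, broken down by `type`. The sketch below assumes that titles in `sp` use the same string form as those in `context_titles`, which this card does not state. The function names are hypothetical.

```python
from typing import Any, Dict, Mapping, Sequence


def supporting_docs_covered(record: Mapping[str, Any]) -> bool:
    """True if every title in `sp` also appears in `context_titles`."""
    return set(record["sp"]).issubset(record["context_titles"])


def coverage_by_type(records: Sequence[Mapping[str, Any]]) -> Dict[str, float]:
    """Fraction of records, per question type, whose supporting docs are all in context."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for record in records:
        qtype = record["type"]
        totals[qtype] = totals.get(qtype, 0) + 1
        hits[qtype] = hits.get(qtype, 0) + int(supporting_docs_covered(record))
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}
```

A low coverage value for one question type would suggest that reader errors on that type are at least partly retrieval misses, not reading failures.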
### answer_extract_example
These are the in-context examples we use to extract the short (final) answer from a long answer.
+ question: The question.
+ answers: The golden label, which is a list of short answers.
+ long_answer: A long answer for the given question.
For more details about the answer extraction, please refer to Section 6.1 of our [paper](https://arxiv.org/abs/2406.15319).
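One way to use these records is as few-shot demonstrations in a prompt that asks a model to shorten a long answer. The sketch below is illustrative only: the instruction wording and the `Question:` / `Long Answer:` / `Short Answer:` layout are assumptions, not the prompt used in the paper, and `build_extraction_prompt` is a hypothetical helper. It shows the first entry of `answers` for each demonstration.

```python
from typing import Any, Mapping, Sequence


def build_extraction_prompt(
    examples: Sequence[Mapping[str, Any]],
    question: str,
    long_answer: str,
) -> str:
    """Assemble a few-shot prompt from `answer_extract_example` records."""
    # Hypothetical instruction line; the paper's wording may differ.
    parts = ["Extract a short answer to the question from the long answer.\n"]
    for example in examples:
        parts.append(
            f"Question: {example['question']}\n"
            f"Long Answer: {example['long_answer']}\n"
            f"Short Answer: {example['answers'][0]}\n"
        )
    # The final block is left open so the model completes the short answer.
    parts.append(
        f"Question: {question}\nLong Answer: {long_answer}\nShort Answer:"
    )
    return "\n".join(parts)
```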
### nq_wiki
The processed Wiki for the NQ dataset is derived from the English Wikipedia dump from December 20, 2018. Following previous work,
some pages, such as list pages and disambiguation pages, are removed, resulting in approximately 3.2 million documents. Each row
contains information about one Wikipedia document:
+ title: The title of the document.
+ degree: The number of documents linked to or from this document.
+ abs_adj: The titles of the documents linked to or from this document within the abstract paragraph.
+ full_adj: The titles of the documents linked to or from this document within the whole page.
+ doc_size: The number of tokens in this document.
+ doc_dict: The text of this document.
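The `degree`, `full_adj`, and `doc_size` fields are the kind of information needed to pack hyperlinked documents into longer retrieval units. The following is a simplified greedy sketch, not the grouping algorithm from the paper: it visits documents in order of ascending degree and places each one into the group of an already-placed linked neighbor if the token budget allows. It assumes `full_adj` is a list of titles, and the default `max_tokens=4096` is an assumed reading of "4K".

```python
from typing import Any, Dict, List, Mapping


def group_documents(
    docs: List[Mapping[str, Any]], max_tokens: int = 4096
) -> List[List[str]]:
    """Greedily pack linked documents into units of at most `max_tokens` tokens."""
    group_of: Dict[str, int] = {}  # title -> index of the group it belongs to
    groups: List[List[str]] = []   # group index -> titles in that group
    sizes: List[int] = []          # group index -> total tokens in that group

    for doc in sorted(docs, key=lambda d: d["degree"]):
        placed = False
        for neighbor in doc["full_adj"]:
            index = group_of.get(neighbor)
            if index is not None and sizes[index] + doc["doc_size"] <= max_tokens:
                groups[index].append(doc["title"])
                sizes[index] += doc["doc_size"]
                group_of[doc["title"]] = index
                placed = True
                break
        if not placed:
            # No linked neighbor has room (or none is placed yet): start a new unit.
            group_of[doc["title"]] = len(groups)
            groups.append([doc["title"]])
            sizes.append(doc["doc_size"])
    return groups
```

Each resulting group corresponds to the `titles` list of one unit in the grouped corpus; concatenating the member documents' text would give that unit's `text`.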
### hotpot_qa_wiki
The processed Wiki for the HotpotQA dataset is derived from the English Wikipedia dump from October 1, 2017, which contains abstract paragraphs from
approximately 5.2 million documents. Each row contains information about one Wikipedia document:
+ title: The title of the document.
+ degree: The number of documents linked to or from this document.
+ abs_adj: The titles of the documents linked to or from this document within the abstract paragraph.
+ full_adj: The titles of the documents linked to or from this document within the whole page.
+ doc_size: The number of tokens in this document.
+ doc_dict: The text of this document.
## Citation
```bibtex
@article{jiang2024longrag,
title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
author={Ziyan Jiang and Xueguang Ma and Wenhu Chen},
journal={arXiv preprint arXiv:2406.15319},
year={2024},
url={https://arxiv.org/abs/2406.15319}
}
```
Provider: maas
Created: 2025-02-03



