dwzhu/LongEmbed
Hugging Face · Updated 2024-04-21 · Indexed 2024-05-25
Download link: https://hf-mirror.com/datasets/dwzhu/LongEmbed
Resource description:
---
configs:
- config_name: narrativeqa
  data_files:
  - split: corpus
    path: narrativeqa/corpus.jsonl
  - split: queries
    path: narrativeqa/queries.jsonl
  - split: qrels
    path: narrativeqa/qrels.jsonl
- config_name: summ_screen_fd
  data_files:
  - split: corpus
    path: summ_screen_fd/corpus.jsonl
  - split: queries
    path: summ_screen_fd/queries.jsonl
  - split: qrels
    path: summ_screen_fd/qrels.jsonl
- config_name: qmsum
  data_files:
  - split: corpus
    path: qmsum/corpus.jsonl
  - split: queries
    path: qmsum/queries.jsonl
  - split: qrels
    path: qmsum/qrels.jsonl
- config_name: 2wikimqa
  data_files:
  - split: corpus
    path: 2wikimqa/corpus.jsonl
  - split: queries
    path: 2wikimqa/queries.jsonl
  - split: qrels
    path: 2wikimqa/qrels.jsonl
- config_name: passkey
  data_files:
  - split: corpus
    path: passkey/corpus.jsonl
  - split: queries
    path: passkey/queries.jsonl
  - split: qrels
    path: passkey/qrels.jsonl
- config_name: needle
  data_files:
  - split: corpus
    path: needle/corpus.jsonl
  - split: queries
    path: needle/queries.jsonl
  - split: qrels
    path: needle/qrels.jsonl
language:
- en
tags:
- Long Context
size_categories:
- 1K<n<10K
---
## Introduction
This repo contains the LongEmbed benchmark proposed in the paper [LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096) by Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li (arXiv, April 2024). GitHub repo for LongEmbed: https://github.com/dwzhu-pku/LongEmbed.
**LongEmbed** is designed to benchmark long context retrieval. It includes two synthetic tasks and four real-world tasks, featuring documents of varying lengths and dispersed target information. It has been integrated into [MTEB](https://github.com/embeddings-benchmark/mteb) for the convenience of evaluation.
## How to use it?
#### Loading Data
LongEmbed contains six datasets: NarrativeQA, QMSum, 2WikiMultihopQA, SummScreenFD, Passkey, and Needle. Each dataset has three splits: corpus, queries, and qrels. The `corpus.jsonl` file contains the documents, the `queries.jsonl` file contains the queries, and the `qrels.jsonl` file describes which documents are relevant to which queries. To load a specific split of a dataset, you may use:
```python
from datasets import load_dataset

# dataset_name must be one of:
#   "narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle"
# split_name must be one of: "corpus", "queries", "qrels"
dataset_name = "narrativeqa"
split_name = "corpus"
data_list = load_dataset(path="dwzhu/LongEmbed", name=dataset_name, split=split_name)
```
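Once the three splits are loaded, they are usually indexed by ID before retrieval. The sketch below shows one way to do that. It assumes that corpus rows carry `doc_id` and `text`, query rows carry `qid` and `text`, and qrels rows carry `qid` and `doc_id`; these field names are an assumption, not something this card states, so check `data_list.column_names` first. The helper name `build_retrieval_tables` is hypothetical.

```python
from collections import defaultdict


def build_retrieval_tables(corpus_rows, query_rows, qrel_rows):
    """Index corpus, queries, and qrels rows by their IDs.

    Assumed (not confirmed by the dataset card) field names:
    corpus rows have "doc_id" and "text", query rows have "qid" and "text",
    and qrels rows have "qid" and "doc_id".
    """
    corpus = {row["doc_id"]: row["text"] for row in corpus_rows}
    queries = {row["qid"]: row["text"] for row in query_rows}
    # A query may have more than one relevant document, so collect a set.
    qrels = defaultdict(set)
    for row in qrel_rows:
        qrels[row["qid"]].add(row["doc_id"])
    return corpus, queries, dict(qrels)


# Tiny in-memory rows standing in for the loaded splits.
corpus, queries, qrels = build_retrieval_tables(
    [{"doc_id": "d1", "text": "a long story"}, {"doc_id": "d2", "text": "another story"}],
    [{"qid": "q1", "text": "who is the hero?"}],
    [{"qid": "q1", "doc_id": "d2"}],
)
```

The same function accepts the objects returned by `load_dataset`, since iterating over a split yields one dictionary per row.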
#### Evaluation
The evaluation of LongEmbed can be easily conducted using MTEB (>=1.6.22). For the four real tasks, you can evaluate as follows:
```python
from mteb import MTEB

retrieval_task_list = ["LEMBSummScreenFDRetrieval", "LEMBQMSumRetrieval", "LEMBWikimQARetrieval", "LEMBNarrativeQARetrieval"]
output_dict = {}
evaluation = MTEB(tasks=retrieval_task_list)
# TODO: load the embedding model into `model` before evaluation.
# `args` is expected to provide `output_dir` and `batch_size` (e.g. parsed with argparse).
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    # Use the test split when the task has one; otherwise fall back to validation.
    split = "test" if "test" in value else "validation"
    output_dict[key] = {"ndcg@1": value[split]["ndcg_at_1"], "ndcg@10": value[split]["ndcg_at_10"]}
print(output_dict)
```
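The scores read out above are nDCG@1 and nDCG@10. For intuition, here is a minimal sketch of nDCG@k with binary relevance. It is an illustration only and not MTEB's implementation; the function name `ndcg_at_k` and the toy document IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """nDCG@k with binary relevance (gain 1 if relevant, else 0).

    Assumes ranked_doc_ids contains no duplicates.
    """
    relevant = set(relevant_doc_ids)
    # Discounted gain of the relevant documents found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant
    )
    # Ideal ordering: all relevant documents placed first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d3", "d1", "d2"]  # model output, best first
top1 = ndcg_at_k(ranking, {"d1"}, k=1)
top10 = ndcg_at_k(ranking, {"d1"}, k=10)
```

With a single relevant document per query, nDCG@1 is 1 when that document is ranked first and 0 otherwise, so it behaves like top-1 accuracy.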
For the two synthetic tasks, since we examine a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens, an additional parameter of `context_length` is required. You may evaluate as follows:
```python
from mteb import MTEB

needle_passkey_task_list = ["LEMBNeedleRetrieval", "LEMBPasskeyRetrieval"]
output_dict = {}
context_length_list = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
evaluation = MTEB(tasks=needle_passkey_task_list)
# TODO: load the embedding model into `model` before evaluation.
# `args` is expected to provide `output_dir` and `batch_size` (e.g. parsed with argparse).
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    needle_passkey_score_list = []
    # Each context length is reported under its own split, e.g. "test_256".
    for ctx_len in context_length_list:
        needle_passkey_score_list.append([ctx_len, value[f"test_{ctx_len}"]["ndcg_at_1"]])
    # Append the average over all context lengths as a final entry.
    needle_passkey_score_list.append(["avg", sum([x[1] for x in needle_passkey_score_list]) / len(context_length_list)])
    output_dict[key] = {item[0]: item[1] for item in needle_passkey_score_list}
print(output_dict)
```
## Task Description
LongEmbed includes 4 real-world retrieval tasks curated from long-form QA and summarization. Note that for QA and summarization datasets, we use the questions and summaries as queries, respectively.
- [NarrativeQA](https://huggingface.co/datasets/narrativeqa): A QA dataset comprising long stories averaging 50,474 words and corresponding questions about specific content such as characters and events. We adopt the `test` set of the original dataset.
- [2WikiMultihopQA](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e): A multi-hop QA dataset featuring questions with up to 5 hops, synthesized through manually designed templates to prevent shortcut solutions. We use the `test` split of the length-uniformly sampled version from [LongBench](https://huggingface.co/datasets/THUDM/LongBench).
- [QMSum](https://huggingface.co/datasets/tau/scrolls/blob/main/qmsum.zip): A query-based meeting summarization dataset that requires selecting and summarizing relevant segments of meetings in response to queries. We use the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls). Since its test set does not include ground-truth summaries, and its validation set has only 60 documents, which is too few for document retrieval, we include the `train` set in addition to the `validation` set.
- [SummScreenFD](https://huggingface.co/datasets/tau/scrolls/blob/main/summ_screen_fd.zip): A screenplay summarization dataset comprising pairs of TV series transcripts and human-written summaries. Similar to QMSum, its plot details are scattered throughout the transcript and must be integrated to form succinct descriptions in the summary. We use the `validation` set of the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls).
We also include two synthetic tasks, namely needle and passkey retrieval. The former is tailored from the [Needle-in-a-Haystack Retrieval](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) for LLMs. The latter is adapted from [Personalized Passkey Retrieval](https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval), with slight changes to make evaluation more efficient. The advantage of synthetic data is that we can flexibly control the context length and the distribution of target information. For both tasks, we evaluate a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens. For each context length, we include 50 test samples, each comprising 1 query and 100 candidate documents.
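To illustrate what controlling context length and target position means in practice, the sketch below builds one synthetic document by placing a target sentence at a chosen relative depth inside repeated filler text. This is not the authors' generation procedure: the function name `build_haystack`, the filler sentence, and the passkey sentence are all hypothetical, and length is counted in whitespace-separated words rather than tokens.

```python
def build_haystack(filler_sentence, target_sentence, total_words, depth):
    """Embed target_sentence in repeated filler at a relative depth in [0, 1].

    Length is counted in whitespace-separated words here, which is only a
    stand-in for the token-based lengths used by the benchmark.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler_words = filler_sentence.split()
    target_words = target_sentence.split()
    n_filler = max(total_words - len(target_words), 0)
    # Repeat the filler until it covers the requested length, then trim.
    repeats = n_filler // len(filler_words) + 1
    filler = (filler_words * repeats)[:n_filler]
    # Split the filler at the requested depth and insert the target there.
    cut = int(n_filler * depth)
    return " ".join(filler[:cut] + target_words + filler[cut:])


doc = build_haystack(
    filler_sentence="The grass is green. The sky is blue.",
    target_sentence="The pass key is 12345.",
    total_words=256,
    depth=0.5,
)
```

Varying `total_words` changes the context length, and varying `depth` moves the target information from the start (`0.0`) to the end (`1.0`) of the document.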
## Task Statistics
| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---------|--------|-----------|--------|------------------|----------------|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikimQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |
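The word averages in the table can be approximated with a simple helper, and the synthetic counts can be checked against the task description. The sketch below assumes whitespace splitting as the word definition, which this card does not specify, so results may differ slightly from the table. The helper name `average_word_count` and the sample texts are hypothetical.

```python
def average_word_count(texts):
    """Mean number of whitespace-separated words per text (0.0 if empty)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(len(text.split()) for text in texts) / len(texts)


# Consistency check for the synthetic rows, using numbers stated in this card.
context_lengths = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
queries_per_length = 50
# 8 context lengths x 50 queries each gives the 400 queries in the table.
total_queries = len(context_lengths) * queries_per_length
# 800 documents over 8 context lengths gives 100 candidates per length.
docs_per_length = 800 // len(context_lengths)

sample_avg = average_word_count(["who is the hero", "where does the story end"])
```

The check suggests that the 100 candidate documents are shared by the 50 queries at each context length, since 50 separate pools of 100 documents would give far more than 800 documents in total.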
## Citation
If you find our paper helpful, please consider citing it as follows:
```
@article{zhu2024longembed,
title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
journal={arXiv preprint arXiv:2404.12096},
year={2024}
}
```
Provider: dwzhu
Summary of original information
Dataset overview
Dataset configuration
- NarrativeQA
  - Data files:
    - corpus: narrativeqa/corpus.jsonl
    - queries: narrativeqa/queries.jsonl
    - qrels: narrativeqa/qrels.jsonl
- SummScreenFD
  - Data files:
    - corpus: summ_screen_fd/corpus.jsonl
    - queries: summ_screen_fd/queries.jsonl
    - qrels: summ_screen_fd/qrels.jsonl
- QMSum
  - Data files:
    - corpus: qmsum/corpus.jsonl
    - queries: qmsum/queries.jsonl
    - qrels: qmsum/qrels.jsonl
- 2WikiMultihopQA
  - Data files:
    - corpus: 2wikimqa/corpus.jsonl
    - queries: 2wikimqa/queries.jsonl
    - qrels: 2wikimqa/qrels.jsonl
- Passkey
  - Data files:
    - corpus: passkey/corpus.jsonl
    - queries: passkey/queries.jsonl
    - qrels: passkey/qrels.jsonl
- Needle
  - Data files:
    - corpus: needle/corpus.jsonl
    - queries: needle/queries.jsonl
    - qrels: needle/qrels.jsonl
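Language
- Primary language: English (en)
Tags
- Dataset tags: Long Context
Size category
- Dataset size: 1K<n<10K
Dataset usage
Loading data
- Loading method: use the `load_dataset` function, specifying the dataset name and the split.
Evaluation
- Evaluation tool: MTEB (>=1.6.22)
- Evaluation method: for the real-world tasks, evaluate with the dedicated task list; for the synthetic tasks, the context length must also be specified.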
Task description
- Real-world tasks:
  - NarrativeQA: a QA dataset over long stories.
  - 2WikiMultihopQA: a multi-hop QA dataset.
  - QMSum: a query-based meeting summarization dataset.
  - SummScreenFD: a screenplay summarization dataset.
- Synthetic tasks:
  - Needle: tailored from Needle-in-a-Haystack Retrieval.
  - Passkey: adapted from Personalized Passkey Retrieval.
Task statistics
| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---|---|---|---|---|---|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikiMultihopQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |
Citation information
- Paper: LongEmbed: Extending Embedding Models for Long Context Retrieval
- Authors: Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
- Published: 2024
- Citation format:
@article{zhu2024longembed, title={LongEmbed: Extending Embedding Models for Long Context Retrieval}, author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian}, journal={arXiv preprint arXiv:2404.12096}, year={2024} }



