dwzhu/LongEmbed

Name: dwzhu/LongEmbed
Creator: dwzhu
Published: 2024-04-21 02:43:37
License: no description available

Source: Hugging Face. Updated 2024-04-21, indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/dwzhu/LongEmbed

Resource description:

---
configs:
- config_name: narrativeqa
  data_files:
  - split: corpus
    path: narrativeqa/corpus.jsonl
  - split: queries
    path: narrativeqa/queries.jsonl
  - split: qrels
    path: narrativeqa/qrels.jsonl
- config_name: summ_screen_fd
  data_files:
  - split: corpus
    path: summ_screen_fd/corpus.jsonl
  - split: queries
    path: summ_screen_fd/queries.jsonl
  - split: qrels
    path: summ_screen_fd/qrels.jsonl
- config_name: qmsum
  data_files:
  - split: corpus
    path: qmsum/corpus.jsonl
  - split: queries
    path: qmsum/queries.jsonl
  - split: qrels
    path: qmsum/qrels.jsonl
- config_name: 2wikimqa
  data_files:
  - split: corpus
    path: 2wikimqa/corpus.jsonl
  - split: queries
    path: 2wikimqa/queries.jsonl
  - split: qrels
    path: 2wikimqa/qrels.jsonl
- config_name: passkey
  data_files:
  - split: corpus
    path: passkey/corpus.jsonl
  - split: queries
    path: passkey/queries.jsonl
  - split: qrels
    path: passkey/qrels.jsonl
- config_name: needle
  data_files:
  - split: corpus
    path: needle/corpus.jsonl
  - split: queries
    path: needle/queries.jsonl
  - split: qrels
    path: needle/qrels.jsonl
language:
- en
tags:
- Long Context
size_categories:
- 1K<n<10K
---

## Introduction

This repo contains the LongEmbed benchmark proposed in the paper [LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096). Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li, arXiv 2024.04. GitHub repo for LongEmbed: https://github.com/dwzhu-pku/LongEmbed.

**LongEmbed** is designed to benchmark long context retrieval. It includes two synthetic tasks and four real-world tasks, featuring documents of varying lengths and dispersed target information. It has been integrated into [MTEB](https://github.com/embeddings-benchmark/mteb) for convenient evaluation.

## How to use it?

#### Loading Data

LongEmbed contains six datasets: NarrativeQA, QMSum, 2WikiMultihopQA, SummScreenFD, Passkey, and Needle. Each dataset has three splits: corpus, queries, and qrels. The `corpus.jsonl` file contains the documents, the `queries.jsonl` file contains the queries, and the `qrels.jsonl` file describes the relevance. To load a specific split of a dataset, you may use:

```python
from datasets import load_dataset

# dataset_name in ["narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle"]
# split_name in ["corpus", "queries", "qrels"]
dataset_name = "narrativeqa"  # pick any name from the list above
split_name = "corpus"         # pick any split from the list above
data_list = load_dataset(path="dwzhu/LongEmbed", name=dataset_name, split=split_name)
```

#### Evaluation

LongEmbed can be evaluated with MTEB (>=1.6.22). For the four real-world tasks, you can evaluate as follows:

```python
from mteb import MTEB

retrieval_task_list = ["LEMBSummScreenFDRetrieval", "LEMBQMSumRetrieval", "LEMBWikimQARetrieval", "LEMBNarrativeQARetrieval"]
output_dict = {}
evaluation = MTEB(tasks=retrieval_task_list)
# TODO: load the model before evaluation.
# `model` and `args` (providing `output_dir` and `batch_size`) must be defined by your own script.
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    # Use the test split when the task has one, otherwise fall back to validation.
    split = "test" if "test" in value else "validation"
    output_dict[key] = {"ndcg@1": value[split]["ndcg_at_1"], "ndcg@10": value[split]["ndcg_at_10"]}
print(output_dict)
```

For the two synthetic tasks, since we examine a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens, an additional parameter of `context_length` is required. You may evaluate as follows:

```python
from mteb import MTEB

needle_passkey_task_list = ["LEMBNeedleRetrieval", "LEMBPasskeyRetrieval"]
output_dict = {}
context_length_list = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
evaluation = MTEB(tasks=needle_passkey_task_list)
# TODO: load the model before evaluation.
# `model` and `args` (providing `output_dir` and `batch_size`) must be defined by your own script.
results = evaluation.run(model, output_folder=args.output_dir, overwrite_results=True, batch_size=args.batch_size, verbosity=0)
for key, value in results.items():
    needle_passkey_score_list = []
    # Collect nDCG@1 for each context length; results are keyed as "test_<context length>".
    for ctx_len in context_length_list:
        needle_passkey_score_list.append([ctx_len, value[f"test_{ctx_len}"]["ndcg_at_1"]])
    # The average is computed before the "avg" entry itself is appended.
    needle_passkey_score_list.append(["avg", sum([x[1] for x in needle_passkey_score_list]) / len(context_length_list)])
    output_dict[key] = {item[0]: item[1] for item in needle_passkey_score_list}
print(output_dict)
```

## Task Description

LongEmbed includes 4 real-world retrieval tasks curated from long-form QA and summarization. Note that for QA and summarization datasets, we use the questions and summaries as queries, respectively.

- [NarrativeQA](https://huggingface.co/datasets/narrativeqa): A QA dataset comprising long stories averaging 50,474 words and corresponding questions about specific content such as characters and events. We adopt the `test` set of the original dataset.
- [2WikiMultihopQA](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e): A multi-hop QA dataset featuring questions with up to 5 hops, synthesized through manually designed templates to prevent shortcut solutions. We use the `test` split of the length-uniformly sampled version from [LongBench](https://huggingface.co/datasets/THUDM/LongBench).
- [QMSum](https://huggingface.co/datasets/tau/scrolls/blob/main/qmsum.zip): A query-based meeting summarization dataset that requires selecting and summarizing relevant segments of meetings in response to queries. We use the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls). Since its test set does not include ground truth summaries, and its validation set has only 60 documents, which is too few for document retrieval, we include the `train` set in addition to the `validation` set.
- [SummScreenFD](https://huggingface.co/datasets/tau/scrolls/blob/main/summ_screen_fd.zip): A screenplay summarization dataset comprising pairs of TV series transcripts and human-written summaries. Similar to QMSum, its plot details are scattered throughout the transcript and must be integrated to form succinct descriptions in the summary. We use the `validation` set of the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls).

We also include two synthetic tasks, namely needle and passkey retrieval. The former is tailored from the [Needle-in-a-Haystack Retrieval](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) for LLMs. The latter is adapted from [Personalized Passkey Retrieval](https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval), with slight changes for evaluation efficiency. The advantage of synthetic data is that we can flexibly control the context length and the distribution of target information. For both tasks, we evaluate a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens. For each context length, we include 50 test samples, each comprising 1 query and 100 candidate documents.

## Task Statistics

| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---------|--------|-----------|--------|------------------|----------------|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikimQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |

## Citation

If you find our paper helpful, please consider citing it as follows:

```
@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}
```

Provider:

dwzhu

Summary of Original Information

Dataset Overview

Dataset Configuration

NarrativeQA
- Data files:
  - corpus: narrativeqa/corpus.jsonl
  - queries: narrativeqa/queries.jsonl
  - qrels: narrativeqa/qrels.jsonl
SummScreenFD
- Data files:
  - corpus: summ_screen_fd/corpus.jsonl
  - queries: summ_screen_fd/queries.jsonl
  - qrels: summ_screen_fd/qrels.jsonl
QMSum
- Data files:
  - corpus: qmsum/corpus.jsonl
  - queries: qmsum/queries.jsonl
  - qrels: qmsum/qrels.jsonl
2WikiMultihopQA
- Data files:
  - corpus: 2wikimqa/corpus.jsonl
  - queries: 2wikimqa/queries.jsonl
  - qrels: 2wikimqa/qrels.jsonl
Passkey
- Data files:
  - corpus: passkey/corpus.jsonl
  - queries: passkey/queries.jsonl
  - qrels: passkey/qrels.jsonl
Needle
- Data files:
  - corpus: needle/corpus.jsonl
  - queries: needle/queries.jsonl
  - qrels: needle/qrels.jsonl
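
The file listing follows a regular `<config>/<split>.jsonl` layout. A minimal sketch that enumerates those repository-relative paths is given here; the helper name `data_file` and the constants are illustrative only and are not part of the dataset or of any library.

```python
# Config and split names as they appear in the dataset card.
CONFIGS = ["narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle"]
SPLITS = ["corpus", "queries", "qrels"]


def data_file(config: str, split: str) -> str:
    """Return the repository-relative path of one JSONL file (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{config}/{split}.jsonl"


# Six configs times three splits gives eighteen files in total.
ALL_FILES = [data_file(c, s) for c in CONFIGS for s in SPLITS]

if __name__ == "__main__":
    for path in ALL_FILES:
        print(path)
```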

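Language

Primary language: English (en)

Size Category

Dataset size: 1K<n<10K

Dataset Usage

Loading Data

Loading method: use the `load_dataset` function, specifying the dataset name and the split.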
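
As a rough sketch of the loading step, the wrapper below validates the dataset and split names before calling `load_dataset` from the Hugging Face `datasets` library, and gathers the three splits of one dataset into a dictionary. The function names `load_longembed` and `load_all_splits` are assumptions made for illustration; only the `load_dataset(path=..., name=..., split=...)` call comes from the dataset card.

```python
VALID_NAMES = ("narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle")
VALID_SPLITS = ("corpus", "queries", "qrels")


def load_longembed(dataset_name: str, split_name: str):
    """Load one split of one LongEmbed dataset (hypothetical wrapper)."""
    if dataset_name not in VALID_NAMES:
        raise ValueError(f"dataset_name must be one of {VALID_NAMES}, got {dataset_name!r}")
    if split_name not in VALID_SPLITS:
        raise ValueError(f"split_name must be one of {VALID_SPLITS}, got {split_name!r}")
    # Imported lazily so that name validation works even without the library installed.
    from datasets import load_dataset
    return load_dataset(path="dwzhu/LongEmbed", name=dataset_name, split=split_name)


def load_all_splits(dataset_name: str) -> dict:
    """Return {"corpus": ..., "queries": ..., "qrels": ...} for one dataset."""
    return {split: load_longembed(dataset_name, split) for split in VALID_SPLITS}
```

Calling `load_all_splits("qmsum")` would download the three QMSum files on first use, so it needs network access and the `datasets` package.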

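Evaluation

Evaluation tool: MTEB (>=1.6.22)
Evaluation method: for the real-world tasks, evaluate with the specific task list; for the synthetic tasks, a context length must additionally be specified.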
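
The README's evaluation snippets report `ndcg_at_1` and `ndcg_at_10`. To illustrate what these scores measure, here is a minimal sketch of nDCG@k with linear gain and a log2 rank discount. It is a simplified stand-in written for this page, not MTEB's implementation, and the function names and toy document IDs are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r (1-based) is divided by log2(r + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, qrels, k):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs ordered by decreasing model score.
    qrels: mapping from document ID to relevance (missing IDs count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    toy_qrels = {"doc_a": 1}  # a single relevant document
    print(ndcg_at_k(["doc_a", "doc_b"], toy_qrels, k=1))   # relevant doc ranked first
    print(ndcg_at_k(["doc_b", "doc_a"], toy_qrels, k=1))   # relevant doc missed at rank 1
    print(ndcg_at_k(["doc_b", "doc_a"], toy_qrels, k=10))  # partial credit at rank 2
```

With a single relevant document per query, nDCG@1 reduces to top-1 accuracy, while nDCG@10 still gives partial credit when the relevant document appears lower in the top ten.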

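Task Description

Real-world tasks:
- NarrativeQA: a QA dataset over long stories.
- 2WikiMultihopQA: a multi-hop QA dataset.
- QMSum: a query-based meeting summarization dataset.
- SummScreenFD: a screenplay summarization dataset.
Synthetic tasks:
- Needle: tailored from Needle-in-a-Haystack Retrieval.
- Passkey: adapted from Personalized Passkey Retrieval.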

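Task Statistics

| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---------|--------|-----------|--------|------------------|----------------|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikiMultihopQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |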
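
A short worked check of the statistics, using only numbers stated in the dataset card: the synthetic tasks cover 8 context lengths with 50 test samples each, which accounts for their 400 queries. The queries-per-document ratio is a derived figure added here for illustration, and reading the 800 synthetic documents as 100 per context length is an inference from the table rather than something the card states.

```python
# (number of queries, number of documents) per dataset, copied from the statistics table.
STATS = {
    "NarrativeQA": (10449, 355),
    "QMSum": (1527, 197),
    "2WikiMultihopQA": (300, 300),
    "SummScreenFD": (336, 336),
    "Passkey": (400, 800),
    "Needle": (400, 800),
}

CONTEXT_LENGTHS = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
SAMPLES_PER_LENGTH = 50  # test samples per context length, as stated in the card


def queries_per_doc(name: str) -> float:
    """Average number of queries that share one document."""
    num_queries, num_docs = STATS[name]
    return num_queries / num_docs


# 8 context lengths x 50 samples = 400 queries for each synthetic task.
expected_synthetic_queries = len(CONTEXT_LENGTHS) * SAMPLES_PER_LENGTH

# Inference from the table: 800 documents spread over 8 context lengths.
docs_per_context_length = STATS["Passkey"][1] // len(CONTEXT_LENGTHS)

if __name__ == "__main__":
    for name in STATS:
        print(f"{name}: {queries_per_doc(name):.2f} queries per document")
    print("synthetic queries:", expected_synthetic_queries)
    print("synthetic documents per context length:", docs_per_context_length)
```

The ratios show why the tasks differ in character: in NarrativeQA and QMSum many queries point at the same long document, whereas 2WikiMultihopQA and SummScreenFD pair each query with its own document.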

Citation Information

Paper: LongEmbed: Extending Embedding Models for Long Context Retrieval
Authors: Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
Published: 2024
Citation format:

@article{zhu2024longembed, title={LongEmbed: Extending Embedding Models for Long Context Retrieval}, author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian}, journal={arXiv preprint arXiv:2404.12096}, year={2024} }
