
dwzhu/LongEmbed

Source: Hugging Face. Updated: 2024-04-21. Indexed: 2024-05-25.
Download link:
https://hf-mirror.com/datasets/dwzhu/LongEmbed
Resource description:
---
configs:
- config_name: narrativeqa
  data_files:
  - split: corpus
    path: narrativeqa/corpus.jsonl
  - split: queries
    path: narrativeqa/queries.jsonl
  - split: qrels
    path: narrativeqa/qrels.jsonl
- config_name: summ_screen_fd
  data_files:
  - split: corpus
    path: summ_screen_fd/corpus.jsonl
  - split: queries
    path: summ_screen_fd/queries.jsonl
  - split: qrels
    path: summ_screen_fd/qrels.jsonl
- config_name: qmsum
  data_files:
  - split: corpus
    path: qmsum/corpus.jsonl
  - split: queries
    path: qmsum/queries.jsonl
  - split: qrels
    path: qmsum/qrels.jsonl
- config_name: 2wikimqa
  data_files:
  - split: corpus
    path: 2wikimqa/corpus.jsonl
  - split: queries
    path: 2wikimqa/queries.jsonl
  - split: qrels
    path: 2wikimqa/qrels.jsonl
- config_name: passkey
  data_files:
  - split: corpus
    path: passkey/corpus.jsonl
  - split: queries
    path: passkey/queries.jsonl
  - split: qrels
    path: passkey/qrels.jsonl
- config_name: needle
  data_files:
  - split: corpus
    path: needle/corpus.jsonl
  - split: queries
    path: needle/queries.jsonl
  - split: qrels
    path: needle/qrels.jsonl
language:
- en
tags:
- Long Context
size_categories:
- 1K<n<10K
---

## Introduction

This repo contains the LongEmbed benchmark proposed in the paper [LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096). Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li, arXiv, 2024.04. GitHub repo for LongEmbed: https://github.com/dwzhu-pku/LongEmbed.

**LongEmbed** is designed to benchmark long context retrieval. It includes two synthetic tasks and four real-world tasks, featuring documents of varying lengths and dispersed target information. It has been integrated into [MTEB](https://github.com/embeddings-benchmark/mteb) for convenient evaluation.

## How to use it?

#### Loading Data

LongEmbed contains six datasets: NarrativeQA, QMSum, 2WikiMultihopQA, SummScreenFD, Passkey, and Needle. Each dataset has three splits: corpus, queries, and qrels. The `corpus.jsonl` file contains the documents, the `queries.jsonl` file contains the queries, and the `qrels.jsonl` file describes the relevance between them. To load a specific split of a dataset, you may use:

```python
from datasets import load_dataset

# dataset_name in ["narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle"]
# split_name in ["corpus", "queries", "qrels"]
dataset_name = "narrativeqa"
split_name = "corpus"
data_list = load_dataset(path="dwzhu/LongEmbed", name=dataset_name, split=split_name)
```

#### Evaluation

LongEmbed can be evaluated with MTEB (>=1.6.22). For the four real-world tasks, you can evaluate as follows:

```python
from mteb import MTEB

retrieval_task_list = [
    "LEMBSummScreenFDRetrieval",
    "LEMBQMSumRetrieval",
    "LEMBWikimQARetrieval",
    "LEMBNarrativeQARetrieval",
]
output_dict = {}
evaluation = MTEB(tasks=retrieval_task_list)

# TODO: load the model before evaluation.
# `model` and `args` (output_dir, batch_size) must be defined by the caller.
results = evaluation.run(
    model,
    output_folder=args.output_dir,
    overwrite_results=True,
    batch_size=args.batch_size,
    verbosity=0,
)
for key, value in results.items():
    # Use the test split when it exists, otherwise fall back to validation.
    split = "test" if "test" in value else "validation"
    output_dict[key] = {
        "ndcg@1": value[split]["ndcg_at_1"],
        "ndcg@10": value[split]["ndcg_at_10"],
    }
print(output_dict)
```

For the two synthetic tasks, since we examine a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens, an additional `context_length` parameter is required. You may evaluate as follows:

```python
from mteb import MTEB

needle_passkey_task_list = ["LEMBNeedleRetrieval", "LEMBPasskeyRetrieval"]
output_dict = {}
context_length_list = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
evaluation = MTEB(tasks=needle_passkey_task_list)

# TODO: load the model before evaluation.
# `model` and `args` (output_dir, batch_size) must be defined by the caller.
results = evaluation.run(
    model,
    output_folder=args.output_dir,
    overwrite_results=True,
    batch_size=args.batch_size,
    verbosity=0,
)
for key, value in results.items():
    # Collect nDCG@1 per context length, then append the average over all lengths.
    needle_passkey_score_list = []
    for ctx_len in context_length_list:
        needle_passkey_score_list.append([ctx_len, value[f"test_{ctx_len}"]["ndcg_at_1"]])
    needle_passkey_score_list.append(
        ["avg", sum(x[1] for x in needle_passkey_score_list) / len(context_length_list)]
    )
    output_dict[key] = {item[0]: item[1] for item in needle_passkey_score_list}
print(output_dict)
```

## Task Description

LongEmbed includes 4 real-world retrieval tasks curated from long-form QA and summarization. Note that for QA and summarization datasets, we use the questions and summaries as queries, respectively.

- [NarrativeQA](https://huggingface.co/datasets/narrativeqa): A QA dataset comprising long stories averaging 50,474 words and corresponding questions about specific content such as characters and events. We adopt the `test` set of the original dataset.
- [2WikiMultihopQA](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e): A multi-hop QA dataset featuring questions with up to 5 hops, synthesized through manually designed templates to prevent shortcut solutions. We use the `test` split of the length-uniformly sampled version from [LongBench](https://huggingface.co/datasets/THUDM/LongBench).
- [QMSum](https://huggingface.co/datasets/tau/scrolls/blob/main/qmsum.zip): A query-based meeting summarization dataset that requires selecting and summarizing relevant segments of meetings in response to queries. We use the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls). Since its test set does not include ground truth summaries, and its validation set has only 60 documents, which is too few for document retrieval, we include the `train` set in addition to the `validation` set.
- [SummScreenFD](https://huggingface.co/datasets/tau/scrolls/blob/main/summ_screen_fd.zip): A screenplay summarization dataset comprising pairs of TV series transcripts and human-written summaries. Similar to QMSum, its plot details are scattered throughout the transcript and must be integrated to form succinct descriptions in the summary. We use the `validation` set of the version processed by [SCROLLS](https://huggingface.co/datasets/tau/scrolls).

We also include two synthetic tasks, namely needle and passkey retrieval. The former is tailored from the [Needle-in-a-Haystack Retrieval](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) for LLMs. The latter is adapted from [Personalized Passkey Retrieval](https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval), with slight changes for evaluation efficiency. The advantage of synthetic data is that we can flexibly control the context length and the distribution of target information. For both tasks, we evaluate a broad context range of {256, 512, 1024, 2048, 4096, 8192, 16384, 32768} tokens. For each context length, we include 50 test samples, each comprising 1 query and 100 candidate documents.

## Task Statistics

| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---------|--------|-----------|--------|------------------|----------------|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikimQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |

## Citation

If you find our paper helpful, please consider citing it as follows:

```
@article{zhu2024longembed,
  title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
  author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
  journal={arXiv preprint arXiv:2404.12096},
  year={2024}
}
```
Provider:
dwzhu
Summary of original information

Dataset overview

Dataset configurations

  1. NarrativeQA

    • Data files:
      • corpus: narrativeqa/corpus.jsonl
      • queries: narrativeqa/queries.jsonl
      • qrels: narrativeqa/qrels.jsonl
  2. SummScreenFD

    • Data files:
      • corpus: summ_screen_fd/corpus.jsonl
      • queries: summ_screen_fd/queries.jsonl
      • qrels: summ_screen_fd/qrels.jsonl
  3. QMSum

    • Data files:
      • corpus: qmsum/corpus.jsonl
      • queries: qmsum/queries.jsonl
      • qrels: qmsum/qrels.jsonl
  4. 2WikiMultihopQA

    • Data files:
      • corpus: 2wikimqa/corpus.jsonl
      • queries: 2wikimqa/queries.jsonl
      • qrels: 2wikimqa/qrels.jsonl
  5. Passkey

    • Data files:
      • corpus: passkey/corpus.jsonl
      • queries: passkey/queries.jsonl
      • qrels: passkey/qrels.jsonl
  6. Needle

    • Data files:
      • corpus: needle/corpus.jsonl
      • queries: needle/queries.jsonl
      • qrels: needle/qrels.jsonl
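
Each configuration pairs a corpus, a set of queries, and a qrels file that links them. The sketch below shows one way those three pieces might be joined into lookup tables. The field names `doc_id`, `qid`, and `text` are assumptions that this card does not state, and the rows are toy data made up for illustration.

```python
from collections import defaultdict


def build_retrieval_tables(corpus_rows, query_rows, qrel_rows):
    """Turn corpus/queries/qrels rows into lookup dictionaries.

    Assumed (not confirmed by the dataset card): corpus rows carry
    `doc_id` and `text`, query rows carry `qid` and `text`, and qrels
    rows carry `qid` and `doc_id`.
    """
    corpus = {row["doc_id"]: row["text"] for row in corpus_rows}
    queries = {row["qid"]: row["text"] for row in query_rows}
    # Map each query id to the set of document ids marked relevant for it.
    qrels = defaultdict(set)
    for row in qrel_rows:
        qrels[row["qid"]].add(row["doc_id"])
    return corpus, queries, dict(qrels)


# Toy rows, invented for illustration only (not taken from LongEmbed).
corpus_rows = [
    {"doc_id": "d1", "text": "a long story"},
    {"doc_id": "d2", "text": "a meeting transcript"},
]
query_rows = [{"qid": "q1", "text": "who is the hero?"}]
qrel_rows = [{"qid": "q1", "doc_id": "d1"}]

corpus, queries, qrels = build_retrieval_tables(corpus_rows, query_rows, qrel_rows)
print(qrels)
```

Keeping qrels as a mapping from query id to a set of document ids makes it cheap to check whether a retrieved document counts as a hit.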

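Language

  • Primary language: English (en)

Tags

  • Dataset tags: Long Context

Size category

  • Dataset size: 1K<n<10K

Dataset usage

Loading data

  • Loading method: call the load_dataset function with the dataset (configuration) name and the split name.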
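
Because the configuration and split names are plain strings, a typo only surfaces when the download fails. A small wrapper can catch it earlier. This is a minimal sketch: the helper names `longembed_load_kwargs` and `load_longembed` are hypothetical, and only `load_dataset(path=..., name=..., split=...)` comes from the dataset card.

```python
VALID_CONFIGS = ("narrativeqa", "summ_screen_fd", "qmsum", "2wikimqa", "passkey", "needle")
VALID_SPLITS = ("corpus", "queries", "qrels")


def longembed_load_kwargs(dataset_name, split_name):
    """Validate the names and return keyword arguments for load_dataset.

    Hypothetical helper for illustration; not part of LongEmbed or `datasets`.
    """
    if dataset_name not in VALID_CONFIGS:
        raise ValueError(f"unknown dataset {dataset_name!r}; expected one of {VALID_CONFIGS}")
    if split_name not in VALID_SPLITS:
        raise ValueError(f"unknown split {split_name!r}; expected one of {VALID_SPLITS}")
    return {"path": "dwzhu/LongEmbed", "name": dataset_name, "split": split_name}


def load_longembed(dataset_name, split_name):
    """Load one split; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so validation works without it

    return load_dataset(**longembed_load_kwargs(dataset_name, split_name))


print(longembed_load_kwargs("qmsum", "qrels"))
```

Calling `load_longembed("qmsum", "qrels")` would then fetch the QMSum relevance judgments, while a misspelled name such as `"QMSum"` raises a `ValueError` before any download starts.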

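Evaluation

  • Evaluation tool: MTEB (>=1.6.22)
  • Evaluation method: real-world tasks are evaluated with a dedicated task list; synthetic tasks additionally require a context length.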
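
The evaluation scripts in the card report nDCG@1 and nDCG@10. For intuition, the sketch below implements a standard binary-relevance nDCG@k. It is an illustrative formulation, not MTEB's own code, whose implementation may differ in details such as tie handling.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """nDCG@k with binary relevance (a document is either relevant or not)."""
    relevant = set(relevant_doc_ids)
    # Discounted gain of every relevant document found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: all relevant documents placed first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant document ranked second: missed at k=1, discounted at k=10.
print(ndcg_at_k(["d2", "d1", "d3"], ["d1"], k=1))
print(ndcg_at_k(["d2", "d1", "d3"], ["d1"], k=10))
```

With a single relevant document per query, nDCG@1 reduces to top-1 accuracy, which is why it suits the needle and passkey tasks.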

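Task description

  • Real-world tasks:

    • NarrativeQA: QA dataset over long stories.
    • 2WikiMultihopQA: multi-hop QA dataset.
    • QMSum: query-based meeting summarization dataset.
    • SummScreenFD: screenplay summarization dataset.
  • Synthetic tasks:

    • Needle: tailored from Needle-in-a-Haystack Retrieval.
    • Passkey: adapted from Personalized Passkey Retrieval.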
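
Synthetic tasks allow the context length and the position of the target information to be set directly. The sketch below illustrates that idea only. It is not the procedure used to build LongEmbed: the function `make_haystack`, the filler text, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
def make_haystack(needle, filler_sentence, target_words, depth):
    """Embed `needle` in repeated filler text of roughly `target_words` words.

    `depth` in [0, 1] sets the relative position of the needle
    (0 = start, 1 = end). Hypothetical helper; lengths are counted in
    whitespace-separated words, not model tokens.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    needle_words = needle.split()
    filler_words = filler_sentence.split()
    if not filler_words:
        raise ValueError("filler_sentence must contain at least one word")
    n_filler = max(target_words - len(needle_words), 0)
    # Repeat the filler sentence until enough words are available, then trim.
    repeats = n_filler // len(filler_words) + 1
    filler = (filler_words * repeats)[:n_filler]
    cut = round(depth * n_filler)
    return " ".join(filler[:cut] + needle_words + filler[cut:])


doc = make_haystack("The passkey is 12345.", "The grass is green.", target_words=256, depth=0.5)
print(len(doc.split()))
```

Sweeping `target_words` over the context lengths and `depth` over a few positions would produce a grid of documents that differ only in length and needle position.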

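Task statistics

| Dataset | Domain | # Queries | # Docs | Avg. Query Words | Avg. Doc Words |
|---------|--------|-----------|--------|------------------|----------------|
| NarrativeQA | Literature, Film | 10,449 | 355 | 9 | 50,474 |
| QMSum | Meeting | 1,527 | 197 | 71 | 10,058 |
| 2WikiMultihopQA | Wikipedia | 300 | 300 | 12 | 6,132 |
| SummScreenFD | ScreenWriting | 336 | 336 | 102 | 5,582 |
| Passkey | Synthetic | 400 | 800 | 11 | - |
| Needle | Synthetic | 400 | 800 | 7 | - |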
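
A few derived figures help when reading the statistics. The sketch below recomputes the query-to-document ratio per dataset and checks that 8 context lengths with 50 test samples each account for the 400 synthetic queries. The counts are copied from the table; the variable names are illustrative.

```python
# (number of queries, number of documents) per dataset, copied from the table.
stats = {
    "NarrativeQA": (10449, 355),
    "QMSum": (1527, 197),
    "2WikiMultihopQA": (300, 300),
    "SummScreenFD": (336, 336),
    "Passkey": (400, 800),
    "Needle": (400, 800),
}

# Average number of queries that target each document.
queries_per_doc = {name: q / d for name, (q, d) in stats.items()}

# Synthetic tasks: 50 test samples for each of the 8 context lengths.
context_lengths = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
samples_per_length = 50
synthetic_queries = len(context_lengths) * samples_per_length

for name, ratio in queries_per_doc.items():
    print(f"{name}: {ratio:.2f} queries per document")
print("synthetic queries per task:", synthetic_queries)
```

The ratios show how differently the tasks are shaped: NarrativeQA has many queries per document, while 2WikiMultihopQA and SummScreenFD have exactly one.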

Citation information

  • Paper: LongEmbed: Extending Embedding Models for Long Context Retrieval

  • Authors: Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

  • Publication year: 2024

  • Citation format:

    @article{zhu2024longembed,
      title={LongEmbed: Extending Embedding Models for Long Context Retrieval},
      author={Zhu, Dawei and Wang, Liang and Yang, Nan and Song, Yifan and Wu, Wenhao and Wei, Furu and Li, Sujian},
      journal={arXiv preprint arXiv:2404.12096},
      year={2024}
    }
