leideng/longbench-view

Name: leideng/longbench-view
Creator: leideng
Published: 2026-04-11 10:19:57
License: No description available

Source: Hugging Face. Updated: 2026-04-11. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/leideng/longbench-view


Resource description:

---
task_categories:
- question-answering
- text-generation
- summarization
- text-classification
language:
- en
- zh
tags:
- Long Context
size_categories:
- 1K<n<10K
configs:
- config_name: 2wikimqa
  default: true
  data_files:
  - split: test
    path: "data/2wikimqa.jsonl"
- config_name: 2wikimqa_e
  data_files:
  - split: test
    path: "data/2wikimqa_e.jsonl"
- config_name: dureader
  data_files:
  - split: test
    path: "data/dureader.jsonl"
- config_name: gov_report
  data_files:
  - split: test
    path: "data/gov_report.jsonl"
- config_name: gov_report_e
  data_files:
  - split: test
    path: "data/gov_report_e.jsonl"
- config_name: hotpotqa
  data_files:
  - split: test
    path: "data/hotpotqa.jsonl"
- config_name: hotpotqa_e
  data_files:
  - split: test
    path: "data/hotpotqa_e.jsonl"
- config_name: lcc_e
  data_files:
  - split: test
    path: "data/lcc_e.jsonl"
- config_name: lcc
  data_files:
  - split: test
    path: "data/lcc.jsonl"
- config_name: lsht
  data_files:
  - split: test
    path: "data/lsht.jsonl"
- config_name: multifieldqa_en
  data_files:
  - split: test
    path: "data/multifieldqa_en.jsonl"
- config_name: multifieldqa_en_e
  data_files:
  - split: test
    path: "data/multifieldqa_en_e.jsonl"
- config_name: multifieldqa_zh
  data_files:
  - split: test
    path: "data/multifieldqa_zh.jsonl"
- config_name: multi_news_e
  data_files:
  - split: test
    path: "data/multi_news_e.jsonl"
- config_name: multi_news
  data_files:
  - split: test
    path: "data/multi_news.jsonl"
- config_name: musique
  data_files:
  - split: test
    path: "data/musique.jsonl"
- config_name: narrativeqa
  data_files:
  - split: test
    path: "data/narrativeqa.jsonl"
- config_name: passage_count
  data_files:
  - split: test
    path: "data/passage_count.jsonl"
- config_name: passage_count_e
  data_files:
  - split: test
    path: "data/passage_count_e.jsonl"
- config_name: passage_retrieval_en
  data_files:
  - split: test
    path: "data/passage_retrieval_en.jsonl"
- config_name: passage_retrieval_en_e
  data_files:
  - split: test
    path: "data/passage_retrieval_en_e.jsonl"
- config_name: passage_retrieval_zh
  data_files:
  - split: test
    path: "data/passage_retrieval_zh.jsonl"
- config_name: qasper
  data_files:
  - split: test
    path: "data/qasper.jsonl"
- config_name: qasper_e
  data_files:
  - split: test
    path: "data/qasper_e.jsonl"
- config_name: qmsum
  data_files:
  - split: test
    path: "data/qmsum.jsonl"
- config_name: repobench-p
  data_files:
  - split: test
    path: "data/repobench-p.jsonl"
- config_name: repobench-p_e
  data_files:
  - split: test
    path: "data/repobench-p_e.jsonl"
- config_name: samsum_e
  data_files:
  - split: test
    path: "data/samsum_e.jsonl"
- config_name: samsum
  data_files:
  - split: test
    path: "data/samsum.jsonl"
- config_name: trec_e
  data_files:
  - split: test
    path: "data/trec_e.jsonl"
- config_name: trec
  data_files:
  - split: test
    path: "data/trec.jsonl"
- config_name: triviaqa
  data_files:
  - split: test
    path: "data/triviaqa.jsonl"
- config_name: triviaqa_e
  data_files:
  - split: test
    path: "data/triviaqa_e.jsonl"
- config_name: vcsum
  data_files:
  - split: test
    path: "data/vcsum.jsonl"
---

# Introduction

**LongBench** is the first benchmark for bilingual, multitask, and comprehensive assessment of the **long context understanding** capabilities of large language models. LongBench covers two languages (Chinese and English) to provide a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench is composed of six major categories and twenty-one different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks and code completion.

We are fully aware of the potentially high costs involved in model evaluation, especially in long context scenarios (such as manual annotation costs or API call costs). Therefore, we adopt a fully automated evaluation method, aimed at measuring the model's ability to understand long contexts at the lowest cost.

LongBench includes 14 English tasks, 5 Chinese tasks, and 2 code tasks, with the average length of most tasks ranging from 5k to 15k, and a total of 4,750 test instances. For detailed statistics and construction methods of LongBench tasks, please refer to [task.md](task.md). In addition, we provide LongBench-E, a test set with a more uniform length distribution constructed by uniform sampling, with comparable amounts of data in the 0-4k, 4k-8k, and 8k+ length intervals, to support analysis of how model performance varies with input length.

Github Repo for LongBench: https://github.com/THUDM/LongBench

Arxiv Paper for LongBench: https://arxiv.org/pdf/2308.14508.pdf

# How to use it?

#### Loading Data

```python
from datasets import load_dataset

# The 21 LongBench task names; each one is a dataset config.
datasets = ["narrativeqa", "qasper", "multifieldqa_en", "multifieldqa_zh", "hotpotqa", "2wikimqa", "musique",
            "dureader", "gov_report", "qmsum", "multi_news", "vcsum", "trec", "triviaqa", "samsum", "lsht",
            "passage_count", "passage_retrieval_en", "passage_retrieval_zh", "lcc", "repobench-p"]

for dataset in datasets:
    # Every config exposes a single "test" split.
    data = load_dataset('THUDM/LongBench', dataset, split='test')
```

Similarly, you can load the **LongBench-E** data:

```python
from datasets import load_dataset

# The 13 tasks that have a LongBench-E variant (config name ends in "_e").
datasets = ["qasper", "multifieldqa_en", "hotpotqa", "2wikimqa", "gov_report", "multi_news", "trec",
            "triviaqa", "samsum", "passage_count", "passage_retrieval_en", "lcc", "repobench-p"]

for dataset in datasets:
    data = load_dataset('THUDM/LongBench', f"{dataset}_e", split='test')
```

Alternatively, you can download the folder from [this link](https://huggingface.co/datasets/THUDM/LongBench/resolve/main/data.zip) to load the data.

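If you go the download route, each task is stored as a JSON Lines file; the config header lists paths such as `data/2wikimqa.jsonl`. The following is a minimal sketch of reading one such file with only the standard library. The helper name `read_jsonl` and the local `data/` directory are assumptions made for illustration, not part of the LongBench tooling.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location of an extracted task file; adjust to your setup.
    task_file = Path("data") / "narrativeqa.jsonl"
    if task_file.exists():
        records = list(read_jsonl(task_file))
        print(len(records), "records; fields:", sorted(records[0]))
```

Reading line by line keeps memory use low until you decide to materialize the records, which can matter because individual contexts are long.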
#### Data Format

All data in **LongBench** (LongBench-E) are standardized to the following format:

```json
{
    "input": "The input/command for the task, usually short, such as questions in QA, queries in Few-shot tasks, etc",
    "context": "The long context required for the task, such as documents, cross-file code, few-shot examples in Few-shot tasks",
    "answers": "A List of all true answers",
    "length": "Total length of the first three items (counted in characters for Chinese and words for English)",
    "dataset": "The name of the dataset to which this piece of data belongs",
    "language": "The language of this piece of data",
    "all_classes": "All categories in classification tasks, null for non-classification tasks",
    "_id": "Random id for each piece of data"
}
```

#### Evaluation

This repository provides data download for LongBench. If you wish to use this dataset for automated evaluation, please refer to our [github](https://github.com/THUDM/LongBench).

# Task statistics

| Task | Task Type | Eval metric | Avg len | Language | \#Sample |
| :-------- | :-----------: | :-----------: | :-------: | :-----------: | :--------: |
| HotpotQA | Multi-doc QA | F1 | 9,151 | EN | 200 |
| 2WikiMultihopQA | Multi-doc QA | F1 | 4,887 | EN | 200 |
| MuSiQue | Multi-doc QA | F1 | 11,214 | EN | 200 |
| DuReader | Multi-doc QA | Rouge-L | 15,768 | ZH | 200 |
| MultiFieldQA-en | Single-doc QA | F1 | 4,559 | EN | 150 |
| MultiFieldQA-zh | Single-doc QA | F1 | 6,701 | ZH | 200 |
| NarrativeQA | Single-doc QA | F1 | 18,409 | EN | 200 |
| Qasper | Single-doc QA | F1 | 3,619 | EN | 200 |
| GovReport | Summarization | Rouge-L | 8,734 | EN | 200 |
| QMSum | Summarization | Rouge-L | 10,614 | EN | 200 |
| MultiNews | Summarization | Rouge-L | 2,113 | EN | 200 |
| VCSUM | Summarization | Rouge-L | 15,380 | ZH | 200 |
| TriviaQA | Few shot | F1 | 8,209 | EN | 200 |
| SAMSum | Few shot | Rouge-L | 6,258 | EN | 200 |
| TREC | Few shot | Accuracy | 5,177 | EN | 200 |
| LSHT | Few shot | Accuracy | 22,337 | ZH | 200 |
| PassageRetrieval-en | Synthetic | Accuracy | 9,289 | EN | 200 |
| PassageCount | Synthetic | Accuracy | 11,141 | EN | 200 |
| PassageRetrieval-zh | Synthetic | Accuracy | 6,745 | ZH | 200 |
| LCC | Code | Edit Sim | 1,235 | Python/C#/Java | 500 |
| RepoBench-P | Code | Edit Sim | 4,206 | Python/Java | 500 |

> Note: In order to avoid discrepancies caused by different tokenizers, we use the word count (using Python's split function) to calculate the average length of English datasets and code datasets, and use the character count to calculate the average length of Chinese datasets.

# Task description

| Task | Task Description |
| :---------------- | :----------------------------------------------------------- |
| HotpotQA | Answer related questions based on multiple given documents |
| 2WikiMultihopQA | Answer related questions based on multiple given documents |
| MuSiQue | Answer related questions based on multiple given documents |
| DuReader | Answer related Chinese questions based on multiple retrieved documents |
| MultiFieldQA-en | Answer English questions based on a long article, which comes from a relatively diverse field |
| MultiFieldQA-zh | Answer Chinese questions based on a long article, which comes from a relatively diverse field |
| NarrativeQA | Answer questions based on stories or scripts, including understanding of important elements such as characters, plots, themes, etc. |
| Qasper | Answer questions based on an NLP research paper; questions are proposed and answered by NLP practitioners |
| GovReport | A summarization task that requires summarizing government work reports |
| MultiNews | A multi-document summarization task that requires summarizing multiple news articles |
| QMSum | A summarization task that requires summarizing meeting records based on user queries |
| VCSUM | A summarization task that requires summarizing Chinese meeting records |
| SAMSum | A dialogue summarization task, providing several few-shot examples |
| TriviaQA | Single document question answering task, providing several few-shot examples |
| NQ | Single document question answering task, providing several few-shot examples |
| TREC | A classification task that requires categorizing questions, includes 50 categories in total |
| LSHT | A Chinese classification task that requires categorizing news, includes 24 categories in total |
| PassageRetrieval-en | Given 30 English Wikipedia paragraphs, determine which paragraph the given summary corresponds to |
| PassageCount | Determine the total number of different paragraphs in a given repetitive article |
| PassageRetrieval-zh | Given several Chinese paragraphs from the C4 data set, determine which paragraph the given abstract corresponds to |
| LCC | Given a long piece of code, predict the next line of code |
| RepoBench-P | Given code in multiple files within a GitHub repository (including cross-file dependencies), predict the next line of code |

# Task construction

> Note: For all tasks constructed from existing datasets, we use data from the validation or test set of the existing dataset (except for VCSUM).

- The tasks of [HotpotQA](https://hotpotqa.github.io/), [2WikiMultihopQA](https://aclanthology.org/2020.coling-main.580/), [MuSiQue](https://arxiv.org/abs/2108.00573), and [DuReader](https://github.com/baidu/DuReader) are built based on the original datasets and processed to be suitable for long context evaluation. Specifically, for questions in the validation set, we select the evidence passage that contains the answer and several distracting articles. These articles together with the original question constitute the input of the tasks.
- The tasks of MultiFieldQA-zh and MultiFieldQA-en consist of long article data from about 10 sources, including LaTeX papers, judicial documents, government work reports, and PDF documents indexed by Google. For each long article, we invited several PhD and master's students to annotate, i.e., to ask questions based on the long article and give the correct answers. To better automate evaluation, we asked the annotators to propose questions with definitive answers as much as possible.
- The tasks of [NarrativeQA](https://arxiv.org/pdf/1712.07040.pdf), [Qasper](https://arxiv.org/pdf/2105.03011.pdf), [GovReport](https://arxiv.org/pdf/2104.02112.pdf), [QMSum](https://arxiv.org/pdf/2104.05938.pdf) and [MultiNews](https://aclanthology.org/P19-1102.pdf) directly use the data provided by the original papers. In the specific construction, we use the template provided by [ZeroSCROLLS](https://www.zero.scrolls-benchmark.com/) to convert the corresponding data into pure text input.
- The [VCSUM](https://arxiv.org/abs/2305.05280) task is built based on the original dataset, and we design a corresponding template to convert the corresponding data into pure text input.
- The [TriviaQA](https://nlp.cs.washington.edu/triviaqa/) task is constructed in the manner of [CoLT5](https://arxiv.org/abs/2303.09752), which provides several examples of question answering based on documents, and requires the language model to answer related questions based on new documents.
- The tasks of [SAMSum](https://aclanthology.org/D19-5409.pdf), [TREC](https://aclanthology.org/C02-1150.pdf) and [LSHT](http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf) are built based on the original datasets. For each question in the validation set, we sample several examples from the training set to form few-shot examples. These examples together with the questions in the validation set constitute the input for this task.
- The PassageRetrieval-en task is constructed based on English Wikipedia. For each piece of data, we randomly sample 30 paragraphs from English Wikipedia and select one for summarization (using GPT-3.5-Turbo). This task requires the model to give the original paragraph name to which the summary corresponds.
- The PassageCount task is constructed based on English Wikipedia. For each piece of data, we randomly sample several passages from English Wikipedia, repeat each paragraph at random several times, and finally shuffle the paragraphs. This task requires the model to determine the total number of different paragraphs in the given context.
- The PassageRetrieval-zh task is constructed based on [C4](https://arxiv.org/abs/1910.10683). For each piece of data, we randomly sample several Chinese paragraphs from C4 and select one of them for summarization (using GPT-3.5-Turbo). This task requires the model to give the original paragraph name to which the summary corresponds.
- For the [LCC](https://arxiv.org/abs/2306.14893) task, we sample from the original code completion dataset. In the [RepoBench-P](https://arxiv.org/abs/2306.03091) task, we select the most challenging XF-F (Cross-File-First) setting from the original dataset and refer to the Oracle-Filled scenario in the paper. For each original piece of data, we randomly extract multiple cross-file code snippets, including the gold cross-file code snippet, and concatenate them as input, requiring the model to effectively use cross-file code for completion.

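The PassageCount recipe in the list above (sample passages, repeat each a random number of times, shuffle) can be sketched as follows. This is an illustration of the described procedure, not the authors' construction script: the function name, the `max_repeats` bound and the fixed `seed` are all assumptions.

```python
import random


def build_passage_count_context(passages, max_repeats=3, seed=0):
    """Repeat each unique passage a random number of times, then shuffle.

    Returns (context_paragraphs, answer), where answer is the number of
    distinct passages. `max_repeats` and `seed` are illustrative parameters.
    """
    rng = random.Random(seed)
    unique = list(dict.fromkeys(passages))  # drop accidental duplicates, keep order
    repeated = []
    for passage in unique:
        # Each passage appears at least once and at most max_repeats times.
        repeated.extend([passage] * rng.randint(1, max_repeats))
    rng.shuffle(repeated)
    return repeated, len(unique)


paragraphs, answer = build_passage_count_context(["A", "B", "C"])
print(answer)  # 3
```

The gold answer is the count of distinct passages, so it does not depend on how many repeats the random generator happens to draw.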
# LongBench-E statistics

| Task | Task Type | \#data in 0-4k | \#data in 4-8k | \#data in 8k+ |
| :--------- | :-----------: | :-----------: | :---------: | :-------------: |
| HotpotQA | Multi-doc QA | 100 | 100 | 100 |
| 2WikiMultihopQA | Multi-doc QA | 100 | 100 | 100 |
| MultiFieldQA-en | Single-doc QA | 67 | 70 | 13 |
| Qasper | Single-doc QA | 100 | 100 | 24 |
| GovReport | Summarization | 100 | 100 | 100 |
| MultiNews | Summarization | 100 | 100 | 94 |
| TriviaQA | Few shot | 100 | 100 | 100 |
| SAMSum | Few shot | 100 | 100 | 100 |
| TREC | Few shot | 100 | 100 | 100 |
| PassageRetrieval-en | Synthetic | 100 | 100 | 100 |
| PassageCount | Synthetic | 100 | 100 | 100 |
| LCC | Code | 100 | 100 | 100 |
| RepoBench-P | Code | 100 | 100 | 100 |

# Citation

```
@misc{bai2023longbench,
      title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
      author={Yushi Bai and Xin Lv and Jiajie Zhang and Hongchang Lyu and Jiankai Tang and Zhidian Huang and Zhengxiao Du and Xiao Liu and Aohan Zeng and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
      year={2023},
      eprint={2308.14508},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

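Provided by:

leideng

Collected summary

Dataset introduction

Construction method

LongBench is assembled from a range of task sources through a structured construction process. Most tasks are rebuilt from existing public datasets: for example, samples are drawn from the validation or test sets of HotpotQA and 2WikiMultihopQA and combined with distractor documents to fit a long-context setting. For some tasks, such as MultiFieldQA, PhD and master's students were invited to annotate long articles from sources such as academic papers and judicial documents, writing questions with definitive answers. In addition, GPT-3.5-Turbo is used to generate the summaries for the passage retrieval tasks, and uniform sampling is used to build LongBench-E, a subset with a more even length distribution, so that model performance can be analyzed systematically across input lengths.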
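To make the length-interval analysis concrete, here is a small sketch that groups per-record scores by the 0-4k, 4k-8k and 8k+ intervals using each record's `length` field. The function names are hypothetical, and so is the treatment of the exact boundary values (4,000 and 8,000 fall into the upper interval here), since the dataset card only names the intervals.

```python
from collections import defaultdict


def length_bucket(length):
    """Map a record's `length` value to a LongBench-E style interval.

    Boundary handling is an assumption: 4000 goes to "4-8k", 8000 to "8k+".
    """
    if length < 4000:
        return "0-4k"
    if length < 8000:
        return "4-8k"
    return "8k+"


def mean_score_by_bucket(records, scores):
    """Average per-record scores within each length interval."""
    grouped = defaultdict(list)
    for record, score in zip(records, scores):
        grouped[length_bucket(record["length"])].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in grouped.items()}
```

Calling `mean_score_by_bucket(records, scores)` with one score per record returns a dictionary keyed by interval, which makes a drop in quality at longer inputs easy to spot.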

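Usage

Each LongBench task can be loaded directly with the Hugging Face `datasets` library. Specify the task name, such as 'narrativeqa' or 'qasper', and call `load_dataset` to get the test split. Every sample follows the same format, with fields for the input instruction, the long context, the list of reference answers, the text length, the task name and the language. For automated evaluation, use the evaluation scripts in the project's GitHub repository so that metrics are computed consistently. Once the data is loaded, feed it to the language model under test and compare the model's outputs with the reference answers to assess its overall long-context understanding.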
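As a rough illustration of comparing an output with the reference answers, the sketch below computes a generic token-overlap F1 for English text and takes the best score over the `answers` list. It is not the repository's scoring script, which may normalize text differently; the function names and the max-over-references convention are assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, answers):
    """Score a prediction against every reference answer and keep the best."""
    return max((token_f1(prediction, answer) for answer in answers), default=0.0)
```

For published numbers, prefer the official scripts; a sketch like this is mainly useful for quick sanity checks during development.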

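Background and challenges

Background overview

As demand grows for large language models that can handle long-text tasks, evaluating their long-context understanding has become a key topic in natural language processing. LongBench was created in 2023 by a team at Tsinghua University to provide the first comprehensive benchmark for bilingual, multitask long-context understanding. It covers question answering, text generation, summarization and classification in both Chinese and English, with the average length of most tasks between 5k and 15k (words for English, characters for Chinese) and 4,750 test samples in total. Its central research question is how to systematically evaluate a model's semantic understanding, information integration and reasoning in complex long-text scenarios.

Current challenges

LongBench targets several challenges in long-context understanding, including keeping information coherent across very long texts, cross-referencing multiple documents, and handling semantic differences between languages. During construction, the research team faced the high cost of data collection and annotation: long-text resources had to be gathered from diverse real-world scenarios while keeping the task design representative and balanced. Building the automated evaluation framework also meant addressing how to measure the quality of long-text generation and how to reconcile evaluation metrics across tasks, so that model performance can be measured accurately at a controlled cost.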

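Common scenarios

Typical use cases

Long-text understanding is a key dimension for evaluating large language models. LongBench combines six task categories (multi-document QA, single-document QA, summarization, few-shot learning, synthetic tasks and code completion) into a broad long-context evaluation environment. Its typical use is to test systematically how robustly a model handles texts of thousands to tens of thousands of words or characters in cross-lingual, multitask settings. It serves as a standard benchmark for long-range dependency modeling, especially on tasks that combine retrieval, reasoning and generation.

Academic problems addressed

Long-context understanding has long been a core challenge in natural language processing, since it depends on a model's ability to capture semantic relations across long distances. By building a bilingual, multitask long-text evaluation suite, LongBench addresses the limits of earlier benchmarks in text length, task diversity and language coverage. It lets researchers quantify how model performance changes across length intervals and examine how attention mechanisms and memory architectures affect long-text processing, which supports work on more efficient context modeling and model optimization.

Practical applications

In practice, long-text processing underpins scenarios such as legal document analysis, academic paper comprehension, cross-file code maintenance and multi-source news summarization. The tasks in LongBench, including government report summarization, query-based meeting summarization and repository-level code completion, map directly onto needs in office automation, knowledge management and software development assistance. Evaluating on this dataset helps identify where a model needs improvement on long real-world documents, in both the accuracy of extracted information and the coherence of generated text, before it is deployed in complex text environments.

Recent research on the dataset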