LongBench-v2

Name: LongBench-v2
Creator: maas
Published: 2026-05-16 14:39:56
License: No description available

ModelScope community · Updated 2026-05-16 · Indexed 2025-08-02

Download link:

https://modelscope.cn/datasets/ZhipuAI/LongBench-v2


Resource description:

# LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

🌐 Project Page: https://longbench2.github.io 💻 GitHub Repo: https://github.com/THUDM/LongBench 📚 arXiv Paper: https://arxiv.org/abs/2412.15204

LongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring **deep understanding and reasoning** across real-world multitasks. LongBench v2 has the following features:

1. **Length**: Context length ranges from 8k to 2M words, with the majority under 128k.
2. **Difficulty**: Challenging enough that even human experts, using search tools within the document, cannot answer correctly in a short time.
3. **Coverage**: Covers various realistic scenarios.
4. **Reliability**: All questions are in a multiple-choice format for reliable evaluation.

To elaborate, LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repo understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint.

Our evaluation reveals that the best-performing model, when directly answering the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of **enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2**.
**🔍 With LongBench v2, we are eager to find out how scaling inference-time compute will affect deep understanding and reasoning in long-context scenarios. View our 🏆 leaderboard [here](https://longbench2.github.io/#leaderboard) (updating).**

# 🔨 How to use it?

#### Loading Data

You can download and load the **LongBench v2** data through the Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongBench-v2)):

```python
from datasets import load_dataset

dataset = load_dataset('THUDM/LongBench-v2', split='train')
```

Alternatively, you can download the file from [this link](https://huggingface.co/datasets/THUDM/LongBench-v2/resolve/main/data.json) to load the data.

#### Data Format

All data in **LongBench v2** are standardized to the following format:

```json
{
    "_id": "Unique identifier for each piece of data",
    "domain": "The primary domain category of the data",
    "sub_domain": "The specific sub-domain category within the domain",
    "difficulty": "The difficulty level of the task, either 'easy' or 'hard'",
    "length": "The length category of the task, which can be 'short', 'medium', or 'long'",
    "question": "The input/command for the task, usually short, such as questions in QA, queries in many-shot learning, etc",
    "choice_A": "Option A",
    "choice_B": "Option B",
    "choice_C": "Option C",
    "choice_D": "Option D",
    "answer": "The groundtruth answer, denoted as A, B, C, or D",
    "context": "The long context required for the task, such as documents, books, code repositories, etc."
}
```

#### Evaluation

This repository provides data download for LongBench v2. If you wish to use this dataset for automated evaluation, please refer to our [github](https://github.com/THUDM/LongBench).
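As a rough illustration of how the fields above fit together, the sketch below turns one record into a multiple-choice prompt and computes accuracy grouped by a metadata field such as `difficulty` or `length`. It is not the official evaluation code, which lives in the GitHub repository. The prompt wording, the helper names `build_prompt` and `accuracy_by`, the three sample records, and the predictions are all made up for illustration; only the field names come from the data format above.

```python
from collections import defaultdict

# Assumed prompt wording for illustration; NOT the official LongBench v2 template.
PROMPT_TEMPLATE = (
    "Please read the following text and answer the question below.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "(A) {choice_A}\n(B) {choice_B}\n(C) {choice_C}\n(D) {choice_D}\n\n"
    "Answer with a single letter: A, B, C, or D."
)


def build_prompt(record):
    """Fill the template from one record; unused fields are ignored."""
    return PROMPT_TEMPLATE.format(**record)


def accuracy_by(records, predictions, field):
    """Return accuracy per value of `field`.

    `predictions` maps a record's `_id` to the predicted letter.
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = record[field]
        total[key] += 1
        if predictions.get(record["_id"]) == record["answer"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


def make_record(_id, difficulty, length, answer):
    """Build a tiny hypothetical record with the documented fields."""
    return {
        "_id": _id,
        "domain": "Single-Document QA",
        "sub_domain": "example",
        "difficulty": difficulty,
        "length": length,
        "question": f"Placeholder question for {_id}?",
        "choice_A": "first option",
        "choice_B": "second option",
        "choice_C": "third option",
        "choice_D": "fourth option",
        "answer": answer,
        "context": "A very long document would go here.",
    }


# Hypothetical records and model outputs, used only to exercise the helpers.
records = [
    make_record("ex-1", "easy", "short", "B"),
    make_record("ex-2", "hard", "long", "D"),
    make_record("ex-3", "hard", "short", "A"),
]
predictions = {"ex-1": "B", "ex-2": "A", "ex-3": "A"}

prompt = build_prompt(records[0])
by_difficulty = accuracy_by(records, predictions, "difficulty")
by_length = accuracy_by(records, predictions, "length")

print(prompt)
print(by_difficulty)
print(by_length)
```

The same helpers should work on rows loaded with `load_dataset`, since each row exposes these fields as a dictionary. Grouping by `difficulty` and `length` is a natural breakdown because both fields are part of every record.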
# Dataset Statistics

<img width="60%" alt="data_instance" src="https://cdn-uploads.huggingface.co/production/uploads/64ed568ccf6118a9379a61b8/6i10a4KKy5WS2xGAQ8h9E.png">

<img width="70%" alt="data_instance" src="https://cdn-uploads.huggingface.co/production/uploads/64ed568ccf6118a9379a61b8/qWMf-xKmX17terdKxu9oa.png">

# Citation

```
@article{bai2024longbench2,
  title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
  author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
  journal={arXiv preprint arXiv:2412.15204},
  year={2024}
}
```


Provider:

maas

Created:

2025-07-30

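Collected summary

Dataset introduction

Background overview

LongBench-v2 is a dataset designed to evaluate the deep understanding and reasoning abilities of large language models in long-context scenarios. It contains 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words, covering six realistic task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. The data were collected from nearly 100 highly educated professionals, and human experts reach only 53.7% accuracy under a 15-minute time limit, which underscores the benchmark's difficulty and practicality.

The summary above was collected and generated by 遇见数据集.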
