THUDM/LongBench-v2
Source: Hugging Face. Updated 2024-12-20; indexed 2024-12-21.
Download link:
https://hf-mirror.com/datasets/THUDM/LongBench-v2
Description:
LongBench v2 is designed to assess the ability of LLMs to handle long-context problems that require deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repo understanding, and long structured data understanding. The dataset is challenging enough that even human experts, using search tools within the document, cannot answer correctly in a short time. Quality and difficulty are ensured through both automated and manual review processes; under a 15-minute time constraint, human experts achieve only 53.7% accuracy. The best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute in tackling the long-context challenges in LongBench v2.
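The accuracy figures quoted above can be related with a little arithmetic. The following is a minimal sketch, not the benchmark's official scoring code: the helper names `accuracy` and `gap_in_points` are hypothetical, and it assumes each answer is a single choice letter compared by exact match. It shows that the gap between o1-preview and the human baseline is a difference in accuracy, measured in percentage points.

```python
def accuracy(predictions, answers):
    """Percentage of multiple-choice predictions that match the gold letters.

    Assumption: each prediction and answer is a single choice letter and a
    prediction counts as correct only on exact match.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Accuracy figures (in percent) quoted in the description above.
HUMAN_ACCURACY = 53.7
BEST_DIRECT_ACCURACY = 50.1
O1_PREVIEW_ACCURACY = 57.7


def gap_in_points(model_acc, baseline_acc=HUMAN_ACCURACY):
    """Difference from the baseline in percentage points, rounded to 0.1."""
    return round(model_acc - baseline_acc, 1)


if __name__ == "__main__":
    print(gap_in_points(O1_PREVIEW_ACCURACY))   # o1-preview vs. human experts
    print(gap_in_points(BEST_DIRECT_ACCURACY))  # best direct-answer model vs. human experts
```

A negative gap means the model scores below the human baseline under the 15-minute constraint.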
Provider:
THUDM



