
hyg444/LongBench

Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/hyg444/LongBench
Description:
LongBench is the first benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. It includes tasks in two languages (Chinese and English) to give a more complete picture of large models' multilingual capabilities on long contexts. LongBench comprises six major categories and 21 different tasks, covering key long-text application scenarios: single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. We are fully aware of the potentially high costs of model evaluation, especially in long-context scenarios (such as manual annotation costs or API call costs). We therefore adopt a fully automated evaluation method, aimed at measuring a model's ability to understand long contexts at the lowest cost. LongBench includes 14 English tasks, 5 Chinese tasks, and 2 code tasks. The average length of most tasks ranges from 5k to 15k, and there are 4,750 test instances in total.
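The description mentions fully automated evaluation but does not say which metrics are used. As an illustration only, the sketch below shows token-level F1, one common automatic metric for QA-style tasks. The function name `token_f1` and the whitespace tokenization are assumptions made here, not part of the dataset. Whitespace splitting would not be suitable for the Chinese tasks, which would need a different tokenizer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer.

    Hypothetical helper for illustration; not taken from the dataset's
    own evaluation code. Tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counts min(pred, ref) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A partially correct answer gets partial credit without human grading.
    print(token_f1("the cat", "the cat sat"))
```

Because a metric like this needs only the model output and a reference string, it can be run over every test instance without manual annotation, which is the cost argument the description makes.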
Provided by:
hyg444