LIBRA

Name: LIBRA
Creator: maas
Published: 2026-05-16 20:05:17
License: No description available

ModelScope community. Updated: 2026-05-16. Indexed: 2025-05-31.

Download link:

https://modelscope.cn/datasets/ai-forever/LIBRA


Resource description:

# LIBRA: Long Input Benchmark for Russian Analysis

<p align="center"> <img src="logo.png" width="500" /> </p>

## LIBRA

LIBRA (Long Input Benchmark for Russian Analysis) is designed to evaluate the capabilities of large language models (LLMs) in understanding and processing long texts in Russian. This benchmark includes 18 datasets adapted for different tasks and complexities. The tasks are divided into four complexity groups and allow evaluation across various context lengths ranging from 4k up to 512k tokens.

For model comparison and results, see the [LIBRA Leaderboard](https://huggingface.co/spaces/ai-forever/LIBRA-Leaderboard). The benchmark is described in detail in our [paper](https://arxiv.org/abs/2408.02439).

**NOTE:** This is a new benchmark version released in May 2026. The original version (described in the paper) can be found [here](https://huggingface.co/datasets/ai-forever/LIBRA_old). We strongly encourage using the current version as it contains cleaned and extended datasets with ensured data quality.

## LIBRA Mini

Running a full LIBRA evaluation can be prohibitively expensive and time-consuming due to the large number of datasets and long context lengths involved. Moreover, some of the included tasks have become less informative as benchmarks, with modern models achieving near-saturated scores on them.

To address this, we introduce **LIBRA Mini**, a compact, curated subset of 6 datasets selected from the full benchmark.
These datasets represent the most challenging and diagnostically informative tasks in LIBRA, covering diverse task types and complexity levels (see [Task Description](#task-description) for more information):

- **ruBABILongQA3**: multi-fact reasoning over long contexts
- **ruSciPassageCount**: counting unique paragraphs in extended scientific texts
- **LibrusecMHQA**: multi-hop QA with information spread across multiple text parts
- **LongContextMultiQ**: multi-hop QA based on Wikidata and Wikipedia
- **ru2WikiMultihopQA**: multi-hop reasoning across multiple Wikipedia articles
- **MatreshkaNames**: identifying persons in dialogues based on discussed topics

LIBRA Mini uses the same **Exact Match (EM)** metric and evaluation methodology as the full benchmark. Results for LIBRA Mini are reported in a dedicated section on the [leaderboard](https://huggingface.co/spaces/ai-forever/LIBRA-Leaderboard).

We recommend **using LIBRA Mini as the primary evaluation suite for model comparisons**, while the full LIBRA benchmark remains available for comprehensive analysis.

## Dataset Structure

<p align="center"> <img src="libra_structure.svg" width="600" /> </p>

The datasets are divided into subsets based on context lengths. The table below shows the number of examples per context length for each dataset. Datasets included in **LIBRA Mini** are highlighted in bold. Note that not all datasets cover the full range of context lengths; some are designed for specific length ranges that best suit their task type.

| **Task** | **4k** | **8k** | **16k** | **32k** | **64k** | **128k** | **256k** | **512k** | **Total** |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| *Group I* | | | | | | | | | |
| passkey | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | **1600** |
| passkey_with_librusec | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | **1600** |
| *Group II* | | | | | | | | | |
| librusec_history | - | 32 | 32 | 32 | 32 | - | - | - | **128** |
| **matreshka_names** | 150 | 150 | 145 | 150 | 45 | - | - | - | **640** |
| matreshka_yes_no | 300 | 300 | 300 | 300 | 290 | 280 | - | - | **1770** |
| ru_quality | - | 18 | 184 | - | - | - | - | - | **202** |
| ru_sci_abstract_retrieval | 209 | 210 | 210 | 206 | 185 | 200 | 200 | - | **1420** |
| ru_sci_fi | - | - | - | 216 | 213 | - | - | - | **429** |
| ru_tpo | - | 900 | - | - | - | - | - | - | **900** |
| *Group III* | | | | | | | | | |
| **librusec_mhqa** | - | 384 | - | - | - | - | - | - | **384** |
| **long_context_multiq** | 158 | 121 | 83 | 109 | 41 | - | - | - | **512** |
| **ru_2wikimultihopqa** | - | 147 | 384 | 369 | - | - | - | - | **900** |
| ru_babilong_qa1 | 99 | 99 | 99 | 94 | 91 | 99 | 200 | 98 | **879** |
| ru_babilong_qa2 | 85 | 76 | 77 | 73 | 65 | 99 | 200 | 98 | **773** |
| **ru_babilong_qa3** | 60 | 68 | 69 | 65 | 65 | 100 | 198 | 98 | **723** |
| ru_babilong_qa4 | 78 | 85 | 83 | 87 | 75 | 99 | 200 | 98 | **805** |
| ru_babilong_qa5 | 99 | 99 | 98 | 96 | 96 | 99 | 200 | 98 | **885** |
| *Group IV* | | | | | | | | | |
| **ru_sci_passage_count** | 104 | 99 | 100 | 87 | 91 | 90 | 43 | 60 | **674** |

## Task Description

The benchmark tasks are organized into four complexity groups, ranging from simple retrieval to complex reasoning. The grouping reflects both the cognitive difficulty of the task and the degree to which models must integrate information across the full context window.
Group I serves as a basic sanity check, verifying that a model can process long inputs at all. Groups II and III progressively require deeper language understanding, multi-step reasoning, and the ability to locate and combine information from distant parts of the context. Group IV represents the most demanding tasks, requiring complex reasoning that goes beyond standard question answering formats. The total score on the leaderboard is computed across all four groups.

### Group I: Simple Information Retrieval (sanity check)

This group contains the simplest tasks, which check that a model can handle inputs of this length at all.

- **Passkey**: Extract a relevant code number from a long text fragment. Based on the original [PassKey test](https://github.com/CStanKonrad/long_llama/blob/main/examples/passkey.py) from the LongLLaMA GitHub repo.
- **PasskeyWithLibrusec**: Similar to Passkey but with added noise from Librusec texts.

### Group II: Question Answering and Multiple Choice

This group consists of standard QA and multiple-choice tasks adapted for the long-context setting.

- **MatreshkaNames**: Identify the person in dialogues based on the discussed topic. We used the [Matreshka](https://huggingface.co/datasets/zjkarina/matreshka) dataset and the [Russian Names](https://www.kaggle.com/datasets/rai220/russian-cyrillic-names-and-sex/data) dataset to create this task and the next one.
- **MatreshkaYesNo**: Indicate whether a specific topic was mentioned in the dialogue.
- **LibrusecHistory**: Answer questions based on historical texts. Conceptually similar to the [PassageRetrieval dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_retrieval_en) from LongBench.
- **ruSciFi**: Answer true/false based on context and general world knowledge. Translation of the [SciFi dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/sci_f) from L-Eval, which was originally based on [SF-Gram](https://github.com/nschaetti/SFGram-dataset).
- **ruSciAbstractRetrieval**: Retrieve relevant paragraphs from scientific abstracts.
- **ruTPO**: Multiple-choice questions similar to TOEFL exams. Translation of the [TPO dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/tpo) from L-Eval.
- **ruQuALITY**: Multiple-choice QA tasks based on detailed texts. Created by translating the [QuALITY dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/quality) from L-Eval.

### Group III: Multi-hop Question Answering

This group includes long-context multi-hop QA problems where the answer requires combining multiple pieces of information distributed across the context.

- **ruBABILongQA (1-5)**: 5 long-context reasoning tasks for QA using facts hidden among irrelevant information.
- **LongContextMultiQ**: Multi-hop QA based on Wikidata and Wikipedia.
- **LibrusecMHQA**: Multi-hop QA requiring information distributed across several text parts.
- **ru2WikiMultihopQA**: Translation of the [2WikiMultihopQA dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e) from LongBench.

### Group IV: Complex Reasoning and Mathematical Problems

This group includes the most complex long-context tasks, which go beyond multiple-choice and multi-hop QA. Group IV currently comprises only one task, and we invite the community to contribute to it.

- **ruSciPassageCount**: Count unique paragraphs in a long text. Uses the basic idea of the original [PassageCount dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_count) from LongBench.

## Metrics

We use **Exact Match (EM)** as the primary metric for all tasks. **EM** evaluates the accuracy of the model's responses by comparing the predicted answers to the ground truth.

## Changes from the Original Version

This version of LIBRA includes both automatic and manual improvements over the original release. All datasets underwent automatic quality filtering to ensure consistency and reliability of annotations.
In addition, several datasets were manually revised and extended with the help of human annotators: **LibrusecMHQA**, **LongContextMultiQ**, **MatreshkaNames**, **MatreshkaYesNo**, **ru2WikiMultihopQA**, **ruSciFi**, and **ruTPO** received targeted corrections and additional examples.

The datasets **ruGSM100** and **ruQasper** were removed from the benchmark as they did not meet the updated quality criteria.

The maximum supported context length has been extended from 128k to 512k tokens. The following datasets now include examples at longer context lengths not present in the original version: **ruBABILongQA (1–5)**, **ruSciAbstractRetrieval**, **ruSciPassageCount**, **LongContextMultiQ**, **MatreshkaYesNo**, **Passkey**, and **PasskeyWithLibrusec**.

## Evaluation

Starting from this version, LIBRA supports evaluation via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), a widely adopted framework for standardized LLM evaluation. Both full **LIBRA** and compact **LIBRA Mini** evaluations are supported.

To get started:

```bash
pip install lm-eval[vllm]
```

Run evaluation on the full LIBRA benchmark:

```bash
lm_eval --model vllm \
    --model_args pretrained=Qwen/Qwen3-30B-A3B,max_model_len=262144 \
    --tasks libra \
    --apply_chat_template \
    --device cuda:0
```

Run evaluation on LIBRA Mini only:

```bash
lm_eval --model vllm \
    --model_args pretrained=Qwen/Qwen3-30B-A3B,max_model_len=262144 \
    --tasks libra_mini \
    --apply_chat_template \
    --device cuda:0
```

For the full list of configuration options and instructions on adding new models, please refer to the [lm-evaluation-harness documentation](https://github.com/EleutherAI/lm-evaluation-harness).
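
The Metrics section describes Exact Match as a comparison between predicted answers and the ground truth. The following is a minimal Python sketch of such a comparison, not the benchmark's actual scoring code. The normalization steps (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the helper names `normalize`, `exact_match`, and `em_score` are assumptions made for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    # string.punctuation covers ASCII only; Russian quotes such as « » are kept.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, references: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Mean exact-match score over a dataset, in the range [0, 1]."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

With this sketch, `exact_match("Москва.", ["москва"])` counts as a match because case and the trailing period are normalized away. A stricter or looser normalization would change the scores, so results from this helper should not be compared directly with leaderboard numbers.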

## Citation

```
@misc{churin2024longinputbenchmarkrussian,
      title={Long Input Benchmark for Russian Analysis},
      author={Igor Churin and Murat Apishev and Maria Tikhonova and Denis Shevelev and Aydar Bulatov and Yuri Kuratov and Sergei Averkiev and Alena Fenogenova},
      year={2024},
      eprint={2408.02439},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.02439},
}
```

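# LIBRA: Long Input Benchmark for Russian Analysis

<img src="logo.png" width="800" />

## Dataset Overview

LIBRA (Long Input Benchmark for Russian Analysis) is designed to evaluate the ability of large language models (LLMs) to understand and process long texts in Russian. The benchmark includes 21 datasets adapted for different tasks and complexities. The tasks are divided into four complexity groups and support evaluation at context lengths from 4k to 128k tokens.

## Dataset Structure

The datasets are divided into subsets by context length: 4k, 8k, 16k, 32k, 64k, and 128k tokens. The number of samples in each subset varies with task complexity.

## Tasks and Complexity Groups

<img src="table.png" width="800" />

### Group I: Simple Information Retrieval

- **Passkey**: Extract a relevant code number from a long text fragment. Based on the original [PassKey test](https://github.com/CStanKonrad/long_llama/blob/main/examples/passkey.py) from the LongLLaMA GitHub repo.
- **PasskeyWithLibrusec**: Similar to Passkey but with added noise from Librusec texts.

### Group II: Question Answering and Multiple Choice

- **MatreshkaNames**: Identify the person in dialogues based on the discussed topic. We used the [Matreshka dataset](https://huggingface.co/datasets/zjkarina/matreshka) and the [Russian Cyrillic names dataset](https://www.kaggle.com/datasets/rai220/russian-cyrillic-names-and-sex/data) to create this task and the next one.
- **MatreshkaYesNo**: Indicate whether a specific topic was mentioned in the dialogue.
- **LibrusecHistory**: Answer questions based on historical texts. Conceptually similar to the [PassageRetrieval dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_retrieval_en) from LongBench.
- **ruTREC**: Few-shot in-context learning for topic classification. Created by translating the [TREC dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/trec_e) from LongBench.
- **ruSciFi**: Answer true/false based on context and general world knowledge. Translation of the [SciFi dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/sci_f) from L-Eval, which was originally based on the [SF-Gram dataset](https://github.com/nschaetti/SFGram-dataset).
- **ruSciAbstractRetrieval**: Retrieve relevant paragraphs from scientific abstracts.
- **ruTPO**: Multiple-choice questions similar to TOEFL exams. Translation of the [TPO dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/tpo) from L-Eval.
- **ruQuALITY**: Multiple-choice QA tasks based on detailed texts. Created by translating the [QuALITY dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/quality) from L-Eval.

### Group III: Multi-hop Question Answering

- **ruBABILongQA**: 5 long-context reasoning QA tasks in which the answer relies on facts hidden among irrelevant information.
- **LongContextMultiQ**: Multi-hop QA based on Wikidata and Wikipedia.
- **LibrusecMHQA**: Multi-hop QA requiring information distributed across several text parts.
- **ru2WikiMultihopQA**: Russian translation of the [2WikiMultihopQA dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/2wikimqa_e) from LongBench.

### Group IV: Complex Reasoning and Mathematical Problems

- **ruSciPassageCount**: Count unique paragraphs in a long text. Uses the basic idea of the [PassageCount dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/passage_count) from LongBench.
- **ruQasper**: Question answering over academic research papers. Created by translating the [Qasper dataset](https://huggingface.co/datasets/THUDM/LongBench/viewer/qasper_e) from LongBench.
- **ruGSM100**: Solve mathematical problems using Chain-of-Thought reasoning. Russian translation of the [GSM100 dataset](https://huggingface.co/datasets/L4NLP/LEval/viewer/gsm100) from L-Eval.

## Metrics

We use two metrics in testing: **Exact Match (EM)** and **F1**. **EM** evaluates the accuracy of the model's responses by comparing the predicted answers to the ground truth. It is particularly suited to tasks that require a precise match, such as question answering and retrieval.

<img src="table2.png" width="800" />

## Citation

@misc{churin2024longinputbenchmarkrussian, title={Long Input Benchmark for Russian Analysis}, author={Igor Churin and Murat Apishev and Maria Tikhonova and Denis Shevelev and Aydar Bulatov and Yuri Kuratov and Sergei Averkiev and Alena Fenogenova}, year={2024}, eprint={2408.02439}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2408.02439}, }

## GitHub Repository

For more details and code, please visit our [GitHub repository](https://github.com/ai-forever/LIBRA/).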
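
The metrics section above names F1 alongside Exact Match but does not define it. One common definition for QA tasks is token-level F1 over the overlap between predicted and reference tokens; the sketch below assumes that definition and simple whitespace tokenization. The function name `token_f1` is hypothetical, and this is not confirmed to be how LIBRA computed F1.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and one reference (assumed definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike Exact Match, this score gives partial credit: a prediction of two tokens that contains a one-token reference has precision 0.5 and recall 1.0, so its F1 is 2/3.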

Provider: maas

Created: 2025-05-26
