RussianNLP/rublimp

Name: RussianNLP/rublimp
Creator: RussianNLP
Published: 2025-04-26 06:26:22
License: Apache 2.0

Source: Hugging Face | Updated: 2025-04-26 | Indexed: 2024-07-06

Download link:

https://hf-mirror.com/datasets/RussianNLP/rublimp

Description:

RuBLiMP is a benchmark dataset for Russian acceptability classification. It contains multiple configurations, each with the features id, source sentence, target sentence, source word, target word, level, phenomenon, PID, subtype, domain, and tree depth. Each configuration has a train split of 1000 examples. The language is Russian, the task type is acceptability classification, and the dataset size is between 10K and 100K examples.

Provider:

RussianNLP

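Original information summary

RuBLiMP dataset overview

Basic information

License: Apache 2.0
Language: Russian
Tags: benchmark
Task type: acceptability-classification
Dataset name: RuBLiMP
Dataset size: 10K < n < 100K

Dataset configurations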

Configuration name: add_new_suffix

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 349051
  - num_examples: 1000
Download size: 153218 bytes
Dataset size: 349051 bytes
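
The feature list above can be turned into a small schema check. The following is a minimal sketch, assuming that `int64` columns arrive as Python `int` and `string` columns as `str`; the helper name `validate_record` and every value in the example record are hypothetical placeholders, not content from the dataset.

```python
from typing import Any, List, Mapping

# Field names and types as listed for the add_new_suffix configuration
# (int64 is mapped to Python int, string to str; this mapping is an assumption).
EXPECTED_FEATURES = {
    "id": int,
    "source_sentence": str,
    "target_sentence": str,
    "source_word": str,
    "target_word": str,
    "level": str,
    "phenomenon": str,
    "PID": str,
    "subtype": str,
    "domain": str,
    "tree_depth": int,
}


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


# Hypothetical record: every value is a placeholder, not real dataset content.
example = {
    "id": 0,
    "source_sentence": "placeholder source sentence",
    "target_sentence": "placeholder target sentence",
    "source_word": "source",
    "target_word": "target",
    "level": "placeholder-level",
    "phenomenon": "placeholder-phenomenon",
    "PID": "placeholder-pid",
    "subtype": "placeholder-subtype",
    "domain": "placeholder-domain",
    "tree_depth": 3,
}

print(validate_record(example))
```

Since every configuration listed here declares the same eleven fields, one check of this kind could be reused across configurations.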

Configuration name: add_verb_prefix

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 325796
  - num_examples: 1000
Download size: 139990 bytes
Dataset size: 325796 bytes

Configuration name: adposition_government

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 333926
  - num_examples: 1000
Download size: 146114 bytes
Dataset size: 333926 bytes

Configuration name: anaphor_agreement_gender

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 497512
  - num_examples: 1000
Download size: 205655 bytes
Dataset size: 497512 bytes

Configuration name: anaphor_agreement_number

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 502871
  - num_examples: 1000
Download size: 222157 bytes
Dataset size: 502871 bytes

Configuration name: change_declension_ending

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 350376
  - num_examples: 1000
Download size: 148612 bytes
Dataset size: 350376 bytes

Configuration name: change_declension_ending_has_dep

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 400435
  - num_examples: 1000
Download size: 164951 bytes
Dataset size: 400435 bytes

Configuration name: change_duration_aspect

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 355088
  - num_examples: 1000
Download size: 134065 bytes
Dataset size: 355088 bytes

Configuration name: change_repetition_aspect

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 434479
  - num_examples: 1000
Download size: 178290 bytes
Dataset size: 434479 bytes

Configuration name: change_verb_conjugation

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 331430
  - num_examples: 1000
Download size: 131965 bytes
Dataset size: 331430 bytes

Configuration name: change_verb_prefixes_order

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 486936
  - num_examples: 1000
Download size: 193967 bytes
Dataset size: 486936 bytes

Configuration name: clause_subj_predicate_agreement_gender

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 382513
  - num_examples: 1000
Download size: 123034 bytes
Dataset size: 382513 bytes

Configuration name: clause_subj_predicate_agreement_number

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 382153
  - num_examples: 1000
Download size: 122369 bytes
Dataset size: 382153 bytes

Configuration name: clause_subj_predicate_agreement_person

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 406739
  - num_examples: 1000
Download size: 133132 bytes
Dataset size: 406739 bytes

Configuration name: conj_verb_tense

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 464440
  - num_examples: 1000
Download size: 199995 bytes
Dataset size: 464440 bytes

Configuration name: deontic_imperative_aspect

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 369950
  - num_examples: 1000
Download size: 140645 bytes
Dataset size: 369950 bytes

Configuration name: external_possessor

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 304621
  - num_examples: 1000
Download size: 116558 bytes
Dataset size: 304621 bytes

Configuration name: floating_quantifier_agreement_case

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 345416
  - num_examples: 1000
Download size: 113129 bytes
Dataset size: 345416 bytes

Configuration name: floating_quantifier_agreement_gender

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 362382
  - num_examples: 1000
Download size: 121666 bytes
Dataset size: 362382 bytes

Configuration name: floating_quantifier_agreement_number

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 423319
  - num_examples: 1000
Download size: 162506 bytes
Dataset size: 423319 bytes

Configuration name: genitive_subj_predicate_agreement_gender

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 368978
  - num_examples: 1000
Download size: 115023 bytes
Dataset size: 368978 bytes

Configuration name: genitive_subj_predicate_agreement_number

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 389125
  - num_examples: 1000
Download size: 125194 bytes
Dataset size: 389125 bytes

Configuration name: genitive_subj_predicate_agreement_person

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 398814
  - num_examples: 1000
Download size: 127526 bytes
Dataset size: 398814 bytes

Configuration name: indefinite_pronoun_to_negative

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:
  - num_bytes: 384859
  - num_examples: 1000
Download size: 151220 bytes
Dataset size: 384859 bytes

Configuration name: negative_concord

Features:
- id: int64
- source_sentence: string
- target_sentence: string
- source_word: string
- target_word: string
- level: string
- phenomenon: string
- PID: string
- subtype: string
- domain: string
- tree_depth: int64
Splits:
- train:

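Collected summary

Dataset introduction

Construction

RuBLiMP was built systematically to evaluate the grammatical competence of Russian language models. It combines manual generation with automatic transformation and targets Russian's rich morphosyntax with contrastive sentence pairs that cover core grammatical categories such as declension, agreement, and verbal aspect. Each configuration focuses on one grammatical phenomenon. Because the difference between the source and target sentence is kept minimal, each data point isolates a single contrast and stays interpretable, which makes the dataset usable as a structured grammatical probe for model evaluation.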
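
The idea of a minimal difference between source and target sentence can be illustrated with a token-level comparison. This is only a sketch under simplifying assumptions: it splits on whitespace, handles only same-length substitutions, and the helper names `token_diff` and `is_single_substitution` are hypothetical. The Russian sentence pair is invented for illustration and is not taken from the dataset.

```python
from typing import List, Optional, Tuple


def token_diff(source: str, target: str) -> Optional[List[Tuple[int, str, str]]]:
    """Return (position, source_token, target_token) for each differing token.

    Returns None when the sentences have different token counts, because this
    simple sketch only covers one-for-one substitutions.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    if len(src_tokens) != len(tgt_tokens):
        return None
    return [
        (i, s, t)
        for i, (s, t) in enumerate(zip(src_tokens, tgt_tokens))
        if s != t
    ]


def is_single_substitution(source: str, target: str) -> bool:
    """True when exactly one token differs between the two sentences."""
    diff = token_diff(source, target)
    return diff is not None and len(diff) == 1


# Invented pair (not from RuBLiMP): the adjective's case ending is changed.
source = "Мы видим большую собаку"
target = "Мы видим большая собаку"

print(token_diff(source, target))
print(is_single_substitution(source, target))
```

A check like this could be compared against the `source_word` and `target_word` fields, although phenomena that reorder or insert words would need a more general alignment.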

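Characteristics

RuBLiMP is systematic and hierarchically organized. It covers more than thirty key phenomena of Russian grammar, such as case agreement, gender and number agreement, and changes of verbal aspect. Each data point is annotated with metadata including the phenomenon type, linguistic level, and tree depth, so the dataset supports fine-grained grammatical error analysis and model diagnostics in addition to acceptability classification. The dataset is moderate in size and its configurations are independent, so researchers can study a specific grammatical dimension in isolation.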
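
One way to use the per-example metadata for diagnostics is to break model accuracy down by a metadata field. The sketch below assumes each evaluated example has been given a boolean `correct` field by some earlier evaluation step; that field, the helper name `accuracy_by`, and the toy values for `phenomenon` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def accuracy_by(results: Iterable[Mapping[str, Any]], key: str) -> Dict[Any, float]:
    """Group evaluated examples by a metadata field and return accuracy per group."""
    counts = defaultdict(lambda: [0, 0])  # group value -> [n_correct, n_total]
    for row in results:
        bucket = counts[row[key]]
        bucket[0] += 1 if row["correct"] else 0
        bucket[1] += 1
    return {
        value: n_correct / n_total
        for value, (n_correct, n_total) in counts.items()
    }


# Toy results with placeholder metadata values, not real dataset content.
toy_results = [
    {"phenomenon": "agreement", "tree_depth": 2, "correct": True},
    {"phenomenon": "agreement", "tree_depth": 3, "correct": False},
    {"phenomenon": "aspect", "tree_depth": 2, "correct": True},
    {"phenomenon": "aspect", "tree_depth": 2, "correct": True},
]

print(accuracy_by(toy_results, "phenomenon"))
print(accuracy_by(toy_results, "tree_depth"))
```

The same function works for any of the listed fields, for example `level`, `subtype`, or `domain`.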

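Usage

The dataset is mainly intended for evaluating the grammatical competence of Russian language models. Users can load a specific configuration from the Hugging Face platform to obtain source sentences, target sentences, and the accompanying grammatical annotations. Typical uses include training or fine-tuning models for grammatical acceptability judgments, as well as zero-shot or few-shot evaluation, where contrastive sentence pairs test a model's sensitivity to subtle grammatical differences in Russian. The data is provided in a standard split format and can be plugged directly into existing evaluation pipelines.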
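
A zero-shot evaluation over contrastive pairs might look like the sketch below. It rests on two assumptions that this page does not confirm: that `source_sentence` is the acceptable member of each pair, and that the model exposes some sentence-level score such as a log-probability. The helper names `load_config` and `pairwise_accuracy` and the toy scorer are hypothetical; `load_dataset` is the standard entry point of the Hugging Face `datasets` library and needs that package plus network access.

```python
from typing import Callable, Iterable, Mapping


def load_config(name: str, split: str = "train"):
    """Load one RuBLiMP configuration from the Hugging Face Hub."""
    # Imported lazily so the rest of this sketch runs without the package.
    from datasets import load_dataset

    return load_dataset("RussianNLP/rublimp", name, split=split)


def pairwise_accuracy(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the source sentence scores higher than the target.

    Assumes source_sentence is the acceptable sentence of each pair.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    n_correct = sum(
        1
        for pair in pairs
        if score(pair["source_sentence"]) > score(pair["target_sentence"])
    )
    return n_correct / len(pairs)


# Toy stand-ins: a real run would pass load_config("add_new_suffix") and a
# scorer backed by a language model.
toy_pairs = [
    {"source_sentence": "a b c", "target_sentence": "a b c d"},
    {"source_sentence": "x y z w", "target_sentence": "x y"},
]


def toy_score(sentence: str) -> float:
    """Placeholder scorer that simply prefers shorter sentences."""
    return -float(len(sentence.split()))


print(pairwise_accuracy(toy_pairs, toy_score))
```

Comparing two scores per pair avoids choosing an acceptability threshold, which is why this style of evaluation is common for minimal-pair benchmarks.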

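Background and challenges

Background overview

In natural language processing, systematic benchmark datasets for evaluating grammatical acceptability in morphologically rich languages such as Russian have long been lacking. RuBLiMP was built in recent years by the Russian NLP research community to fill this gap. It focuses on a fine-grained classification of Russian grammatical phenomena, covering complex structures such as verb conjugation, noun declension, and agreement constraints, and provides a standardized test bed for measuring how well language models have acquired the rules of Russian grammar. Its construction draws on both linguistics and computational linguistics, and it is presented as an important step in moving Russian NLP models from surface statistics toward deeper grammatical understanding.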

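Current challenges

The core challenge RuBLiMP addresses is the systematic evaluation of language models' sensitivity to complex Russian grammatical phenomena, which requires models to go beyond lexical co-occurrence patterns and capture morphosyntactic rules. The main difficulty during construction lies in fine-grained annotation: Russian's rich inflection and agreement constraints require deep linguistic knowledge for accurate classification and alignment. Keeping the coverage of grammatical phenomena balanced and representative, so that annotation bias does not distort model evaluation, is another important concern for the dataset's builders. Together these challenges come down to building a reliable and comprehensive grammatical benchmark that advances deeper linguistic competence in models for morphologically rich languages.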

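Common scenarios

Typical use cases

In computational linguistics, RuBLiMP serves as a benchmark of Russian linguistic phenomena, and its typical use is evaluating how well language models handle complex Russian grammatical structures. By systematically constructing sentence pairs that cover declension, agreement, aspect changes, and other phenomena, it gives researchers a standardized tool for measuring the accuracy of a model's grammatical judgments. In NLP research it is often used to test pretrained models on Russian grammatical acceptability classification, which reveals how well they have internalized the language's rules.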

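Practical applications

In practice, RuBLiMP supports the development of Russian language tools. Grammar checkers trained on the dataset can identify morphosyntactic errors in Russian text, which can improve the reliability of applications such as machine translation and automatic proofreading. In educational technology, it can serve as a component of adaptive language-learning platforms that analyze the grammatical acceptability of learner-produced text and provide personalized feedback on Russian grammar. These applications make Russian NLP technology more practical and support smarter digital resources for Russian-language education.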

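Derived and related work

Work derived from RuBLiMP centers on extending multilingual grammatical evaluation frameworks and on diagnosing model capabilities. For example, researchers have combined it with the English BLiMP benchmark to build a cross-lingual grammatical evaluation setup for comparing language models on morphologically rich languages. Fine-grained analyses based on the dataset have also led to probing tasks for specific Russian grammatical phenomena. This work deepens the understanding of how Transformer architectures encode grammar and offers methodological guidance for developing pretrained models that adapt better to different languages.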

The content above was collected and summarized by 遇见数据集.
