RussianNLP/rucola

Name: RussianNLP/rucola
Creator: RussianNLP
Published: 2024-07-15 09:58:56
License: Apache-2.0

Source: Hugging Face. Updated: 2024-07-15. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/RussianNLP/rucola

Overview:

The Russian Corpus of Linguistic Acceptability (RuCoLA) is a benchmark dataset of 13.4k sentences labeled as acceptable or unacceptable. RuCoLA combines in-domain sentences manually collected from linguistic literature with out-of-domain sentences produced by nine machine translation and paraphrase generation models. The out-of-domain set is intended to facilitate the practical use of acceptability judgments for improving language generation. Each unacceptable sentence is additionally labeled with one of four categories: morphology, syntax, semantics, or hallucination. The dataset is split into training, validation, and test sets; the training set contains in-domain samples only, while the validation and test sets contain both in-domain and out-of-domain samples. The dataset was created through a two-stage annotation procedure that combines crowdsourced labeling on the Toloka platform with fine-grained annotation by students with a linguistic background.

Provided by:

RussianNLP

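Summary of original information

Dataset overview

Dataset name

RuCoLA: Russian Corpus of Linguistic Acceptability

Dataset description

RuCoLA is a benchmark dataset of 13.4k sentences labeled as acceptable or unacceptable. It combines in-domain sentences manually collected from linguistic literature with out-of-domain sentences produced by nine machine translation and paraphrase generation models. Each unacceptable sentence is additionally labeled with one of four standard and machine-specific coarse-grained categories: morphology, syntax, semantics, or hallucination.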

Dataset structure

Task: binary classification.
Metrics: MCC/Acc (Matthews correlation coefficient and accuracy).
Language: Russian.
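
Both metrics listed above can be computed from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, not the benchmark's official scorer; the function names are illustrative, and returning 0.0 when the MCC denominator is zero is an assumed convention.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN, treating label 1 (acceptable) as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Share of sentences whose predicted label matches the gold label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 if the denominator is zero (assumed convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels for illustration only; these are not RuCoLA predictions.
gold = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(accuracy(gold, predicted), mcc(gold, predicted))
```

MCC is often reported alongside accuracy for acceptability tasks because it stays informative when the two classes are imbalanced, whereas accuracy can be inflated by always predicting the majority class.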

Data instances

{
  "id": 19,
  "sentence": "Люк останавливает удачу от этого.",
  "label": 0,
  "error_type": "Hallucination",
  "detailed_source": "WikiMatrix"
}

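Data fields

id (int64): the sentence ID.
sentence (str): the sentence text.
label (str): the target class; "1" means "acceptable" and "0" means "unacceptable".
error_type (str): the coarse-grained violation category (morphology, syntax, semantics, or hallucination); "0" if the sentence is acceptable.
detailed_source: the data source.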
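
The field list can be mirrored in a small typed record. The sketch below parses the sample instance from this page using only the standard library; the `RucolaRecord` class and `parse_record` helper are hypothetical names, not part of any published loader. Because the field description gives the label as a string while the sample instance shows the integer 0, the parser normalizes the label to an integer.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RucolaRecord:
    """One sentence with its acceptability label (hypothetical container)."""
    id: int
    sentence: str
    label: int          # 1 = acceptable, 0 = unacceptable
    error_type: str     # violation category, or "0" for acceptable sentences
    detailed_source: str

    @property
    def is_acceptable(self) -> bool:
        return self.label == 1


def parse_record(raw: str) -> RucolaRecord:
    """Parse one JSON object and normalize the label to an integer."""
    obj = json.loads(raw)
    label = int(obj["label"])  # accepts both "0"/"1" strings and 0/1 integers
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {obj['label']!r}")
    return RucolaRecord(
        id=int(obj["id"]),
        sentence=str(obj["sentence"]),
        label=label,
        error_type=str(obj["error_type"]),
        detailed_source=str(obj["detailed_source"]),
    )


# The sample instance shown on this page.
SAMPLE = (
    '{"id": 19, "sentence": "Люк останавливает удачу от этого.", '
    '"label": 0, "error_type": "Hallucination", "detailed_source": "WikiMatrix"}'
)

record = parse_record(SAMPLE)
print(record.is_acceptable, record.error_type)
```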

Data splits

train: 7869 in-domain samples.
validation: 2787 in-domain and out-of-domain samples.
test: 2789 in-domain and out-of-domain samples.
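
As a quick consistency check, the split sizes listed above add up to the 13.4k total quoted in the description. The snippet below is only arithmetic on the numbers from this page; the variable names are illustrative.

```python
# Split sizes as listed on this page.
SPLIT_SIZES = {"train": 7869, "validation": 2787, "test": 2789}

total = sum(SPLIT_SIZES.values())

# Fraction of the corpus that falls into each split.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total sentences: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```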

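Dataset creation

In-domain subset: sentences and the corresponding authors' acceptability judgments, manually extracted from fundamental linguistic textbooks, academic publications, and methodological materials.
Out-of-domain subset: sentences produced by nine open-source MT and paraphrase generation models.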

Annotation process

Stage 1: acceptability judgments.
Stage 2: violation categories.

License

Apache-2.0
