RuCoLA

Source: GitHub; updated 2023-03-28; indexed 2024-05-31

Download link:

https://github.com/RussianNLP/RuCoLA

Resource overview:

RuCoLA is a dataset of Russian sentences labeled with binary acceptability judgments. It combines expert-written sentences taken from linguistic publications with machine-generated examples. The dataset covers linguistic phenomena ranging from syntax and semantics to generative model hallucinations. It aims to advance methods for identifying errors in natural language, and it maintains a public leaderboard to track progress on this problem.

Created:

2022-05-11

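Summary of source information

RuCoLA dataset overview

Dataset description

RuCoLA (Russian Corpus of Linguistic Acceptability) is a dataset of Russian sentences annotated with binary acceptability judgments. It contains expert-written sentences from linguistic publications and machine-generated examples.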

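Dataset contents

Source types:
- Expert-written sentences: taken from linguistic publications and covering linguistic phenomena such as syntax and semantics.
- Machine-generated sentences: produced by generative models from source corpora including, but not limited to, Tatoeba, WikiMatrix, TED, and YandexCorpus; these cover errors such as generative model hallucinations.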

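Dataset purpose

RuCoLA aims to advance methods for identifying errors in natural language, and it provides a public leaderboard (https://rucola-benchmark.com/) to track progress on this problem.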

Dataset availability

The dataset is available on HuggingFace and in the data/ folder of the GitHub repository.
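As a rough illustration of working with a local copy, the sketch below parses one CSV split and counts its labels. The column names `sentence` and `acceptable` are assumptions about the file layout, not details stated on this page, and the sample rows are invented; check both against the actual files.

```python
import csv
import io
from collections import Counter


def read_examples(handle, text_col="sentence", label_col="acceptable"):
    """Parse a CSV file object into (sentence, label) pairs.

    The column names are assumed, not confirmed by this page.
    """
    reader = csv.DictReader(handle)
    return [(row[text_col], int(row[label_col])) for row in reader]


def label_counts(examples):
    """Count how many sentences carry each binary label."""
    return Counter(label for _, label in examples)


# Tiny in-memory stand-in for a real split (hypothetical rows).
sample = io.StringIO(
    "id,sentence,acceptable\n"
    "0,Он пришёл домой.,1\n"
    "1,Он пришёл дома в.,0\n"
    "2,Мы читаем книгу.,1\n"
)
examples = read_examples(sample)
print(label_counts(examples))
```

To read a real file, pass an open file object (for example one opened with `encoding="utf-8"` and `newline=""`) in place of the in-memory sample.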

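Dataset sources

Sources of expert-written sentences:
- Including, but not limited to, "Проект корпусного описания русской грамматики", "Тестелец, Я.Г.", and "Лютикова, Е.А.".
Sources of machine-generated sentences:
- Datasets including Tatoeba, WikiMatrix, TED, and YandexCorpus, together with EasyNMT and paraphrase generation models.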

Citation

If you use RuCoLA in your research, please cite it as follows:

@inproceedings{mikhailov-etal-2022-rucola,
    title = "{R}u{C}o{LA}: {R}ussian Corpus of Linguistic Acceptability",
    author = "Mikhailov, Vladislav and Shamardina, Tatiana and Ryabinin, Max and Pestova, Alena and Smurov, Ivan and Artemova, Ekaterina",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.348",
    pages = "5207--5227",
    abstract = "Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard to assess the linguistic competence of language models for Russian.",
}

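Compiled summary

Dataset introduction

Construction

RuCoLA combines sentences from linguistic publications with machine-generated sentences and covers phenomena ranging from syntax to semantics. Construction involved extracting expert-written sentences from the linguistics literature and collecting sentences produced by generative models, which gives the data diversity and breadth. The dataset also pays particular attention to hallucinations in generative model output, in support of methods for identifying errors in natural language.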

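Usage

To use RuCoLA, obtain the data from the Hugging Face platform and run acceptability classification experiments with the provided baselines. The provided Python scripts train and evaluate models such as scikit-learn models, ruBERT, and ruT5. Pretrained model checkpoints are also provided, so users can apply them to downstream tasks directly without training from scratch. The dataset is meant to advance research on Russian linguistic acceptability and to supply data for improving generative models.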
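Binary acceptability classifiers are commonly scored with accuracy and the Matthews correlation coefficient (MCC). The pure-Python sketch below computes both on made-up labels. Treating 1 as "acceptable" and 0 as "unacceptable" is an assumption, and the function names are illustrative, not part of the repository's scripts.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Invented gold labels and predictions, for illustration only.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(f"accuracy={accuracy(gold, pred):.3f} mcc={mcc(gold, pred):.3f}")
```

A classifier that always predicts the same label gets an MCC of 0 under this convention, which is why MCC is often preferred over accuracy when the two labels are imbalanced.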

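Background and challenges

Background overview

RuCoLA (Russian Corpus of Linguistic Acceptability) is a dataset for Russian linguistic acceptability judgments, created in 2022 by researchers working on Russian natural language processing. It consists of expert-written sentences from the linguistics literature and machine-generated sentences, covering phenomena that range from syntax and semantics to generative model hallucinations. RuCoLA is intended to advance methods for identifying errors in natural language and to serve as a benchmark for evaluating the linguistic competence of Russian language models. The work was led by the RussianNLP team and published at EMNLP 2022, making it a notable resource for Russian natural language processing.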

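Current challenges

The central challenge RuCoLA poses is judging the linguistic acceptability of Russian sentences accurately: existing language models still lag well behind human performance, especially on morphological and semantic errors. Building the dataset raised its own difficulties, including extracting high-quality sentences from a broad linguistics literature and producing machine-generated sentences representative enough to cover diverse phenomena. Both require solid linguistic expertise and rigorous annotation and evaluation procedures to keep the dataset reliable and useful.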

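Common scenarios

Typical use cases

RuCoLA is used in natural language processing for research on linguistic acceptability (LA). By pairing Russian sentences with binary acceptability judgments, it gives researchers a standard tool for evaluating how language models handle syntax, semantics, and related phenomena. It is particularly relevant to detecting generative model hallucinations and identifying errors, which supports the development of more accurate language models.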

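Academic problems addressed

RuCoLA addresses the problem of evaluating acceptability for language models in Russian. By combining expert-written sentences with machine-generated examples, it provides a benchmark for testing models' syntactic, semantic, and morphological competence. It fills a gap in Russian linguistic acceptability research and contributes to multilingual natural language processing.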

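Practical applications

In practice, RuCoLA is used to develop linguistic acceptability classifiers, which can filter implausible content out of generated text and improve the quality of tasks such as machine translation and text generation. It is also used to evaluate and improve Russian language models, particularly on complex syntactic structures and semantic errors.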
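The filtering use case can be sketched as a threshold over classifier scores. Here `score_fn` stands for any acceptability classifier that returns a probability of acceptability. Both `score_fn` and the toy scorer are hypothetical placeholders, not a model shipped with RuCoLA.

```python
from typing import Callable, Iterable, List


def filter_acceptable(
    candidates: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep candidates whose acceptability score reaches the threshold."""
    return [text for text in candidates if score_fn(text) >= threshold]


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer: rejects immediately repeated words.

    A real system would call a trained acceptability classifier here.
    """
    words = text.lower().split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return 1.0 if repeats == 0 else 0.0


# Invented generator outputs, for illustration only.
outputs = ["Кошка спит на диване.", "Кошка спит спит на диване."]
kept = filter_acceptable(outputs, toy_score)
print(kept)
```

The threshold trades recall for precision: raising it discards more borderline sentences, so it is best tuned on held-out data for the task at hand.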

Recent research on the dataset