RuCoLA
RuCoLA Dataset Overview
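Dataset Description
RuCoLA (Russian Corpus of Linguistic Acceptability) is a dataset of Russian sentences annotated with binary acceptability judgments. It combines expert-written sentences drawn from linguistic publications with machine-generated examples. According to the paper's abstract, it contains 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.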
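Dataset Contents
- Source types:
  - Expert-written sentences: drawn from linguistic publications.
  - Machine-generated sentences: produced by generative models applied to text from corpora including, but not limited to, Tatoeba, WikiMatrix, TED, and YandexCorpus.
- Coverage: a range of linguistic phenomena, from syntax and semantics to generative-model hallucinations.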
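Intended Use
RuCoLA is intended to foster the development of methods for detecting errors in natural language. Progress on this problem is tracked on a public leaderboard (https://rucola-benchmark.com/).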
Dataset Availability
The dataset is available on HuggingFace and in the data/ folder of this repository.
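As a minimal sketch of working with a local copy, the snippet below parses acceptability-labeled rows from CSV text and counts the labels. The CSV format, the column names `sentence` and `label`, and the 1/0 label encoding are assumptions made for illustration, not details confirmed by this page. The two sample rows are made-up toy sentences, not taken from RuCoLA.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, NamedTuple


class Example(NamedTuple):
    sentence: str
    label: int  # assumed encoding: 1 = acceptable, 0 = unacceptable


def read_examples(lines: Iterable[str]) -> List[Example]:
    """Parse CSV rows that have 'sentence' and 'label' columns (assumed names)."""
    reader = csv.DictReader(lines)
    return [Example(row["sentence"], int(row["label"])) for row in reader]


def label_counts(examples: Iterable[Example]) -> Counter:
    """Count how many examples carry each label."""
    return Counter(ex.label for ex in examples)


# In-memory stand-in for a file under data/; toy rows, not real RuCoLA data.
SAMPLE = io.StringIO(
    "sentence,label\n"
    "Я читаю книгу.,1\n"
    "Я читать книгу.,0\n"
)

examples = read_examples(SAMPLE)
counts = label_counts(examples)
```

To apply the same functions to a file from the data/ folder, pass an open file handle (opened with encoding="utf-8" and newline="") in place of SAMPLE, after adjusting the column names to match the actual file.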
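Dataset Sources
- Expert-written sentence sources:
  - Include, but are not limited to, “Проект корпусного описания русской грамматики”, “Тестелец, Я.Г.”, and “Лютикова, Е.А.”.
- Machine-generated sentence sources:
  - Include the Tatoeba, WikiMatrix, TED, and YandexCorpus datasets, among others, along with the EasyNMT and paraphrase generation models.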
Citation
If you use RuCoLA in your research, please cite it as follows:
@inproceedings{mikhailov-etal-2022-rucola,
    title = "{R}u{C}o{LA}: {R}ussian Corpus of Linguistic Acceptability",
    author = "Mikhailov, Vladislav and Shamardina, Tatiana and Ryabinin, Max and Pestova, Alena and Smurov, Ivan and Artemova, Ekaterina",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.348",
    pages = "5207--5227",
    abstract = "Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard to assess the linguistic competence of language models for Russian.",
}




