spellcheck_punctuation_benchmark
Hosted on ModelScope (魔搭社区). Updated 2025-07-03; indexed 2025-05-31.
Download link: https://modelscope.cn/datasets/ai-forever/spellcheck_punctuation_benchmark
# Dataset Card for Russian Spellcheck Punctuation Benchmark
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository:** [SAGE](https://github.com/ai-forever/sage)
- **Paper:** [EACL 2024 paper](https://aclanthology.org/2024.findings-eacl.10/)
- **Point of Contact:** nikita.martynov.98@list.ru
### Dataset Summary
The collection is an updated version of the [Russian Spellcheck Benchmark](https://huggingface.co/datasets/ai-forever/spellcheck_benchmark) with punctuation corrected.
The benchmark includes four datasets, each consisting of pairs of sentences in Russian.
Each pair consists of a source sentence, which may contain spelling and punctuation errors, and its corresponding correction.
The datasets were gathered from various sources and domains, including social networks, internet blogs, GitHub commits, medical anamneses, literature, news, reviews, and more.
All datasets passed through a two-stage manual labeling pipeline.
The correction of a sentence is defined by the agreement of at least two human annotators.
The manual labeling scheme accounts for jargon, collocations, and common language, so in some cases it encourages annotators not to amend a word in order to preserve the style of the text.
This leniency does not apply to punctuation: punctuation marks are rigorously corrected in accordance with the rules of the Russian punctuation system.
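The card does not say how agreement between annotators is computed. As a minimal sketch, assuming agreement means that at least two annotators produced exactly the same corrected string, the rule could be modelled as follows (the function name and the `min_votes` parameter are hypothetical):

```python
from collections import Counter
from typing import Optional


def agreed_correction(candidates: list[str], min_votes: int = 2) -> Optional[str]:
    """Return the correction proposed by at least `min_votes` annotators, else None.

    Assumption: agreement is exact string equality after trimming surrounding
    whitespace, so any spelling or punctuation difference counts as disagreement.
    """
    if not candidates:
        return None
    counts = Counter(candidate.strip() for candidate in candidates)
    best, votes = counts.most_common(1)[0]
    return best if votes >= min_votes else None
```

A sentence for which no two annotators agree yields `None` and would need another labeling round under this model.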
### Supported Tasks and Leaderboards
- **Task:** automatic spelling correction.
- **Metrics:** https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf.
- **ERRANT:** https://github.com/chrisjbryant/errant.
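To give a feel for edit-based evaluation, here is a simplified sketch of precision, recall, and F1 over token-level edits. It is not the metric from the linked paper and it is not ERRANT; it only assumes whitespace tokenisation and uses Python's standard `difflib` for alignment. The names `token_edits` and `edit_prf` are hypothetical.

```python
from difflib import SequenceMatcher

Edit = tuple[int, int, tuple[str, ...]]


def token_edits(source: str, target: str) -> set[Edit]:
    """Edits that turn `source` into `target`, as (start, end, replacement) over source tokens."""
    src, tgt = source.split(), target.split()
    edits: set[Edit] = set()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and i2 - i1 == j2 - j1:
            # Split equal-length replacements into one edit per token so that
            # a system fixing only one of two adjacent words still gets credit.
            for k in range(i2 - i1):
                edits.add((i1 + k, i1 + k + 1, (tgt[j1 + k],)))
        else:
            edits.add((i1, i2, tuple(tgt[j1:j2])))
    return edits


def edit_prf(source: str, reference: str, hypothesis: str) -> tuple[float, float, float]:
    """Precision, recall and F1 of hypothesis edits against reference edits."""
    gold = token_edits(source, reference)
    pred = token_edits(source, hypothesis)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Because punctuation stays attached to the neighbouring word under whitespace tokenisation, a missing comma counts as an edit to that word. For reported numbers, use the official metric implementations linked above.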
### Languages
Russian.
## Dataset Structure
### Data Instances
#### RUSpellRU
- **Size of downloaded dataset files:** 3.65 Mb
- **Size of the generated dataset:** 1.31 Mb
- **Total amount of disk used:** 4.96 Mb
An example from the "train" / "test" splits looks as follows:
```
{
"source": "давольно милый и летом и зимой обогреваемый теплым солнушком",
"correction": "Довольно милый, и летом, и зимой обогреваемый тёплым солнышком.",
}
```
#### MultidomainGold
- **Size of downloaded dataset files:** 15.03 Mb
- **Size of the generated dataset:** 5.43 Mb
- **Total amount of disk used:** 20.46 Mb
An example from the "test" split looks as follows:
```
{
"source": "для меня всё материальное тленно и лишь находясь в гармонии-для начала с собой-можно радовацца чужому счастью искренне",
"correction": "Для меня всё материальное тленно, и лишь находясь в гармонии - для начала с собой - можно радоваться чужому счастью искренне.",
"domain": "web",
}
```
#### MedSpellcheck
- **Size of downloaded dataset files:** 1.49 Mb
- **Size of the generated dataset:** 0.54 Mb
- **Total amount of disk used:** 2.03 Mb
An example from the "test" split looks as follows:
```
{
"source": "Накануне (18.02.2012 г",
"correction": "Накануне (18.02.2012 г.).",
}
```
#### GitHubTypoCorpusRu
- **Size of downloaded dataset files:** 1.23 Mb
- **Size of the generated dataset:** 0.48 Mb
- **Total amount of disk used:** 1.71 Mb
An example from the "test" split looks as follows:
```
{
"source": "text: Пожалуйста выберите чат, чтобы начать общение",
"correction": "text: Пожалуйста, выберите чат, чтобы начать общение.",
}
```
### Data Fields
#### RUSpellRU
- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature
#### MultidomainGold
- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature
#### MedSpellcheck
- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature
#### GitHubTypoCorpusRu
- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature
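The field lists above can be mirrored in a small typed record. The sketch below is an illustration only: the class name `SpellcheckPair` is hypothetical, and `domain` is treated as optional because some of the example instances shown earlier omit it.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SpellcheckPair:
    source: str                   # sentence that may contain errors
    correction: str               # corrected sentence
    domain: Optional[str] = None  # assumption: allowed to be absent

    @classmethod
    def from_record(cls, record: Mapping[str, Any]) -> "SpellcheckPair":
        """Validate a raw record and convert it to a SpellcheckPair."""
        for name in ("source", "correction"):
            if not isinstance(record.get(name), str):
                raise ValueError(f"field {name!r} must be a string")
        domain = record.get("domain")
        if domain is not None and not isinstance(domain, str):
            raise ValueError("field 'domain' must be a string when present")
        return cls(record["source"], record["correction"], domain)
```

For example, `SpellcheckPair.from_record({"source": "Накануне (18.02.2012 г", "correction": "Накануне (18.02.2012 г.)."})` yields a record whose `domain` is `None`.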
### Data Splits
#### RUSpellRU
| |train|test|
|---|---:|---:|
|RUSpellRU|2000|2008|
#### MultidomainGold
| |train|test|
|---|---:|---:|
|web|385|756|
|news|361|245|
|social_media|430|200|
|reviews|583|585|
|subtitles|1810|1810|
|strategic_documents|-|250|
|literature|-|260|
#### MedSpellcheck
| |test|
|---|---:|
|MedSpellcheck|1054|
#### GitHubTypoCorpusRu
| |test|
|---|---:|
|GitHubTypoCorpusRu|868|
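The per-domain counts can be summed to get overall split sizes. The snippet below copies the numbers from the tables above and adds them up; the dictionary layout and the `totals` helper are only an illustration.

```python
from typing import Optional

# (train, test) sizes copied from the tables above; None marks a missing split ("-").
Sizes = dict[str, tuple[Optional[int], int]]

MULTIDOMAIN_GOLD: Sizes = {
    "web": (385, 756),
    "news": (361, 245),
    "social_media": (430, 200),
    "reviews": (583, 585),
    "subtitles": (1810, 1810),
    "strategic_documents": (None, 250),
    "literature": (None, 260),
}

OTHER_DATASETS: Sizes = {
    "RUSpellRU": (2000, 2008),
    "MedSpellcheck": (None, 1054),
    "GitHubTypoCorpusRu": (None, 868),
}


def totals(rows: Sizes) -> tuple[int, int]:
    """Sum train and test sizes, counting a missing train split as zero."""
    train = sum(train_size or 0 for train_size, _ in rows.values())
    test = sum(test_size for _, test_size in rows.values())
    return train, test
```

With these figures, MultidomainGold totals 3569 training and 4106 test pairs, and the whole benchmark has 5569 training and 8036 test pairs.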
## Dataset Creation
### Source Data
#### Initial Data Collection and Normalization
The datasets were chosen according to two criteria.
The first is domain variation: half of the datasets are drawn from different domains to ensure diversity, while the remaining half are from a single domain.
The second is the presence of spelling (orthographic) and punctuation mistakes: the datasets consist exclusively of mistypings and omit grammatical or more complex errors typical of nonnative speakers.
- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;
- **MultidomainGold**: examples collected from several text sources, including the open web, news, social media, reviews, subtitles, policy documents, and literary works:
  - *Aranea web-corpus* is a family of multilanguage gigaword web-corpora collected from Internet resources. The texts in the corpora are evenly distributed across periods, writing styles, and topics. We randomly picked sentences from Araneum Russicum, which is harvested from the Russian part of the web.
  - *Literature* is a collection of Russian poems and prose from different classical literary works. We randomly picked sentences from the source dataset, which was gathered from Ilibrary, LitLib, and Wikisource.
  - *News*, as the name suggests, covers news articles on various topics such as sports, politics, environment, and economy. The passages are randomly picked from the summarization dataset Gazeta.ru.
  - *Social media* is the text domain from social media platforms, marked with specific hashtags. These texts are typically short, written in an informal style, and may contain slang, emojis, and obscene lexis.
  - *Strategic Documents* is part of the dataset collected by the Ministry of Economic Development of the Russian Federation. The texts are written in a bureaucratic manner, are rich in embedded entities, and have complex syntactic and discourse structures. The full version of the dataset was previously used in the RuREBus shared task.
- **MedSpellcheck**: texts with errors from medical anamneses;
- **GitHubTypoCorpusRu**: spelling errors and typos in commits from [GitHub](https://github.com).
### Annotations
#### Annotation process
We set up a two-stage annotation project on the crowdsourcing platform Toloka:
1. Data gathering stage: we provide the texts with possible mistakes to annotators and ask them to write the sentence correctly;
2. Validation stage: we provide annotators with the pair of sentences (source and its corresponding correction from the previous stage) and ask them to check if the correction is right.
We prepared instructions for annotators for each task. The instructions ask annotators to correct misspellings if it does not alter the original style of the text.
The instructions do not provide rigorous criteria for distinguishing the origin of an error, that is, whether it came from an urge to endow a sentence with particular stylistic features or from an unintentional spelling violation, since it is time-consuming and laborious to describe every possible case of employing slang, dialect, colloquialisms, etc. instead of proper language. The instructions also do not distinguish errors that come from the geographical or social background of the source. Instead, we rely on the annotators' knowledge and understanding of the language since, in this work, the important factor is preserving the original style of the text.
To ensure we receive qualified expertise, we set up a test iteration on a small subset of the data for both stages. We manually validated the test results and selected annotators who processed at least six samples (2% of the total test iteration) and did not make a single error. After the test iteration, we cut 85% and 86% of labellers for the gathering and validation stages, respectively.
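The selection rule from the test iteration (at least six processed samples and no errors) can be written as a simple filter. This is a sketch of the stated criterion only; the record layout and names are hypothetical and do not reflect how results are stored on Toloka.

```python
from dataclasses import dataclass


@dataclass
class AnnotatorResult:
    annotator_id: str
    processed: int  # samples completed in the test iteration
    errors: int     # samples judged incorrect during manual validation


def select_annotators(results: list[AnnotatorResult], min_processed: int = 6) -> list[str]:
    """Keep annotators who processed enough samples and made no errors."""
    return [
        result.annotator_id
        for result in results
        if result.processed >= min_processed and result.errors == 0
    ]
```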
We especially urge annotators to correct mistakes in which the letters "ё", "й", and "щ" are replaced by the corresponding "е", "и", and "ш", and not to explain abbreviations or correct punctuation errors. Each annotator is also warned about potentially sensitive topics in the data (e.g., politics, societal minorities, and religion).
The annotation of punctuation errors has been done in one iteration considering the low variation and difficulty of the task (relative to spelling correction). The annotators have been asked to correct punctuation signs in accordance with the rules of the Russian punctuation system.
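Since spelling and punctuation were annotated in separate passes, it can be useful to tell which pairs differ only in punctuation. The sketch below is one possible heuristic, not part of the benchmark tooling: it removes every character in a Unicode punctuation category, ignores letter case (an assumption, so that sentence-initial capitalisation does not count as a spelling change), and compares what remains.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, lowercase, and collapse whitespace."""
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(kept.casefold().split())


def differs_only_in_punctuation(source: str, correction: str) -> bool:
    """True when the pair differs, but only in punctuation, case, or spacing."""
    return source != correction and strip_punctuation(source) == strip_punctuation(correction)
```

On the examples from this card, the GitHubTypoCorpusRu and MedSpellcheck pairs differ only in punctuation, while the RUSpellRU pair also contains spelling corrections.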
#### Who are the annotators?
Native Russian speakers who passed the language exam.
The annotators for punctuation errors are also professional editors and linguists.
## Considerations for Using the Data
### Discussion of Biases
We clearly state our work's aims and implications, making it open source and transparent. The data will be available under a public license. As our research involved anonymized textual data, informed consent from human participants was not required. However, we obtained permission to access publicly available datasets and ensured compliance with any applicable terms of service or usage policies.
### Other Known Limitations
The data used in our research may be limited to specific domains, preventing comprehensive coverage of all possible text variations. Despite these limitations, we tried to address the issue of data diversity by incorporating single-domain and multi-domain datasets in the proposed research. This approach allowed us to shed light on the diversity and variances within the data, providing valuable insights despite the inherent constraints.
We primarily focus on the Russian language. Further research is needed to expand the datasets for a wider range of languages.
## Additional Information
### Future plans
We are planning to expand our benchmark with both new Russian datasets and datasets in other languages including (but not limited to) European and CIS languages.
If you would like to contribute, please contact us.
### Dataset Curators
Nikita Martynov nikita.martynov.98@list.ru
### Licensing Information
All our datasets are published under the MIT License.
### Citation Information
```
@inproceedings{martynov2023augmentation,
title={Augmentation methods for spelling corruptions},
author={Martynov, Nikita and Baushenko, Mark and Abramov, Alexander and Fenogenova, Alena},
booktitle={Proceedings of the International Conference “Dialogue”},
volume={2023},
year={2023}
}
@inproceedings{martynov-etal-2024-methodology,
title = "A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages",
author = "Martynov, Nikita and
Baushenko, Mark and
Kozlova, Anastasia and
Kolomeytseva, Katerina and
Abramov, Aleksandr and
Fenogenova, Alena",
editor = "Graham, Yvette and
Purver, Matthew",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
month = mar,
year = "2024",
address = "St. Julian{'}s, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-eacl.10",
pages = "138--155",
abstract = "Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping. In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models{'} pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts. We conducted experiments employing various corruption strategies, models{'} architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).",
}
```
Provider: maas
Created: 2025-05-26



