
spellcheck_benchmark

ModelScope community, updated 2025-07-04, indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/ai-forever/spellcheck_benchmark
Resource description:
# Dataset Card for Russian Spellcheck Benchmark

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [SAGE](https://github.com/ai-forever/sage)
- **Paper:** [arXiv:2308.09435](https://arxiv.org/abs/2308.09435)
- **Point of Contact:** nikita.martynov.98@list.ru

### Dataset Summary

Spellcheck Benchmark includes four datasets, each of which consists of pairs of sentences in Russian. Each pair comprises a sentence that may contain spelling errors and its corresponding correction. The datasets were gathered from various sources and domains, including social networks, internet blogs, GitHub commits, medical anamnesis, literature, news, reviews and more.

All datasets were passed through a two-stage manual labeling pipeline. The correction of a sentence is defined by an agreement of at least two human annotators.
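The agreement rule can be illustrated with a minimal sketch. The card does not say how agreement is computed, so the exact-match comparison after whitespace stripping, the function name `agreed_correction`, and the `min_votes` parameter are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Optional, Sequence


def agreed_correction(candidates: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the correction proposed by at least `min_votes` annotators, else None.

    Assumption: two annotators "agree" when their corrections are identical
    after stripping leading/trailing whitespace.
    """
    if not candidates:
        return None
    counts = Counter(c.strip() for c in candidates)
    text, votes = counts.most_common(1)[0]
    return text if votes >= min_votes else None


# Three hypothetical annotator answers for the MedSpellcheck example from this card.
candidates = [
    "Кровотечения, операции в анамнезе отрицает",
    "Кровотечения, операции в анамнезе отрицает ",
    "Кровотечения, поерации в анамнезе отрицает",
]
result = agreed_correction(candidates)
print(result)
```

With the three answers above, the first two match once whitespace is stripped, so the sketch returns the corrected sentence; with no two matching answers it returns `None`.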
The manual labeling scheme accounts for jargon, collocations and common language, so in some cases it encourages annotators not to amend a word in order to preserve the style of the text.

### Supported Tasks and Leaderboards

- **Task:** automatic spelling correction.
- **Metrics:** https://www.dialog-21.ru/media/3427/sorokinaaetal.pdf.

### Languages

Russian.

## Dataset Structure

### Data Instances

#### RUSpellRU

- **Size of downloaded dataset files:** 3.64 MB
- **Size of the generated dataset:** 1.29 MB
- **Total amount of disk used:** 4.93 MB

An example of "train" / "test" looks as follows:
```
{
    "source": "очень классная тетка ктобы что не говорил.",
    "correction": "очень классная тетка кто бы что ни говорил"
}
```

#### MultidomainGold

- **Size of downloaded dataset files:** 15.05 MB
- **Size of the generated dataset:** 5.43 MB
- **Total amount of disk used:** 20.48 MB

An example of "test" looks as follows:
```
{
    "source": "Ну что могу сказать... Я заказала 2 вязанных платья: за 1000 руб (у др продавца) и это ща 1200. Это платье- голимая синтетика (в том платье в составе была шерсть). Это платье как очень плохая резинка. На свои параметры (83-60-85) я заказала С . Пока одевала/снимала - оно в горловине растянулось. Помимо этого в этом платье я выгляжу ну очень тоской. У меня вес 43 кг на 165 см роста. Кстати, продавец отправлял платье очень долго. Я пыталась отказаться от заказа, но он постоянно отклонял мой запрос. В общем не советую.",
    "correction": "Ну что могу сказать... Я заказала 2 вязаных платья: за 1000 руб (у др продавца) и это ща 1200. Это платье- голимая синтетика (в том платье в составе была шерсть). Это платье как очень плохая резинка. На свои параметры (83-60-85) я заказала С . Пока надевала/снимала - оно в горловине растянулось. Помимо этого в этом платье я выгляжу ну очень доской. У меня вес 43 кг на 165 см роста. Кстати, продавец отправлял платье очень долго. Я пыталась отказаться от заказа, но он постоянно отклонял мой запрос. В общем не советую.",
    "domain": "reviews"
}
```

#### MedSpellcheck

- **Size of downloaded dataset files:** 1.49 MB
- **Size of the generated dataset:** 0.54 MB
- **Total amount of disk used:** 2.03 MB

An example of "test" looks as follows:
```
{
    "source": "Кровотечения, поерации в анамнезе отрицает",
    "correction": "Кровотечения, операции в анамнезе отрицает"
}
```

#### GitHubTypoCorpusRu

- **Size of downloaded dataset files:** 1.23 MB
- **Size of the generated dataset:** 0.48 MB
- **Total amount of disk used:** 1.71 MB

An example of "test" looks as follows:
```
{
    "source": "## Запросы и ответа содержат заголовки",
    "correction": "## Запросы и ответы содержат заголовки"
}
```

### Data Fields

#### RUSpellRU

- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature

#### MultidomainGold

- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature

#### MedSpellcheck

- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature

#### GitHubTypoCorpusRu

- `source`: a `string` feature
- `correction`: a `string` feature
- `domain`: a `string` feature

### Data Splits

#### RUSpellRU

| |train|test|
|---|---:|---:|
|RUSpellRU|2000|2008|

#### MultidomainGold

| |train|test|
|---|---:|---:|
|web|386|756|
|news|361|245|
|social_media|430|200|
|reviews|584|586|
|subtitles|1810|1810|
|strategic_documents|-|250|
|literature|-|260|

#### MedSpellcheck

| |test|
|---|---:|
|MedSpellcheck|1054|

#### GitHubTypoCorpusRu

| |test|
|---|---:|
|GitHubTypoCorpusRu|868|

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

The datasets are chosen in accordance with the following criteria. First, domain variation: half of the datasets are chosen from different domains to ensure diversity, while the remaining half are from a single domain.
Another criterion is the type of error: the datasets comprise only spelling (orthographic) mistakes and mistypings, leaving out grammatical errors and the more complex errors made by non-native speakers.

- **RUSpellRU**: texts collected from [LiveJournal](https://www.livejournal.com/media), with manually corrected typos and errors;
- **MultidomainGold**: examples collected from several text sources, including the open web, news, social media, reviews, subtitles, policy documents and literary works:
  - *Aranea web-corpus* is a family of multilanguage gigaword web-corpora collected from Internet resources. The texts in the corpora are evenly distributed across periods, writing styles and topics they cover. We randomly picked the sentences from Araneum Russicum, which is harvested from the Russian part of the web.
  - *Literature* is a collection of Russian poems and prose from different classical literary works. We randomly picked sentences from the source dataset, which was gathered from Ilibrary, LitLib, and Wikisource.
  - *News*, as the name suggests, covers news articles on various topics such as sports, politics, environment, economy etc. The passages are randomly picked from the summarization dataset Gazeta.ru.
  - *Social media* is the text domain from social media platforms marked with specific hashtags. These texts are typically short, written in an informal style and may contain slang, emojis and obscene lexis.
  - *Strategic Documents* is part of the dataset collected by the Ministry of Economic Development of the Russian Federation. Texts are written in a bureaucratic manner, are rich in embedded entities, and have complex syntactic and discourse structures. The full version of the dataset has been previously used in the RuREBus shared task.
- **MedSpellChecker**: texts with errors from medical anamnesis;
- **GitHubTypoCorpusRu**: spelling errors and typos in commits from [GitHub](https://github.com).

### Annotations

#### Annotation process

We set up a two-stage annotation project on the crowdsourcing platform Toloka:

1. Data gathering stage: we provide the texts with possible mistakes to annotators and ask them to write the sentence correctly;
2. Validation stage: we provide annotators with the pair of sentences (source and its corresponding correction from the previous stage) and ask them to check if the correction is right.

We prepared instructions for annotators for each task. The instructions ask annotators to correct misspellings if doing so does not alter the original style of the text. The instructions do not provide rigorous criteria for distinguishing the origin of an error, that is, whether it came from an urge to endow a sentence with particular stylistic features or from an unintentional spelling violation, since it is time-consuming and laborious to describe every possible case of employing slang, dialect, colloquialisms, etc. instead of proper language. The instructions also do not distinguish errors that come from the geographical or social background of the source. Instead, we rely on annotators’ knowledge and understanding of the language since, in this work, the important factor is to preserve the original style of the text.

To ensure we receive qualified expertise, we set up a test iteration on a small subset of the data for both stages. We manually validated the test results and selected annotators who processed at least six samples (2% of the total test iteration) and did not make a single error. After the test iteration, we cut 85% and 86% of labellers for the gathering and validation stages, respectively.
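The selection rule for the test iteration (at least six processed samples and no errors) can be sketched as a simple filter. This is an illustration only: the `TrialRecord` structure, its field names, and the function `select_annotators` are hypothetical and do not come from the authors' Toloka setup.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TrialRecord:
    """One annotator's outcome in the test iteration (hypothetical structure)."""
    annotator_id: str
    samples_done: int  # number of test samples the annotator processed
    errors: int        # number of samples judged wrong during manual validation


def select_annotators(records: Iterable[TrialRecord], min_samples: int = 6) -> List[str]:
    """Keep annotators who processed at least `min_samples` samples with zero errors."""
    return [
        r.annotator_id
        for r in records
        if r.samples_done >= min_samples and r.errors == 0
    ]


# Made-up records, used only to exercise the filter.
records = [
    TrialRecord("a1", samples_done=12, errors=0),  # passes both conditions
    TrialRecord("a2", samples_done=4, errors=0),   # too few samples
    TrialRecord("a3", samples_done=20, errors=1),  # made an error
]
selected = select_annotators(records)
print(selected)  # ['a1']
```

The same filter would be applied separately to the gathering and the validation stage, since the card reports a different share of labellers cut for each.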
We especially urge annotators to correct mistakes in which the letters "ё", "й" and "щ" are replaced by the corresponding "е", "и" and "ш", and not to explain abbreviations or correct punctuation errors. Each annotator is also warned about potentially sensitive topics in the data (e.g., politics, societal minorities, and religion).

#### Who are the annotators?

Native Russian speakers who passed the language exam.

## Considerations for Using the Data

### Discussion of Biases

We clearly state our work’s aims and implications, making it open source and transparent. The data will be available under a public license. As our research involved anonymized textual data, informed consent from human participants was not required. However, we obtained permission to access publicly available datasets and ensured compliance with any applicable terms of service or usage policies.

### Other Known Limitations

The data used in our research may be limited to specific domains, preventing comprehensive coverage of all possible text variations. Despite these limitations, we tried to address the issue of data diversity by incorporating single-domain and multi-domain datasets in the proposed research. This approach allowed us to shed light on the diversity and variances within the data, providing valuable insights despite the inherent constraints.

We primarily focus on the Russian language. Further research is needed to expand the datasets for a wider range of languages.

## Additional Information

### Future plans

We are planning to expand our benchmark with both new Russian datasets and datasets in other languages including (but not limited to) European and CIS languages. If you would like to contribute, please contact us.

### Dataset Curators

Nikita Martynov nikita.martynov.98@list.ru

### Licensing Information

All our datasets are published under the MIT License.

### Citation Information

```
@inproceedings{martynov2023augmentation,
  title={Augmentation methods for spelling corruptions},
  author={Martynov, Nikita and Baushenko, Mark and Abramov, Alexander and Fenogenova, Alena},
  booktitle={Proceedings of the International Conference “Dialogue”},
  volume={2023},
  year={2023}
}

@misc{martynov2023methodology,
  title={A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages},
  author={Nikita Martynov and Mark Baushenko and Anastasia Kozlova and Katerina Kolomeytseva and Aleksandr Abramov and Alena Fenogenova},
  year={2023},
  eprint={2308.09435},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Provider:
maas
Created:
2025-05-26