GitHub Typo Corpus

Name: GitHub Typo Corpus
Creator: Octanove Labs, Seattle, WA, USA; RIKEN AIP, Tokyo, Japan; Tohoku University, Miyagi, Japan
Published: 2019-11-29 06:57:45
License: No description available

Source: arXiv; updated 2019-11-29; indexed 2024-06-21

Download link:

https://github.com/mhagiwara/github-typo-corpus

Description:

The GitHub Typo Corpus is a large-scale multilingual dataset jointly developed by Octanove Labs, RIKEN AIP, and Tohoku University. It collects spelling and grammatical errors, together with their corrections, from commits on GitHub. The dataset contains over 350,000 edits in more than 15 languages, totaling 64 million characters, which made it the largest dataset of misspellings as of its 2019 release. To build it, the researchers extracted eligible repositories and commits, then used language detection and a supervised classifier to filter out text that is not natural (human) language and edits that are not typo-related. The dataset is intended mainly for spelling correction and grammatical error correction, providing a rich resource of naturally occurring errors to support related NLP tasks.
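The filtering idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the keyword list, the size threshold, and all function names are assumptions made for illustration, and the small-edit heuristic is only a stand-in for the supervised classifier the description mentions. It uses only the Python standard library.

```python
import difflib

# Hypothetical keyword list; the corpus authors' actual criteria may differ.
TYPO_KEYWORDS = ("typo", "spelling", "misspell")


def looks_like_typo_commit(message: str) -> bool:
    """Return True if the commit message mentions a typo-related keyword."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in TYPO_KEYWORDS)


def char_edits(src: str, tgt: str):
    """List the (operation, source_span, target_span) edits that turn src into tgt."""
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (op, src[i1:i2], tgt[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def is_small_edit(src: str, tgt: str, max_changed_chars: int = 4) -> bool:
    """Assumed heuristic: accept only edits that change a few characters.

    A real system would replace this with a trained classifier.
    """
    changed = sum(max(len(a), len(b)) for _, a, b in char_edits(src, tgt))
    return 0 < changed <= max_changed_chars


if __name__ == "__main__":
    before = "Please recieve the file"
    after = "Please receive the file"
    if looks_like_typo_commit("Fix typo in README") and is_small_edit(before, after):
        print(char_edits(before, after))
```

A character-count threshold is deliberately crude: it accepts any short edit, including ones that change meaning, which is one reason a learned filter is preferable in practice.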

Provided by:

Octanove Labs, Seattle, WA, USA; RIKEN AIP, Tokyo, Japan; Tohoku University, Miyagi, Japan

Created:

2019-11-29
