the-cramer-project/Misspelled-KG-dataset

Name: the-cramer-project/Misspelled-KG-dataset
Creator: the-cramer-project
Published: 2024-05-05 03:29:50
License: CC-BY-NC-4.0

Hugging Face · Updated 2024-05-05 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/the-cramer-project/Misspelled-KG-dataset

Description:

---
license: cc-by-nc-4.0
language:
- ky
pretty_name: Misspelled_kg_dataset
size_categories:
- 1M<n<10M
---

# Misspelled_kg_dataset

This dataset is derived from https://huggingface.co/datasets/the-cramer-project/Kyrgyz_News_Corpus.

The following preprocessing was applied:

1. All characters other than Kyrgyz letters, Latin letters, and digits were removed.
2. The various dash and hyphen variants were replaced with a single type of dash, the various quotation mark variants were replaced with a single type of quotation mark, and extra spaces were removed.
3. Long news articles were divided into lines so that mean(len) = 102.45 and std(len) = 56.72.
4. Rows in languages other than Kyrgyz were excluded.

Misspelled (trash) text was created using several approaches:

* 1 million trash lines were generated using a probabilistic noiser. This noiser was trained on a "golden dataset" of real errors, which is not public.
* 500 thousand trash lines were generated using a different probabilistic noiser (https://github.com/ai-forever/sage.git).
* The remaining trash lines were created using a random noiser which, for words longer than 5 letters, has a 20% probability of deleting a letter, swapping a letter, replacing a letter with another letter, or inserting any letter.

Punctuation-error (punc_trash) text was created using a random noiser, which has a 20% probability of deleting or inserting a comma and of replacing the period at the end of the sentence with another punctuation mark, such as "!" or "?".

The train and test sets were created with train_test_split, using a train size of 2 million:

* Train size = 2000000
* Test size = 66223

# References

This work was made possible by the robust AI community in Kyrgyzstan and by the contributions of individuals within the AkylAI project (by TheCramer.com). We also thank the Kyrgyz news agencies, whose work allowed us to create this dataset.

# Next

We are working on a Kyrgyz spell checker and grammar corrector. Please feel free to reach out to timur.turat@gmail.com or rkizmailov@gmail.com if you are interested in any form of collaboration!

Provider:

the-cramer-project

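Summary of Original Information

Dataset Overview

Dataset Name

Name: Misspelled_kg_dataset

Dataset Language

Language: Kyrgyz (ky)

Dataset Size

Size: 1M<n<10M

Data Processing

Preprocessing:
- Removed all characters other than Kyrgyz letters, Latin letters, and digits.
- Unified the dash and quotation mark variants and removed extra spaces.
- Split long news articles into lines with a mean length of 102.45 and a standard deviation of 56.72.
- Removed rows in languages other than Kyrgyz.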
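The character filtering and normalization steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `normalize`, the exact set of dash and quote variants, the choice of `-` and `"` as the unified marks, and the decision to keep basic punctuation are all assumptions.

```python
import re

# Kyrgyz Cyrillic alphabet (lowercase), including the letters ң, ө, ү.
LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"
# Assumption: basic punctuation and the space survive the filter.
PUNCT = ' .,!?:;()"-'

# Assumed dash variants (U+2010..U+2015, U+2212) and quote variants.
DASHES = re.compile("[\u2010-\u2015\u2212]")
QUOTES = re.compile("[\u00ab\u00bb\u201c\u201d\u201e\u201f]")
DISALLOWED = re.compile(
    "[^" + LETTERS + LETTERS.upper() + "a-zA-Z0-9" + re.escape(PUNCT) + "]"
)
SPACES = re.compile(r"\s+")


def normalize(text):
    """Unify dashes and quotes, drop disallowed characters, collapse spaces."""
    text = DASHES.sub("-", text)      # one dash type
    text = QUOTES.sub('"', text)      # one quotation mark type
    text = DISALLOWED.sub("", text)   # keep Kyrgyz/Latin letters, digits, punctuation
    return SPACES.sub(" ", text).strip()


print(normalize("Бишкек  —  «шаар»  ☺ 2024"))
```

Dashes and quotes are normalized before the filter runs, so that typographic variants are mapped to the kept mark instead of being deleted.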

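Dataset Contents

Misspelled text generation:
- 1 million misspelled lines were generated with a probabilistic noiser trained on a non-public "golden dataset".
- 500 thousand misspelled lines were generated with a different probabilistic noiser (from https://github.com/ai-forever/sage.git).
- The remaining misspelled lines were generated with a random noiser which, for words longer than 5 letters, has a 20% probability of deleting, swapping, replacing, or inserting a letter.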
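One possible reading of the random noiser is sketched below. The description does not say how the four operations are chosen or how words are tokenized, so this sketch assumes that each word longer than 5 characters is corrupted with probability 0.2, that one of the four operations is then picked uniformly, and that words are split on single spaces (so attached punctuation counts toward word length). The names `noise_word` and `noise_line` are hypothetical.

```python
import random

# Letters used for replacement and insertion (assumed: Kyrgyz lowercase only).
KYRGYZ_LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"


def noise_word(word, rng, p=0.2):
    """With probability p, apply one random edit to a word longer than 5 characters."""
    if len(word) <= 5 or rng.random() >= p:
        return word
    op = rng.choice(["delete", "swap", "replace", "insert"])
    i = rng.randrange(len(word) - 1)  # position i, so that i + 1 is also valid
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        # Swap the letters at positions i and i + 1.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    letter = rng.choice(KYRGYZ_LETTERS)
    if op == "replace":
        return word[:i] + letter + word[i + 1:]
    return word[:i] + letter + word[i:]  # insert


def noise_line(line, seed=None, p=0.2):
    """Apply noise_word to every space-separated word of a line."""
    rng = random.Random(seed)
    return " ".join(noise_word(w, rng, p) for w in line.split(" "))


print(noise_line("Кыргызстандын борбору Бишкек шаары", seed=3))
```

Because the replacement letter is drawn at random, a "replace" can occasionally reproduce the original letter, so the share of actually altered words is slightly below `p`.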
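Punctuation error text generation:
- A random noiser has a 20% probability of deleting or inserting a comma, or of replacing the sentence-final period with another punctuation mark such as "!" or "?".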
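The punctuation noiser could look something like the following sketch. It assumes the 20% probability is applied independently at each existing comma (deletion), at each word boundary without a comma (insertion), and to the final period. The actual granularity is not stated, and the name `noise_punctuation` is hypothetical.

```python
import random


def noise_punctuation(line, seed=None, p=0.2):
    """Randomly delete/insert commas and replace a final period with '!' or '?'."""
    rng = random.Random(seed)
    tokens = line.split(" ")
    out = []
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if token.endswith(","):
            # Existing comma: delete it with probability p.
            if rng.random() < p:
                token = token[:-1]
        elif not is_last and token and token[-1].isalnum():
            # Plain word inside the sentence: insert a comma with probability p.
            if rng.random() < p:
                token += ","
        out.append(token)
    result = " ".join(out)
    # Replace the sentence-final period with probability p.
    if result.endswith(".") and rng.random() < p:
        result = result[:-1] + rng.choice("!?")
    return result


print(noise_punctuation("Мен келдим, бирок ал жок.", seed=7))
```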

Dataset Split

Train size: 2000000
Test size: 66223
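The two sizes imply 2,000,000 + 66,223 = 2,066,223 rows in total. The card names `train_test_split` but not the library or the random seed, so the sketch below swaps in a plain standard-library shuffle-and-cut with the same effect (a fixed-size random train set, with the remainder as test). The function name `split_rows` and the seed are assumptions.

```python
import random


def split_rows(rows, train_size, seed=42):
    """Shuffle rows and cut them into a fixed-size train set and a remainder test set."""
    rows = list(rows)
    if not 0 < train_size < len(rows):
        raise ValueError("train_size must be between 1 and len(rows) - 1")
    random.Random(seed).shuffle(rows)
    return rows[:train_size], rows[train_size:]


# Small-scale demonstration; the real split would use train_size=2_000_000.
train, test = split_rows(range(10), train_size=8)
print(len(train), len(test))
```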

License

License: CC-BY-NC-4.0
