
the-cramer-project/Misspelled-KG-dataset

Source: Hugging Face. Updated 2024-05-05. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/the-cramer-project/Misspelled-KG-dataset
Description:

---
license: cc-by-nc-4.0
language:
- ky
pretty_name: Misspelled_kg_dataset
size_categories:
- 1M<n<10M
---

# Misspelled_kg_dataset

This dataset is based on https://huggingface.co/datasets/the-cramer-project/Kyrgyz_News_Corpus.

The following preprocessing has been carried out:

1. All symbols that do not belong to the Kyrgyz or Latin alphabets and are not digits have been excluded.
2. The various dash/hyphen variants have been replaced with a single type of dash, the various quotation-mark variants have been replaced with a single type of quotation mark, and extra spaces have been removed.
3. Long news articles have been divided into lines so that mean(len) = 102.45 and std(len) = 56.72.
4. Rows in languages other than Kyrgyz have been excluded.

Misspelled (trash) text was created using several approaches:

* 1 million trash lines were generated using a probabilistic noiser. This noiser was trained on a "golden dataset" of real errors, which is not public.
* 500 thousand trash lines were generated using a different probabilistic noiser (https://github.com/ai-forever/sage.git).
* The remaining trash lines were created using a random noiser which, for words longer than 5 letters, has a 20% probability of deleting a letter, swapping a letter, replacing a letter with another letter, or inserting any letter.

Punctuation-error (punc_trash) text was created using a random noiser, which has a 20% probability of deleting or inserting a comma and of replacing the period at the end of the sentence with another punctuation mark, such as "!" or "?".

The train and test datasets were created with train_test_split, using a train size of 2 million:

* Train size = 2000000
* Test size = 66223

# References

All of our achievements were made possible by the robust AI community in Kyrgyzstan and by the contributions of individuals within the AkylAI project (by TheCramer.com).
We also express our gratitude to the Kyrgyz news agencies for their work, which allowed us to create this dataset.

# Next

We are working on a Kyrgyz spell checker and grammar corrector. Please feel free to reach out to timur.turat@gmail.com or rkizmailov@gmail.com if you are interested in any form of collaboration!
Provider:
the-cramer-project
Summary of original information

Dataset overview

Dataset name

  • Name: Misspelled_kg_dataset

Dataset language

  • Language: Kyrgyz (ky)

Dataset size

  • Size: 1M<n<10M

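Dataset processing

  • Preprocessing:
    • Removed all symbols that are not Kyrgyz letters, Latin letters, or digits.
    • Unified the dash and quotation-mark variants and removed extra spaces.
    • Split long news articles into lines so that the mean length is 102.45 and the standard deviation is 56.72.
    • Removed rows in languages other than Kyrgyz.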
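
The normalization steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the exact set of allowed characters, the chosen target dash and quotation mark, the order of the steps, and the names `normalize` and `length_stats` are all assumptions. The sketch keeps basic punctuation, because the punctuation noiser described for this dataset implies that commas and periods survive preprocessing.

```python
import re
import statistics

# Assumption: Kyrgyz Cyrillic alphabet (36 letters); the source does not list the allowed set.
KYRGYZ_LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"

# Assumption: unify dash and quote variants first so they are not dropped by the filter below.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015\u2212]")
QUOTES = re.compile("[\u00ab\u00bb\u201c\u201d\u201e\u201f]")
# Everything outside Kyrgyz letters, Latin letters, digits, whitespace, and basic punctuation.
DISALLOWED = re.compile(
    "[^" + KYRGYZ_LETTERS + KYRGYZ_LETTERS.upper() + r'a-zA-Z0-9\s.,!?:;()"-]'
)
SPACES = re.compile(r"\s+")


def normalize(line: str) -> str:
    """Unify dashes and quotes, drop disallowed symbols, collapse extra spaces."""
    line = DASHES.sub("-", line)
    line = QUOTES.sub('"', line)
    line = DISALLOWED.sub("", line)
    return SPACES.sub(" ", line).strip()


def length_stats(lines: list[str]) -> tuple[float, float]:
    """Mean and population standard deviation of line lengths in characters."""
    lengths = [len(line) for line in lines]
    return statistics.mean(lengths), statistics.pstdev(lengths)
```

Whether the reported mean of 102.45 and standard deviation of 56.72 refer to characters, and whether the standard deviation is the population or the sample variant, is not stated in the source; `length_stats` picks characters and the population variant only for illustration.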

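Dataset content

  • Misspelled text generation:

    • 1 million misspelled lines were generated with a probabilistic noiser trained on a non-public "golden dataset" of real errors.
    • 500 thousand misspelled lines were generated with a different probabilistic noiser (from https://github.com/ai-forever/sage.git).
    • The remaining misspelled lines were generated with a random noiser that, for words longer than 5 letters, has a 20% probability of deleting, swapping, replacing, or inserting a letter.
  • Punctuation error text generation:

    • A random noiser has a 20% probability of deleting or inserting a comma, or of replacing the sentence-final period with another punctuation mark such as "!" or "?".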
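
The two random noisers could look roughly like the sketch below. It is an illustration under stated assumptions, not the published implementation: the source does not say whether the 20% probability applies per word, per comma, or per line, how the four edit operations are weighted, or which alphabet is sampled for replacements and insertions. Here a "word" is a whitespace-delimited token, the probability is applied per token, and the function names are hypothetical.

```python
import random

# Assumption: replacement and insertion letters are drawn from the lowercase Kyrgyz alphabet.
ALPHABET = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"


def noise_word(word: str, rng: random.Random, p: float = 0.2) -> str:
    """With probability p, apply one random letter edit to a word longer than 5 characters."""
    if len(word) <= 5 or rng.random() >= p:
        return word
    i = rng.randrange(len(word) - 1)  # position with a right neighbour, so a swap is valid
    op = rng.choice(("delete", "swap", "replace", "insert"))
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    letter = rng.choice(ALPHABET)
    if op == "replace":
        return word[:i] + letter + word[i + 1:]
    return word[:i] + letter + word[i:]  # insert


def noise_spelling(line: str, rng: random.Random, p: float = 0.2) -> str:
    """Apply noise_word to every space-separated token of the line."""
    return " ".join(noise_word(token, rng, p) for token in line.split(" "))


def noise_punctuation(line: str, rng: random.Random, p: float = 0.2) -> str:
    """Randomly delete or insert commas and replace a final period with "!" or "?"."""
    tokens = line.split(" ")
    out = []
    for idx, token in enumerate(tokens):
        is_last = idx == len(tokens) - 1
        if token.endswith(","):
            if rng.random() < p:
                token = token[:-1]  # delete an existing comma
        elif not is_last and token and token[-1].isalnum() and rng.random() < p:
            token += ","  # insert a comma after a plain word
        out.append(token)
    result = " ".join(out)
    if result.endswith(".") and rng.random() < p:
        result = result[:-1] + rng.choice("!?")
    return result
```

Passing an explicit `random.Random(seed)` keeps the corruption reproducible, which makes it easier to regenerate the same noisy and clean pairs when training a corrector.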

Dataset split

  • Training set size: 2000000
  • Test set size: 66223
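
The two sizes add up to 2,066,223 rows in total, so the test set is about 3.2% of the data. The dataset card mentions `train_test_split` without naming the library or the random seed, so the sketch below is only a standard-library stand-in for a split with a fixed absolute train size; the function name `split_rows` and the `seed` parameter are assumptions.

```python
import random


def split_rows(rows: list, train_size: int, seed: int = 0) -> tuple[list, list]:
    """Shuffle row indices and put the first train_size rows into the train set."""
    if not 0 <= train_size <= len(rows):
        raise ValueError("train_size must be between 0 and len(rows)")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    train = [rows[i] for i in indices[:train_size]]
    test = [rows[i] for i in indices[train_size:]]
    return train, test


# Scaled-down usage: with the full data this would be train_size=2_000_000,
# which leaves 2_066_223 - 2_000_000 = 66_223 rows for the test set.
train, test = split_rows(list(range(100)), train_size=80)
```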

License

  • License: CC-BY-NC-4.0