mlao01/spellchecker-km-news-small

Name: mlao01/spellchecker-km-news-small
Creator: mlao01
Published: 2024-07-15 03:52:02
License: No description available

Source: Hugging Face. Updated 2024-07-15; indexed 2024-07-13.

Download link:

https://hf-mirror.com/datasets/mlao01/spellchecker-km-news-small

Resource overview:

This dataset combines and cleans two Khmer news datasets, keeping only a specified set of characters. It is intended for spelling correction: the misspelled column contains synthetically generated misspelled text, and the corrected column contains the corrected text. The dataset is split into a training set of 10,000 samples and a validation set of 1,000 samples, with a total size of 11,639,536 bytes. The two source datasets come from GitHub repositories.

Provider:

mlao01

Original information summary

Dataset overview

Language

Khmer (km)

Size category

1K < n < 10K

Dataset information

Features

misspelled: string
corrected: string

Splits

train:
- bytes: 10617782
- examples: 10000
validation:
- bytes: 1021754
- examples: 1000

Data size

Download size: 4888380 bytes
Dataset size: 11639536 bytes
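
As a quick consistency check, the per-split figures can be summed and compared with the reported dataset size. The sketch below uses only the numbers listed above; the variable names are illustrative.

```python
# Figures copied from the split and size listings above.
SPLITS = {
    "train": {"num_bytes": 10617782, "num_examples": 10000},
    "validation": {"num_bytes": 1021754, "num_examples": 1000},
}
DATASET_SIZE = 11639536   # bytes, as reported
DOWNLOAD_SIZE = 4888380   # bytes, as reported

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Average size of one example and the ratio of download size to dataset size.
avg_bytes_per_example = total_bytes / total_examples
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(total_bytes == DATASET_SIZE)
print(total_examples)
print(round(avg_bytes_per_example, 2), round(download_ratio, 2))
```

The two split sizes add up exactly to the reported dataset size, and the download is roughly 0.42 of that size.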

Configuration

config_name: default
- data_files:
  - train: data/train-*
  - validation: data/validation-*
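
One way to fetch both splits is the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is reachable under the name listed on this page; the helper name `load_splits` and the `EXPECTED_ROWS` table are illustrative, not part of the dataset.

```python
from typing import Any, Dict

REPO_ID = "mlao01/spellchecker-km-news-small"

# Row counts per split, as listed on this page.
EXPECTED_ROWS: Dict[str, int] = {"train": 10000, "validation": 1000}


def load_splits(repo_id: str = REPO_ID) -> Any:
    """Download the default configuration and return its splits."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id)


# Example usage (requires network access):
#   ds = load_splits()
#   for name, rows in EXPECTED_ROWS.items():
#       assert ds[name].num_rows == rows
#   first = ds["train"][0]
#   print(first["misspelled"], first["corrected"])
```

Each record should expose the two string columns described above, `misspelled` and `corrected`.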

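Data sources

Two Khmer news datasets were merged and cleaned so that only the specified characters remain.
The misspelled text is synthetically generated.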
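
The page does not say how the misspellings were produced. Purely as an illustration of what synthetic corruption can look like, the sketch below applies random character-level deletions, substitutions, and duplications. The function `corrupt`, its parameters, and the choice of operations are assumptions, not the dataset author's method.

```python
import random
from typing import Optional


def corrupt(text: str, alphabet: str, error_rate: float = 0.1,
            seed: Optional[int] = None) -> str:
    """Return a noisy copy of `text` (hypothetical noise model)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Keep the character unchanged most of the time.
        if rng.random() >= error_rate:
            out.append(ch)
            continue
        op = rng.choice(("delete", "substitute", "duplicate"))
        if op == "delete":
            continue                          # drop the character
        if op == "substitute":
            out.append(rng.choice(alphabet))  # replace with a random character
        else:
            out.append(ch)                    # emit the character twice
            out.append(ch)
    return "".join(out)


clean = "កខគឃង"
noisy = corrupt(clean, alphabet="កខគឃង", error_rate=0.3, seed=0)
print(clean, noisy)
```

Pairing each clean sentence with its corrupted copy would give rows shaped like the `corrected` and `misspelled` columns.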

Character selection

Included characters: "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឨឩឪឫឬឭឮឯឰឱឲឳាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្៓។៕ៗ៛០១២៣៤៥៦៧៨៩"
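
A character whitelist like this can be applied with a simple membership test. The sketch below copies the character string from the listing above; the helper names are illustrative, and it assumes every character outside the list is dropped, including whitespace, since the page does not say how such characters were handled during cleaning.

```python
# Character set copied from the listing above.
ALLOWED_CHARS = frozenset(
    "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឨឩឪឫឬឭឮឯឰឱឲឳ"
    "ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្៓។៕ៗ៛០១២៣៤៥៦៧៨៩"
)


def is_clean(text: str) -> bool:
    """True if every character of `text` is in the allowed set."""
    return all(ch in ALLOWED_CHARS for ch in text)


def strip_disallowed(text: str) -> str:
    """Drop every character that is not in the allowed set (assumed policy)."""
    return "".join(ch for ch in text if ch in ALLOWED_CHARS)


print(is_clean("កខគ"))             # Khmer consonants only
print(is_clean("កa"))              # contains a Latin letter
print(strip_disallowed("កaខ 1០"))  # Latin letter, space, and ASCII digit removed
```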
