Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern

Name: Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern
Creator: International University (国际大学)
Published: 2020-05-21 23:49:16
License: Not specified

Source: arXiv | Updated: 2020-05-21 | Indexed: 2024-08-06

Download link:

http://arxiv.org/abs/2003.03484v2


Resource description:

This study presents a dataset titled "Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern", developed by International University (国际大学). The dataset contains the 8,637 most frequent Bengali words, collected by web crawling several well-known news websites over the period 2017 to 2019. To build it, the authors analyzed Bengali writing patterns and then automatically generated misspelled words based on the English QWERTY keyboard layout. The dataset is intended mainly for evaluating Bengali spell checkers, with a focus on the spelling errors that arise when Bengali is typed on an English keyboard.
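The general idea of generating keyboard-driven misspellings can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' procedure: the listing does not describe their error rules, so the three edit types (neighbor-key substitution, deletion, transposition), the same-row adjacency map, the function names, and the use of a romanized keystroke string such as "bangla" as input are all assumptions made for this example.

```python
import random

# Assumption: only letter keys, and only left/right neighbors in the same row.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to the keys directly left and right of it in its row."""
    neighbors = {}
    for row in rows:
        for i, key in enumerate(row):
            adjacent = []
            if i > 0:
                adjacent.append(row[i - 1])
            if i < len(row) - 1:
                adjacent.append(row[i + 1])
            neighbors[key] = adjacent
    return neighbors


NEIGHBORS = build_neighbors(QWERTY_ROWS)


def make_typo(keys, rng):
    """Apply one random edit to a keystroke string (hypothetical error model).

    Strings shorter than two characters are returned unchanged. A transposition
    of two identical adjacent keys also leaves the string unchanged.
    """
    if len(keys) < 2:
        return keys
    op = rng.choice(["substitute", "delete", "transpose"])
    i = rng.randrange(len(keys))
    if op == "substitute" and keys[i] in NEIGHBORS:
        # Replace the key with a physically adjacent one.
        return keys[:i] + rng.choice(NEIGHBORS[keys[i]]) + keys[i + 1:]
    if op == "delete":
        # Drop one keystroke.
        return keys[:i] + keys[i + 1:]
    # Transpose two adjacent keystrokes (also the fallback for unmapped keys).
    j = min(i, len(keys) - 2)
    return keys[:j] + keys[j + 1] + keys[j] + keys[j + 2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(make_typo("bangla", rng))
```

Pairing each generated string with its correct source word would give (misspelled, correct) pairs of the kind a spell checker can be scored against. A fuller model would also need to map keystrokes to Bengali characters, which this sketch leaves out.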

Provider:

International University (国际大学)

Created:

2020-03-07
