Bengali.AI Handwritten Graphemes

Source: OpenDataLab · Updated: 2026-05-24 · Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/Bengali_AI_Handwritten_Graphemes
Resource overview:
This dataset contains images of individual handwritten Bengali characters. Bengali characters (graphemes) are written by combining three components: a grapheme root, a vowel diacritic, and a consonant diacritic. The task is to classify the three components of the grapheme in each image. There are roughly 10,000 possible graphemes, of which about 1,000 appear in the training set. The test set includes some graphemes that are absent from the training set, but it introduces no new grapheme components. Collecting a useful amount of real data takes many volunteers filling out handwriting forms; framing the problem as component classification rather than whole-grapheme recognition should make it possible to build a Bengali OCR system without handwritten samples of all 10,000 graphemes.
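The component-based framing can be illustrated with a minimal sketch. It assumes each grapheme label is a triple of integer component IDs; the `Grapheme` type, the helper names, and the toy label values are hypothetical and are not taken from the dataset files. The sketch checks the property described above: a test grapheme may be unseen as a whole while every one of its components is seen in training.

```python
from typing import Dict, List, NamedTuple, Set


class Grapheme(NamedTuple):
    """Hypothetical label triple: one ID per grapheme component."""
    root: int
    vowel: int
    consonant: int


def seen_components(samples: List[Grapheme]) -> Dict[str, Set[int]]:
    """Collect the set of component IDs that occur in a list of labels."""
    return {
        "root": {g.root for g in samples},
        "vowel": {g.vowel for g in samples},
        "consonant": {g.consonant for g in samples},
    }


def is_covered(g: Grapheme, seen: Dict[str, Set[int]]) -> bool:
    """True if every component of g occurs somewhere in the training labels."""
    return (
        g.root in seen["root"]
        and g.vowel in seen["vowel"]
        and g.consonant in seen["consonant"]
    )


# Toy labels (made up for illustration only).
train = [Grapheme(0, 1, 0), Grapheme(1, 0, 2), Grapheme(2, 1, 1)]
test = [Grapheme(0, 0, 2), Grapheme(2, 1, 0)]

seen = seen_components(train)
train_set = set(train)

# Graphemes in the test set that never appear as a whole in training.
unseen_graphemes = [g for g in test if g not in train_set]

# Every test grapheme is still built only from components seen in training.
all_covered = all(is_covered(g, seen) for g in test)
```

In this toy setup both test graphemes are new combinations, yet a model that predicts the root, vowel diacritic, and consonant diacritic separately could in principle still label them, which is the motivation the description gives for component-level classification.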
Provider:
OpenDataLab
Created:
2022-09-01
Collected summary
Dataset introduction
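Background and challenges
Background overview
This dataset contains images of handwritten Bengali characters and focuses on classifying three components (grapheme root, vowel diacritic, and consonant diacritic) to support the development of Bengali OCR systems. The training set covers about 1,000 of the roughly 10,000 possible graphemes, while the test set includes grapheme combinations not seen in training.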
The content above was collected and summarized by 遇见数据集.