OCR Telugu Image Dataset

Name: OCR Telugu Image Dataset
Creator: IEEE Dataport
License: Not specified

Indexed from ieee-dataport.org on 2025-01-21

Download link:

https://ieee-dataport.org/documents/ocr-telugu-image-dataset

Resource description:

The choice of dataset is key for OCR systems. Unfortunately, there are very few works on Telugu character datasets. The work by Pramod et al. covers 500 words, with an average of 50 images per category rendered in 50 fonts and four styles as training data, each image of size 48x48. They used the most frequently occurring words in Telugu but were unable to cover all the words in the language. Later works operated at the character level. The dataset by Hastie has 460 classes and 160 samples per class, made up of 500 images. However, these works do not cover all the possible combinations of Vatu and Gunitha.

Here, we propose a dataset that takes into consideration all possible combinations of Vatu and Gunitha, with 17,387 categories and nearly 560 samples per class. All the images are of size 32x32. There are 6,757,044 training samples, 972,309 validation samples, and 1,934,190 test samples, which add up to about 2 GB of images. Each character has been augmented with 20 different downloaded fonts, 5 different sizes, random rotations, additive Gaussian noise, and spatial transformations. Our dataset is novel because, unlike other datasets, which only take into account the commonly occurring permutations of characters and Vatu, it spans the entire Telugu alphabet and the corresponding Vatu and Gunitha.

Machine learning with deep neural network (DNN) models is called "deep learning." A DNN is a computational model in which many "neurons" are stacked in layers. An input is given to the network at the first layer, then transformed and passed down through the network to the final layer, which emits the output. The gradient descent technique [Barzilai & Borwein, 1988] is commonly used to update the network's weights. Training is organized into epochs and batches: an epoch is one full pass over the training set, and each epoch is divided into batches.
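
The relationship between gradient descent, batches, and epochs can be illustrated with a minimal sketch. It fits a toy linear model rather than an OCR network, and the function names, learning rate, and batch size of 256 are assumptions chosen for illustration, not values from the dataset authors.

```python
import math
import random


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of weight updates in one full pass over the training set."""
    return math.ceil(num_samples / batch_size)


def train_linear(xs, ys, batch_size=4, epochs=500, lr=0.05, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):                      # one epoch = one pass over the data
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Average gradient of 0.5 * (w*x + b - y)^2 over the batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w                     # step against the gradient
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]        # toy inputs 0.0 .. 1.9
ys = [3.0 * x + 1.0 for x in xs]        # noiseless targets generated with w=3, b=1
w, b = train_linear(xs, ys)
print(round(w, 2), round(b, 2))

# With the 6,757,044 training samples quoted above and an assumed batch size of 256:
print(batches_per_epoch(6_757_044, 256))
```

The weights are updated once per batch, so a smaller batch size means more updates per epoch over the same data.
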
The batch size is the number of samples the network processes before computing the gradients and updating the weights. The size of the training set, divided by the batch size, determines how many batches make up an epoch. A layer of neurons is described by the neurons that generate outputs, the neurons that receive those outputs, and one further parameter. The network cannot learn its hyperparameters, so they must be defined and tuned beforehand. These hyperparameters include the number of layers, the number of neurons in each layer, the batch size, the learning rate, and others. The hyperparameters can be tuned by repeatedly training the network, starting from a small set of initial values, and testing its performance on a validation set. The values are adjusted after each cycle to get the best results in the least amount of time.

Different types of layers, each with its own capabilities, have emerged over the years. A recurrent layer such as LSTM (Long Short-Term Memory) [Hochreiter & Schmidhuber, 1997] lets the network "remember" previously processed data (a character in this study) and learn corrections over an entire sequence (a sentence in this study). The gated recurrent unit (GRU) is another popular layer in neural networks [Cho et al., 2014]. LSTM and GRU layers can process a sequence in either of two directions. Running them in both directions, a method called the bidirectional RNN [Schuster & Paliwal, 1997], enables the layer to pick up on a broader context by correcting based on the words that come after as well as those that came before. The data is passed through the layer twice, once in the conventional direction and once in reverse (end to beginning). A variety of models, such as the sequence-to-sequence model [Sutskever, Vinyals, & Le, 2014] built with an encoder-decoder architecture [Cho et al., 2014], are constructed from these layers.
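
The two-pass idea behind a bidirectional RNN can be sketched as follows. This is a simplified illustration that uses a scalar Elman-style cell in place of LSTM or GRU, with fixed, arbitrarily chosen weights; the function names are assumptions and no training is performed.

```python
import math


def rnn_pass(inputs, w_x, w_h):
    """Run a scalar recurrent cell over a sequence and return every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)   # new state mixes current input and previous state
        states.append(h)
    return states


def bidirectional(inputs, fwd=(0.8, 0.5), bwd=(0.8, 0.5)):
    """For each position, pair a state summarizing the past with one summarizing the future."""
    forward = rnn_pass(inputs, *fwd)
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]   # run end to start, then realign
    return list(zip(forward, backward))


seq = [1.0, 0.0, 0.0, -1.0]
for t, (f, b) in enumerate(bidirectional(seq)):
    print(t, round(f, 3), round(b, 3))
```

At position 0 the forward state has seen only the first input, while the backward state already reflects everything that follows, which is what lets a correction depend on later characters or words.
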
Furthermore, the type and complexity of the task determine the number of layers needed in the encoder and decoder. A "dropout" regularization layer [Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014] is added to prevent overfitting. The vanishing gradient problem [Hochreiter, 1998] is addressed by the gated recurrent layers themselves. To force the network to generalize rather than rely solely on a one-to-one correspondence between input and output, a random subset of neurons is dropped at the beginning of each learning cycle.

Deep learning models using a variety of DNNs have proven highly effective in many NLP tasks in recent years. In particular, these models are very effective at fixing spelling mistakes [Raaijmakers, 2013], translating text [Mokhtar, et al., 2018], and learning editing operations [Cherupara, 2014]. Consequently, using deep learning to fix OCR errors seems a promising direction, particularly the encoder-decoder architecture that is central to the Neural Machine Translation (NMT) approach, which maps (i.e., translates) an input sequence (e.g., a sequence of OCRed characters with errors) to an output sequence. The encoder encodes the sequence into a context vector of fixed size, which the decoder uses to predict the output (for example, the correct sequence of characters). Most solutions now seem to incorporate neural networks, as shown by the results of recent competitions in the field [Chiron, Doucet, & Moreaux, 2017; Rigaud, et al., 2019].

Dataset Content & Diversity: Containing more than 100 images, this Telugu OCR dataset offers a wide distribution of different types of shopping list images. It includes a variety of handwritten text found on shopping lists, including sentences, individual item names, quantities, and comments.
The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations. To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures a diverse range of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that at least 80% of the space in each image contains visible Telugu text. The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.

We have observed that the dictionary can never be completely populated with all the words because of the complex morphology of the language. When two words are joined, the last character of the first word and the first character of the second word combine into another character, while the rest of the characters remain the same. Sandhis and samasas result in countless combinations of words that cannot be included in the dictionary. This produces an effectively unbounded word list, which defeats the purpose of the dictionary. The problem can be overcome by using a sandhi splitter, which breaks a word into its root words; these can then be found easily in the dictionary, making the recognition of words much simpler.

The quality of the image has a large impact on the later stages of character recognition. We have observed that during binarization, characters are broken when strokes are thin and the background is faded. In addition, when the page background is light and characters printed on the back of the page show through to the front, these characters are also picked up on the front page, which garbles the image.
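
The binarization method is not specified in the description above. As one common choice, a global threshold selected with Otsu's method can be sketched as below; the function names and the toy pixel values (ink at 30, show-through at 170, background at 220) are assumptions for illustration only.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t are treated as dark (ink), the rest as background.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue            # no dark pixels yet, nothing to separate
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D grayscale image to 1 (ink) where pixel <= threshold, else 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


image = [
    [220, 30, 220, 170, 220],
    [220, 30, 220, 170, 220],
    [220, 30, 220, 170, 220],
]
t = otsu_threshold([p for row in image for p in row])
print(t)
print(binarize(image, t))
```

In this toy case the ink and the fainter show-through column separate cleanly. A single global threshold has less room to do so when ink is faded or show-through is nearly as dark as the text, which is the situation described above.
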
These binarization problems can be overcome by focusing more on the digital processing of the image. Training the character variants is highly difficult because of the huge number of possible character combinations, and the nuances add to the complexity. One possible solution is the brute-force method of training all variants of characters with a large number of samples per variant, instead of training the vowel modifiers alone. The other approach is to post-process the output to find disconnected character variants and join them.

All these shopping lists were written, and the images captured, by native Telugu speakers to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5 MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

Conclusion: Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Telugu language. Your journey to improved language understanding and processing begins here.
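
The post-processing approach mentioned above, finding disconnected character variants and joining them, could look roughly like the following sketch. It assumes the recognizer or a segmentation step yields bounding boxes as (x0, y0, x1, y1) tuples and that a detached modifier sits above or below its base glyph; the overlap rule, the `min_ratio` parameter, and the function names are hypothetical and not taken from the dataset authors.

```python
def x_overlap(a, b):
    """Width of the horizontal overlap between boxes a and b, each (x0, y0, x1, y1)."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]))


def merge_stacked_boxes(boxes, min_ratio=0.5):
    """Greedily union boxes that overlap horizontally by at least min_ratio of the narrower box."""
    merged = []
    for box in sorted(boxes):               # left to right by x0
        if merged:
            last = merged[-1]
            narrower = min(last[2] - last[0], box[2] - box[0])
            if narrower > 0 and x_overlap(last, box) >= min_ratio * narrower:
                # Replace the previous box with the union of the two.
                merged[-1] = (min(last[0], box[0]), min(last[1], box[1]),
                              max(last[2], box[2]), max(last[3], box[3]))
                continue
        merged.append(box)
    return merged


boxes = [
    (10, 20, 30, 40),   # base glyph
    (14, 8, 26, 18),    # detached mark above the base
    (40, 20, 60, 40),   # next glyph, no horizontal overlap with the first
]
print(merge_stacked_boxes(boxes))
```

A rule this simple would also merge unrelated marks that happen to be vertically aligned, so in practice the threshold would need tuning against real segmentation output.
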

Provider:

IEEE Dataport
