abdur75648/UTRSet-Synth
Source: Hugging Face, updated 2024-01-30, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/abdur75648/UTRSet-Synth
Resource description:
---
title: UrduSet-Synth (UTRNet)
emoji: 📖
colorFrom: red
colorTo: green
license: cc-by-nc-4.0
task_categories:
- image-to-text
language:
- ur
tags:
- ocr
- text recognition
- urdu-ocr
- utrnet
pretty_name: UTRSet-Synth
references:
- https://github.com/abdur75648/UTRNet-High-Resolution-Urdu-Text-Recognition
- https://abdur75648.github.io/UTRNet/
- https://arxiv.org/abs/2306.15782
---
The **UTRSet-Synth** dataset is introduced as a complementary training resource to the [**UTRSet-Real** Dataset](https://paperswithcode.com/dataset/utrset-real), specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
The dataset was generated with a custom-designed synthetic data generation module that offers precise control over variations in crucial factors such as font, text size, colour, resolution, orientation, noise, style, and background. UTRSet-Synth also tackles limitations observed in existing datasets. It addresses the challenge of standardizing fonts by incorporating over 130 diverse Urdu fonts, which were thoroughly refined to ensure consistent rendering schemes. It overcomes the scarcity of Arabic words, numerals, and Urdu digits by including a significant number of samples representing these elements. Additionally, the text is enriched by randomly selecting words from a vocabulary of 100,000 words during generation. As a result, UTRSet-Synth contains a total of 28,187 unique words, with an average word length of 7 characters.
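The generation step described above can be illustrated with a minimal sketch. This is not the authors' module: the names `RenderConfig`, `sample_config`, `sample_line`, and `generate_specs`, the font names, and every numeric range are hypothetical placeholders chosen for illustration. The sketch only shows the idea of drawing a random rendering configuration and a random word sequence for each synthetic line; the actual rendering of text into images is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderConfig:
    """Hypothetical per-line rendering parameters (names and ranges are assumptions)."""
    font: str
    text_size: int        # in points
    colour: tuple         # RGB text colour
    rotation_deg: float   # small orientation jitter
    noise_std: float      # strength of additive noise


def sample_config(fonts, rng):
    """Draw one random rendering configuration from a list of font names."""
    return RenderConfig(
        font=rng.choice(fonts),
        text_size=rng.randint(24, 64),
        colour=tuple(rng.randint(0, 80) for _ in range(3)),  # dark text
        rotation_deg=rng.uniform(-2.0, 2.0),
        noise_std=rng.uniform(0.0, 0.05),
    )


def sample_line(vocabulary, rng, min_words=3, max_words=10):
    """Build one text line by picking words at random from the vocabulary."""
    n_words = rng.randint(min_words, max_words)
    return " ".join(rng.choice(vocabulary) for _ in range(n_words))


def generate_specs(vocabulary, fonts, n_lines, seed=0):
    """Return (text, config) pairs; a renderer would turn each pair into an image."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    return [(sample_line(vocabulary, rng), sample_config(fonts, rng))
            for _ in range(n_lines)]
```

Calling `generate_specs(["کتاب", "قلم", "123"], ["FontA", "FontB"], n_lines=5)` would yield five text/configuration pairs. Seeding the generator keeps the sampled lines reproducible, which matters when a synthetic set is meant to be regenerated or extended.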
By closely emulating real-world variations, UTRSet-Synth addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. It gives researchers a resource for developing and benchmarking Urdu OCR models, promoting standardized evaluation and reproducibility and fostering advancements in Urdu OCR. For more details about the [UTRSet-Real](https://paperswithcode.com/dataset/utrset-real) and [UTRSet-Synth](https://paperswithcode.com/dataset/utrset-synth) datasets, please refer to the paper ["UTRNet: High-Resolution Urdu Text Recognition In Printed Documents"](https://arxiv.org/abs/2306.15782).
Provided by:
abdur75648
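Summary of original information
UTRSet-Synth dataset overview
Dataset introduction
UTRSet-Synth is a synthetic dataset designed to improve the effectiveness of Urdu OCR models. It serves as a complementary training resource to the UTRSet-Real dataset and contains 20,000 lines of high-quality synthetic data that closely resemble real-world Urdu text.
Dataset generation
The dataset was generated with a custom-designed synthetic data generation module that provides precise control over key factors such as font, text size, colour, resolution, orientation, noise, style, and background. UTRSet-Synth also addresses limitations of existing datasets by including over 130 diverse Urdu fonts, which were thoroughly refined to ensure consistent rendering schemes.
Dataset features
- Font diversity: includes over 130 Urdu fonts.
- Content richness: overcomes the scarcity of Arabic words, numerals, and Urdu digits by including a large number of samples representing these elements.
- Vocabulary diversity: words are randomly selected from a 100,000-word vocabulary during text generation; the dataset contains 28,187 unique words, with an average word length of 7 characters.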
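Vocabulary statistics like the ones quoted above (unique word count and average word length) could be recomputed from a list of transcription lines along these lines. This is a sketch under assumptions: the function name `vocabulary_stats` is hypothetical, words are split on whitespace, length is counted in Unicode code points, and the mean is taken over unique words. The source does not state which conventions the authors used.

```python
def vocabulary_stats(lines):
    """Return (number of unique words, mean length of the unique words)."""
    unique_words = set()
    for line in lines:
        # Whitespace tokenisation is an assumption, not the authors' stated method.
        unique_words.update(line.split())
    if not unique_words:
        return 0, 0.0
    # len() counts Unicode code points, so combining diacritics count as characters.
    mean_len = sum(len(word) for word in unique_words) / len(unique_words)
    return len(unique_words), mean_len
```

For example, `vocabulary_stats(labels)` on a list of ground-truth text lines returns a pair that can be compared against the figures reported for the dataset, keeping in mind that a different tokenisation or averaging convention would shift the numbers.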
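Dataset significance
By emulating real-world variations, UTRSet-Synth addresses the scarcity of comprehensive real-world printed Urdu OCR datasets. It gives researchers a resource for developing and benchmarking Urdu OCR models, and it promotes standardized evaluation, reproducibility, and progress in the field.
References
For more details about the UTRSet-Real and UTRSet-Synth datasets, please refer to the paper "UTRNet: High-Resolution Urdu Text Recognition In Printed Documents".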



