
Nexdata/39993_Images_OCR_Data_of_Internet_Image

Source: Hugging Face. Updated 2024-04-11. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Nexdata/39993_Images_OCR_Data_of_Internet_Image
Resource description:
---
license: cc-by-nc-nd-4.0
---

## Description

39,993 Images – OCR Data of Internet Image. The collection scenes of this dataset include subtitles, advertisements, cellphone screenshots, comics, emoticons, posters, magazine covers, etc. The languages are Chinese and a small amount of English. Most images are annotated with line-level rectangular bounding boxes and text transcriptions; a small amount of data uses column-level quadrilateral bounding boxes and text transcriptions instead. The dataset can be used for OCR tasks on internet images. For more details, please refer to the link: https://www.nexdata.ai/dataset/171?source=Huggingface

## Data size

39,993 images, 227,910 bounding boxes

## Collecting environment

Subtitles, advertisements, cellphone screenshots, comics, emoticons, posters, magazine covers, etc.

## Data diversity

Multiple types of internet images

## Language distribution

Chinese, English (a few)

## Data format

Images are in .jpg format; annotation files are in .json format

## Annotation content

Line-level rectangular bounding box annotation and transcription of the texts (column-level quadrilateral bounding box annotation and transcription of the texts for a small amount of data)

## Accuracy

A rectangular bounding box is a qualified annotation when the error of each vertex is within 5 pixels; the accuracy of bounding boxes is not less than 97%; the text transcription accuracy is not less than 97%

# Licensing Information

Commercial License
Provider:
Nexdata
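Summary of original information

Dataset overview

Dataset description

  • Number of images: 39,993
  • Data source: internet images, including subtitles, advertisements, cellphone screenshots, comics, emoticons, posters, magazine covers, etc.
  • Language distribution: mainly Chinese, with a small amount of English
  • Annotation method: mainly line-level rectangular bounding boxes with text transcription; a small amount of data uses column-level quadrilateral bounding boxes with text transcription
  • Use case: OCR tasks on internet images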

Data size

  • Number of images: 39,993
  • Number of bounding boxes: 227,910
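
As a quick sanity check on the two totals above, the average annotation density can be computed directly. This is only arithmetic on the stated figures; the source does not give the per-image distribution, so the result says nothing about how unevenly boxes are spread across images.

```python
# Totals as stated in the dataset description.
NUM_IMAGES = 39_993
NUM_BOXES = 227_910


def mean_boxes_per_image(num_boxes: int, num_images: int) -> float:
    """Average number of bounding boxes per image."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return num_boxes / num_images


if __name__ == "__main__":
    density = mean_boxes_per_image(NUM_BOXES, NUM_IMAGES)
    print(f"{density:.2f} bounding boxes per image on average")
```

The totals work out to roughly 5.7 text boxes per image.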

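Collecting environment

  • Covers a variety of internet image settings, such as subtitles, advertisements, cellphone screenshots, comics, emoticons, posters, magazine covers, etc.

Data diversity

  • Includes multiple types of internet images

Language distribution

  • Primary language: Chinese
  • Secondary language: English (a small amount)

Data format

  • Image format: .jpg
  • Annotation file format: .json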
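
The listing states only the two file extensions, not how the files are laid out. The sketch below pairs each image with an annotation file under one assumption that is not confirmed by the source: each .jpg has a .json with the same base name in the same folder. The function names `pair_images_with_annotations` and `load_annotation` are hypothetical helpers, and the JSON schema itself is left uninterpreted.

```python
import json
from pathlib import Path
from typing import Any, List, Tuple


def pair_images_with_annotations(root: Path) -> List[Tuple[Path, Path]]:
    """Return (image, annotation) pairs for every .jpg with a same-named .json.

    Assumption (not from the source): image and annotation share a base name
    and live in the same directory. Images without a matching .json are skipped.
    """
    pairs = []
    for image_path in sorted(root.rglob("*.jpg")):
        annotation_path = image_path.with_suffix(".json")
        if annotation_path.is_file():
            pairs.append((image_path, annotation_path))
    return pairs


def load_annotation(annotation_path: Path) -> Any:
    """Parse one annotation file; the structure of the result is not assumed."""
    with annotation_path.open(encoding="utf-8") as f:
        return json.load(f)
```

If the actual release uses a different layout, for example one JSON file covering many images, only the pairing function would need to change.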

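Annotation content

  • Mainly line-level rectangular bounding box annotation with text transcription
  • A small amount of data uses column-level quadrilateral bounding box annotation with text transcription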
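
Because the dataset mixes rectangular and quadrilateral boxes, downstream code usually wants one common representation. The following is a minimal sketch under assumed conventions that the source does not specify: rectangles given as (x_min, y_min, x_max, y_max), quadrilaterals as four (x, y) vertices, and image coordinates with y pointing down. The names `rect_to_quad` and `quad_to_rect` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_to_quad(x_min: float, y_min: float, x_max: float, y_max: float) -> List[Point]:
    """Express an axis-aligned rectangle as four vertices.

    Order: top-left, top-right, bottom-right, bottom-left
    (clockwise on screen, since y grows downward in image coordinates).
    """
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]


def quad_to_rect(quad: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing a quadrilateral."""
    if len(quad) != 4:
        raise ValueError("a quadrilateral needs exactly four vertices")
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))
```

Converting every box to four vertices keeps the quadrilateral annotations lossless, while `quad_to_rect` is the lossy direction for models that only accept axis-aligned boxes.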

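Accuracy

  • Bounding box accuracy: the error of each vertex is within 5 pixels; bounding box accuracy is not less than 97%
  • Text transcription accuracy: not less than 97%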
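
The accuracy criterion above can be read as a per-box pass/fail rule followed by a pass rate. The sketch below is one possible interpretation, not the provider's published procedure: it assumes vertex error is the Euclidean distance to a reference vertex, that both boxes list their vertices in the same order, and that reference boxes are available from some separate review pass. `box_is_qualified` and `box_accuracy` are hypothetical names.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]

PIXEL_TOLERANCE = 5.0  # stated bound on the error of each vertex
MIN_BOX_ACCURACY = 0.97  # stated lower bound on bounding box accuracy


def box_is_qualified(
    annotated: Sequence[Point],
    reference: Sequence[Point],
    tolerance: float = PIXEL_TOLERANCE,
) -> bool:
    """True when every annotated vertex lies within `tolerance` pixels of its reference vertex.

    Assumption: error is measured as Euclidean distance and vertices are in matching order.
    """
    if len(annotated) != len(reference):
        raise ValueError("boxes must have the same number of vertices")
    return all(math.dist(a, r) <= tolerance for a, r in zip(annotated, reference))


def box_accuracy(pairs: Iterable[Tuple[Sequence[Point], Sequence[Point]]]) -> float:
    """Fraction of (annotated, reference) box pairs that are qualified."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one box pair")
    qualified = sum(box_is_qualified(a, r) for a, r in pairs)
    return qualified / len(pairs)
```

Under this reading, a sample of boxes meets the stated bar when `box_accuracy(sample) >= MIN_BOX_ACCURACY`.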

Licensing information

  • License: Commercial License