
Webvicob

Source: arXiv | Updated: 2023-05-02 | Indexed: 2024-06-21
Download link:
https://github.com/clovaai/webvicob
Overview:
The Webvicob dataset was created by NAVER CLOVA to improve visual document understanding (VDU) through self-supervised learning. It is a large-scale multilingual visual corpus built from Wikipedia HTML dumps, with more than 18 million entries in languages including English, Chinese, and Japanese. Webvicob generates its data from the HTML dumps through DOM modifications and provides text and image annotations at the character, word, line, and paragraph levels. The dataset is intended for training VDU models on document-image tasks such as Document Visual Question Answering (DocVQA) and post-OCR parsing, and models trained on it are reported to perform better than those trained on conventional datasets across multiple tasks.
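The four annotation levels mentioned above (character, word, line, paragraph) form a natural hierarchy. The following is a minimal sketch of how such nested annotations could be represented, assuming axis-aligned pixel boxes. The names `Box`, `Node`, `union`, `group`, and `chars` are hypothetical and chosen for illustration only; this is not Webvicob's actual file format or API, which is documented in the repository linked above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Node:
    """One annotation at a level: 'char', 'word', 'line', or 'paragraph'."""
    level: str
    text: str
    box: Box
    children: List["Node"]


def union(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


def group(level: str, children: List[Node], sep: str = "") -> Node:
    """Build a parent annotation whose text and box derive from its children."""
    return Node(
        level=level,
        text=sep.join(c.text for c in children),
        box=union([c.box for c in children]),
        children=children,
    )


def chars(text: str, x: float, y: float, w: float = 10.0, h: float = 12.0) -> List[Node]:
    """Lay out fixed-width character boxes starting at (x, y); illustration only."""
    return [
        Node("char", ch, Box(x + i * w, y, x + (i + 1) * w, y + h), [])
        for i, ch in enumerate(text)
    ]


# Characters roll up into words, words into a line, lines into a paragraph.
word_a = group("word", chars("Web", 0, 0))
word_b = group("word", chars("page", 40, 0))
line = group("line", [word_a, word_b], sep=" ")
paragraph = group("paragraph", [line], sep="\n")
```

Deriving each parent box from its children keeps the levels consistent by construction: a word box always encloses its character boxes, and a line box always encloses its word boxes.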
Provider:
NAVER CLOVA
Created:
2022-11-07