pixparse/pdfa-eng-wds

Name: pixparse/pdfa-eng-wds
Creator: pixparse
Published: 2024-03-29 17:19:37
License: No description available

Source: Hugging Face. Updated 2024-03-29. Indexed 2024-04-19.

Download link:

https://hf-mirror.com/datasets/pixparse/pdfa-eng-wds

Resource overview:

The PDFA dataset is a document dataset filtered from the SafeDocs corpus, intended primarily for training vision-language models. Each PDF document is paired with a JSON file that holds its OCR annotations and metadata. Files that were too large or too slow to render were removed during filtering, and only English-language documents were kept. The dataset is distributed in the webdataset format, which suits large-scale multimodal machine learning workloads.
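The pairing of each PDF with its JSON file can be illustrated with a minimal sketch. In the webdataset format, a shard is a tar archive, and files that share a basename but differ in extension belong to one sample. The example below uses only the Python standard library and builds a tiny in-memory shard rather than downloading anything. The file names (`doc0`, `doc1`) and the JSON fields (`language`, `pages`) are hypothetical placeholders, not the dataset's actual schema.

```python
import io
import json
import posixpath
import tarfile


def build_toy_shard():
    """Build a tiny in-memory tar shard with hypothetical file names and contents."""
    files = [
        ("doc0.pdf", b"%PDF-1.4 toy bytes"),
        ("doc0.json", json.dumps({"language": "en", "pages": 1}).encode()),
        ("doc1.pdf", b"%PDF-1.4 more toy bytes"),
        ("doc1.json", json.dumps({"language": "en", "pages": 3}).encode()),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_samples(tar_bytes):
    """Group tar members by basename so each key maps extension -> raw bytes."""
    samples = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, base = posixpath.split(member.name)
            # The sample key is everything before the first dot of the basename.
            stem, _, ext = base.partition(".")
            key = posixpath.join(dirname, stem)
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


samples = read_samples(build_toy_shard())
for key, parts in samples.items():
    meta = json.loads(parts["json"])  # hypothetical metadata fields
    print(key, len(parts["pdf"]), "PDF bytes,", meta["pages"], "page(s)")
```

For real shards, a streaming loader such as the `webdataset` library would likely be preferable to reading whole archives into memory; the grouping logic here only shows how the PDF and JSON members relate.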

Provider:

pixparse

Summary of original information