albertklorer/safedocs
Source: Hugging Face. Updated 2025-11-22; indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/albertklorer/safedocs
Description:
The SafeDocs dataset contains 1.5 million document pages from the SafeDocs Common Crawl collection. Each page has been processed with OCR to produce word-level bounding boxes. The page images have been resized to a maximum of 1024x1024 pixels and are heavily compressed. Bounding-box coordinates are expressed in the original (pre-resize) image dimensions, not those of the resized image. OCR was performed using the python-doctr library. The pages have been filtered to keep English and English-multilingual documents. Please note that the pages have not been filtered for content safety and may include adult, violent, sensitive, or otherwise offensive material. The source PDFs were originally crawled from the public web; content may be proprietary or copyrighted and may include personal or sensitive information. Users are responsible for compliance with applicable laws, licenses, and terms of use.
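Because the bounding boxes refer to the original page size while the distributed images are resized, a box has to be rescaled before it can be drawn on the image that ships with the dataset. The following is a minimal sketch of that conversion, not the dataset author's code. It assumes boxes are absolute pixel coordinates in `(x0, y0, x1, y1)` order and that the original page width and height are available for each page; the function name `scale_box` and its parameters are hypothetical and do not correspond to any documented field of the dataset.

```python
from typing import Tuple

# Assumed layout: (x0, y0, x1, y1) in absolute pixels of the ORIGINAL page image.
Box = Tuple[float, float, float, float]
Size = Tuple[int, int]  # (width, height)


def scale_box(box: Box, orig_size: Size, resized_size: Size) -> Box:
    """Map a box from original-image pixels to resized-image pixels.

    Hypothetical helper: the dataset's real column names and box format
    are not stated in the description and may differ.
    """
    x0, y0, x1, y1 = box
    orig_w, orig_h = orig_size
    new_w, new_h = resized_size
    if orig_w <= 0 or orig_h <= 0:
        raise ValueError("original size must be positive")
    # Separate factors per axis, in case rounding during the resize made
    # the horizontal and vertical scales differ slightly.
    sx = new_w / orig_w
    sy = new_h / orig_h
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


if __name__ == "__main__":
    # A 2048x1536 page resized to 1024x768 halves every coordinate.
    print(scale_box((100, 200, 300, 400), (2048, 1536), (1024, 768)))
```

If the boxes turn out to be stored as relative coordinates in the range 0 to 1, the same idea applies with the original size replaced by `(1, 1)`.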
Provider:
albertklorer



