U4R/DocGenome
Source: Hugging Face. Updated: 2024-12-18. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/U4R/DocGenome
Description:
DocGenome is an open, large-scale scientific document dataset for training and testing multimodal large language models. It contains 500K scientific documents spanning 153 disciplines from the arXiv open-access community, built with a custom auto-labeling pipeline, DocParser. The dataset is characterized by completeness, logicality, diversity, and correctness, and it supports a range of document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA, and multi-page QA.
Provider:
U4R



