Doc-750K
Source: ModelScope community. Updated 2026-01-06; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/OpenGVLab/Doc-750K
Description:
This dataset was introduced in the paper [Docopilot: Improving Multimodal Models for Document-Level Understanding](https://arxiv.org/abs/2507.14675).
Please refer to https://github.com/OpenGVLab/Docopilot for details.
## FAQ
### Unzipping Split Archives on Linux
If you encounter issues when unzipping the image archive on Linux, such as:
- zip bomb warnings
- bad zipfile offset errors
Please try the following solutions:
1. Zip Bomb Warning
Some systems may trigger a zip bomb detection warning due to the large number of small image files. You can bypass this by disabling the detection with:
```bash
export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE
```
2. Bad Zipfile Offset Error
If you're dealing with split zip archives (e.g., images.z01, images.z02, ..., images.zip), you need to merge them before unzipping:
```bash
zip -s 0 images.zip --out images_full.zip
unzip images_full.zip
```
This will reconstruct the full archive and allow you to unzip it normally.
Note: The image dataset is very large, so please ensure you have sufficient disk space and patience during extraction.
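Before merging, it can help to confirm that all split parts are present and to see how much free space the target filesystem has. The following is a minimal sketch, not part of the dataset's own tooling: `check_split_parts` and `free_kb` are hypothetical helper names, and the `images` base name is taken from the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the dataset) for a pre-merge sanity check.

# Print the number of numbered segments (.z01, .z02, ...) found for a split archive.
# Returns non-zero if the final segment (<base>.zip) is missing.
check_split_parts() {
    local dir="$1" base="$2"
    if [ ! -f "$dir/$base.zip" ]; then
        echo "missing $dir/$base.zip" >&2
        return 1
    fi
    local count=0 part
    for part in "$dir/$base".z[0-9]*; do
        # An unmatched glob stays literal, so test that the path is a real file.
        if [ -f "$part" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print the free space, in kilobytes, on the filesystem that holds the given directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}
```

For example, `check_split_parts . images` prints how many numbered segments sit next to `images.zip`, and `free_kb .` prints the free space in kilobytes. Keep in mind that the merge step writes a second full copy of the archive (`images_full.zip`) before anything is extracted, so the space needed is larger than the extracted images alone.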
Provider:
maas
Created:
2025-07-20



