
PeiyangLiu/wiki-coe

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/PeiyangLiu/wiki-coe
Description:
---
license: mit
task_categories:
- visual-question-answering
- document-question-answering
language:
- en
tags:
- multi-hop
- visual-attribution
- chain-of-evidence
- document-understanding
size_categories:
- 10K<n<100K
---

# Wiki-CoE

Wiki-CoE is a multi-hop visual QA dataset built from Wikipedia article screenshots for the Chain-of-Evidence (CoE) framework. Each example contains a question, a gold answer, an evidence chain of (image, bounding box, sub-query) hops, and a bag of candidate screenshots in original resolution.

## Contents

The dataset is distributed as a single zstd-compressed tarball (`wiki_coe_full.tar.zst`, ~116 GB) split into 3 parts to satisfy the Hugging Face 50 GB per-file limit:

| File                           | Size                       |
|--------------------------------|----------------------------|
| `wiki_coe_full.tar.zst.part00` | 40 GB                      |
| `wiki_coe_full.tar.zst.part01` | 40 GB                      |
| `wiki_coe_full.tar.zst.part02` | 36 GB                      |
| `wiki_coe_full.md5`            | md5 of reassembled tarball |

After extraction you get:

```
wiki_coe/
├── screenshots/         # 151,988 PNG screenshots (~127 GB raw)
├── bbox_annotations/    # per-page bbox JSONs
├── train.jsonl          # CoE training samples
├── val.jsonl            # validation samples
└── test.jsonl           # test samples
```

## Reassemble & extract

```bash
# 1. Download all three parts (and the md5 file)
# 2. Concatenate them in order:
cat wiki_coe_full.tar.zst.part* > wiki_coe_full.tar.zst

# 3. (Optional) verify integrity:
md5sum -c wiki_coe_full.md5

# 4. Extract (requires zstd):
tar -I 'zstd -d' -xf wiki_coe_full.tar.zst
```

## Citation

If you use this dataset, please cite the Chain-of-Evidence paper.
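The reassembly steps above write a joined ~116 GB tarball next to the parts before anything is extracted. Where disk space is tight, a streaming variant may help. The following is a minimal sketch, not part of the dataset's official instructions: the function names `verify_parts` and `extract_parts` are hypothetical, and it assumes the `.md5` file is in standard `md5sum` format, with the hex digest as the first field of the first line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the dataset).

# Check the md5 of the concatenated parts without writing the joined file.
# Assumption: the first field of the first line of the .md5 file is the digest.
verify_parts() {
  local prefix="$1" md5_file="$2"
  local expected actual
  expected="$(awk 'NR==1 {print $1}' "$md5_file")"
  # The glob expands in lexical order: part00, part01, part02.
  actual="$(cat "$prefix".part* | md5sum | awk '{print $1}')"
  [ "$expected" = "$actual" ]
}

# Decompress and untar straight from the parts into a destination directory.
# Requires zstd and tar on PATH.
extract_parts() {
  local prefix="$1" dest="${2:-.}"
  mkdir -p "$dest"
  cat "$prefix".part* | zstd -dc | tar -xf - -C "$dest"
}
```

After sourcing these definitions, a typical call would be `verify_parts wiki_coe_full.tar.zst wiki_coe_full.md5 && extract_parts wiki_coe_full.tar.zst .`. The trade-off is that the parts are read twice (once for the checksum, once for extraction) in exchange for not storing a second copy of the archive.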
Provided by:
PeiyangLiu