PeiyangLiu/wiki-coe
Source: Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/PeiyangLiu/wiki-coe
Description:
---
license: mit
task_categories:
- visual-question-answering
- document-question-answering
language:
- en
tags:
- multi-hop
- visual-attribution
- chain-of-evidence
- document-understanding
size_categories:
- 10K<n<100K
---
# Wiki-CoE
Wiki-CoE is a multi-hop visual QA dataset built from Wikipedia article
screenshots for the Chain-of-Evidence (CoE) framework. Each example contains
a question, a gold answer, an evidence chain of (image, bounding box,
sub-query) hops, and a bag of candidate screenshots in original resolution.
## Contents
The dataset is distributed as a single zstd-compressed tarball
(`wiki_coe_full.tar.zst`, ~116 GB) split into 3 parts to satisfy the
Hugging Face 50 GB per-file limit:
| File                           | Size / contents                         |
|--------------------------------|-----------------------------------------|
| `wiki_coe_full.tar.zst.part00` | 40 GB                                   |
| `wiki_coe_full.tar.zst.part01` | 40 GB                                   |
| `wiki_coe_full.tar.zst.part02` | 36 GB                                   |
| `wiki_coe_full.md5`            | MD5 checksum of the reassembled tarball |
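
The three parts add up to about 116 GB, the concatenated tarball needs the same again, and the extracted screenshots take roughly 127 GB more, so it can be worth checking free space before starting. A minimal sketch, assuming a POSIX `df`; `has_free_gb` is a hypothetical helper name, not something shipped with the dataset:

```bash
# has_free_gb DIR NEEDED_GB
# Succeeds when the filesystem holding DIR has at least NEEDED_GB GiB free.
# Hypothetical helper for illustration; not part of the dataset.
has_free_gb() {
  local dir="$1" needed_gb="$2" avail_kb
  # -P forces the portable one-line-per-filesystem format, -k reports KiB;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}
```

For example, `has_free_gb . 359 || echo "not enough free space" >&2` covers parts, tarball, and screenshots together (116 + 116 + 127). The figure is approximate, since the card does not state the size of the annotation and JSONL files.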
After extraction you get:
```
wiki_coe/
├── screenshots/ # 151,988 PNG screenshots (~127 GB raw)
├── bbox_annotations/ # per-page bbox JSONs
├── train.jsonl # CoE training samples
├── val.jsonl # validation samples
└── test.jsonl # test samples
```
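
A quick way to confirm that an extraction matches this layout is to check that each entry exists and to count the files. The sketch below is illustrative only: `check_layout` is a hypothetical helper name, and it assumes the screenshots use a lowercase `.png` extension and that each JSONL file holds one sample per line.

```bash
# check_layout [ROOT]
# Verifies that ROOT (default: wiki_coe) contains the entries listed above and
# prints simple counts. Hypothetical helper for illustration only.
check_layout() {
  local root="${1:-wiki_coe}" entry missing=0 n
  for entry in screenshots bbox_annotations train.jsonl val.jsonl test.jsonl; do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $root/$entry" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 1

  # Assumption: screenshots carry a lowercase .png extension.
  n=$(find "$root/screenshots" -type f -name '*.png' | wc -l)
  echo "png_count=$((n))"

  # JSONL holds one record per line, so the line count is the sample count
  # (assuming each file ends with a newline).
  for entry in train.jsonl val.jsonl test.jsonl; do
    n=$(wc -l < "$root/$entry")
    echo "$entry=$((n))"
  done
}
```

On a complete extraction, `png_count` should match the 151,988 screenshots listed above, provided the extension assumption holds.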
## Reassemble & extract
```bash
# 1. Download all three parts (and the md5 file)
# 2. Concatenate them in order:
cat wiki_coe_full.tar.zst.part* > wiki_coe_full.tar.zst
# 3. (Optional) verify integrity:
md5sum -c wiki_coe_full.md5
# 4. Extract (requires zstd; the -I option here is GNU tar syntax):
tar -I 'zstd -d' -xf wiki_coe_full.tar.zst
```
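
If disk space is tight, the intermediate 116 GB tarball can be skipped by piping the parts straight into `zstd` and `tar`. This is a sketch of an alternative to the steps above, not a procedure the dataset card prescribes. `extract_streaming` is a hypothetical helper name, and because the reassembled file is never written to disk, the `md5sum -c` check from step 3 does not apply to this route.

```bash
# extract_streaming [PARTS_DIR] [DEST_DIR]
# Concatenates the three parts in order and decompresses/extracts on the fly.
# Hypothetical helper for illustration; requires bash, zstd, and tar.
extract_streaming() {
  local src="${1:-.}" dest="${2:-.}" base="wiki_coe_full.tar.zst" part
  for part in part00 part01 part02; do
    if [ ! -f "$src/$base.$part" ]; then
      echo "missing: $src/$base.$part" >&2
      return 1
    fi
  done
  mkdir -p "$dest" || return 1
  # The subshell keeps pipefail local, so a failure in cat or zstd is
  # reported instead of only the exit status of tar.
  (
    set -o pipefail
    cat "$src/$base.part00" "$src/$base.part01" "$src/$base.part02" \
      | zstd -dc \
      | tar -xf - -C "$dest"
  )
}
```

Calling `extract_streaming . /data` would unpack `wiki_coe/` under `/data` (an example path). The trade-off is losing the checksum step, so the file-by-file route above remains the safer default when space allows.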
## Citation
If you use this dataset, please cite the Chain-of-Evidence paper.
Provider:
PeiyangLiu



