Chrisyichuan/moca-colpali-training
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Chrisyichuan/moca-colpali-training
Description:
---
license: mit
task_categories:
- image-retrieval
- question-answering
language:
- en
pretty_name: MOCA ColPali Training
size_categories:
- 100K<n<1M
---
# Chrisyichuan/moca-colpali-training
MOCA ColPali contrastive training data with hard negatives.
## Contents
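- `moca_colpali_converted.jsonl`: metadata file with one query, one positive image, and its hard negatives per row
- `images/`: all referenced images (created by extracting the tar shards under `image_shards/`, see Image Storage)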
Each row of `moca_colpali_converted.jsonl` has the form:
```json
{
  "query": "...",
  "chunk_path": "images/...",
  "neg_chunk_paths": ["images/...", "images/..."],
  "source_positive_rank": 0,
  "source_positive_score": 0.0,
  "source_dataset": "moca"
}
```
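A minimal sketch of reading the metadata file into typed records, assuming each line of the JSONL file is one JSON object with the fields shown above. The `Row` dataclass and the `load_rows` helper are hypothetical names introduced here for illustration; they are not part of the dataset's tooling.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Row:
    query: str
    chunk_path: str             # positive image path, as written in the metadata
    neg_chunk_paths: List[str]  # hard-negative image paths
    source_dataset: str


def load_rows(jsonl_path):
    """Parse one Row per non-empty line of the metadata JSONL file."""
    rows = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            obj = json.loads(line)
            rows.append(
                Row(
                    query=obj["query"],
                    chunk_path=obj["chunk_path"],
                    neg_chunk_paths=list(obj.get("neg_chunk_paths", [])),
                    source_dataset=obj.get("source_dataset", ""),
                )
            )
    return rows
```

Each `Row` then gives a (query, positive, negatives) triple that a contrastive training loop can consume directly.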
## Summary
- rows: 118195
- unique images: 118195
- avg negatives/row: 2.00
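
The figures above can be rechecked against a local copy with a sketch like the following. The `summarize` function is a hypothetical helper, and it assumes that "unique images" counts distinct paths across both `chunk_path` and `neg_chunk_paths`; the card does not say how that number was computed, so the result may differ if only positives were counted.

```python
import json


def summarize(jsonl_path):
    """Return row count, distinct image paths, and mean negatives per row."""
    n_rows = 0
    n_negs = 0
    images = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            negs = row.get("neg_chunk_paths", [])
            n_rows += 1
            n_negs += len(negs)
            # Assumption: positives and negatives both count toward unique images.
            images.add(row["chunk_path"])
            images.update(negs)
    return {
        "rows": n_rows,
        "unique_images": len(images),
        "avg_negatives_per_row": n_negs / n_rows if n_rows else 0.0,
    }
```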
## Download
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Chrisyichuan/moca-colpali-training", repo_type="dataset", local_dir="data/moca-colpali-training")
```
## Image Storage
Images are stored as **119 tar shards** under `image_shards/` for fast download.
After cloning or downloading the repository, extract the images. The command below passes `--dataset-dir .`, so run it from the dataset root:
```bash
python extract_hf_image_shards.py --dataset-dir .
```
This creates `images/` with all referenced image files.
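
After extraction, it can be worth confirming that every path referenced in the metadata resolves to a file. The sketch below assumes the `images/...` paths in the JSONL are relative to the dataset root; `find_missing_images` is a hypothetical helper, not something shipped with the dataset.

```python
import json
from pathlib import Path


def find_missing_images(dataset_dir, metadata_name="moca_colpali_converted.jsonl"):
    """Return the sorted list of referenced image paths that do not exist on disk."""
    root = Path(dataset_dir)
    missing = set()
    with (root / metadata_name).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            # Check the positive image and every hard negative.
            for rel in [row["chunk_path"], *row.get("neg_chunk_paths", [])]:
                if not (root / rel).is_file():
                    missing.add(rel)
    return sorted(missing)
```

An empty return value means all positives and hard negatives were found; a non-empty list usually points to shards that were not downloaded or not yet extracted.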
Provider:
Chrisyichuan



