Chrisyichuan/screenshot-training
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Chrisyichuan/screenshot-training
Resource description:
---
license: mit
task_categories:
- image-retrieval
- question-answering
language:
- en
pretty_name: screenshot-training
size_categories:
- 10K<n<100K
---
# Chrisyichuan/screenshot-training
A training dataset for Wikipedia screenshot retrieval, exported from a local hard-negative mining run.
## Contents
- `train.jsonl` / `train_hn.jsonl`
- `eval.jsonl` / `eval_hn.jsonl`
- `test.jsonl` / `test_hn.jsonl`
- `images/`
Each metadata row has the form:
```json
{
"query": "...",
"chunk_path": "images/shard_123/shard_00001/123456.png.tiles/chunk_0000_00.png",
"neg_chunk_paths": [
"images/shard_234/shard_00002/234567.png.tiles/chunk_0000_01.png"
],
"split": "train"
}
```
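The row format above can be read with the standard library alone. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `iter_rows` and `resolve_paths` and the output keys `positive` and `negatives` are hypothetical, and it assumes every row carries the `query` and `chunk_path` fields shown in the example.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_rows(jsonl_path: Path) -> Iterator[dict]:
    """Yield one metadata row per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def resolve_paths(row: dict, dataset_root: Path) -> dict:
    """Return the row with image paths joined onto the dataset root.

    The output keys "positive" and "negatives" are illustrative names,
    not fields defined by the dataset.
    """
    return {
        "query": row["query"],
        "positive": dataset_root / row["chunk_path"],
        "negatives": [dataset_root / p for p in row.get("neg_chunk_paths", [])],
        "split": row.get("split"),
    }
```

For example, `[resolve_paths(r, Path(".")) for r in iter_rows(Path("train_hn.jsonl"))]` would give one query, one positive path, and a list of negative paths per row, assuming the files sit in the current directory.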
## Split sizes
- train: 44118
- eval: 2451
- test: 2451
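As a quick check on these counts, the sketch below computes the total and each split's share. It uses only the numbers listed above; the variable names are illustrative.

```python
# Split sizes as listed in the dataset card.
split_sizes = {"train": 44118, "eval": 2451, "test": 2451}

total = sum(split_sizes.values())
fractions = {name: n / total for name, n in split_sizes.items()}

for name, n in split_sizes.items():
    print(f"{name}: {n} rows ({fractions[name]:.1%})")
print(f"total: {total} rows")
```

The counts add up to 49020 rows, which corresponds to a 90% / 5% / 5% split.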
## Notes
- Image paths are stored relative to the dataset root.
- The source images were deduplicated before export, so a hard negative that appears in multiple rows is uploaded only once.
- This export was prepared from the first 5 filtered hard-negative chunks.
## Image Storage
The images are packed into `1000` tar shards under `image_shards/`, which keeps
the repository file count low and makes uploads and downloads more reliable.
To materialize the images locally after download, run:
```bash
```bash
python extract_hf_image_shards.py --dataset-dir .
```
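The contents of `extract_hf_image_shards.py` are not shown here. If the script is unavailable, unpacking could look roughly like the sketch below. It is an illustration under assumptions, not the author's script: it assumes the shards are plain `.tar` files directly under `image_shards/` and that member names are already relative to the dataset root (for example `images/...`). The function name `extract_shards` is hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def extract_shards(dataset_dir: Path, shard_subdir: str = "image_shards") -> int:
    """Unpack every *.tar shard into dataset_dir; return the number of files written.

    Assumes member names are relative to the dataset root (an assumption,
    not something the dataset card states).
    """
    dest_root = dataset_dir.resolve()
    extracted = 0
    for shard in sorted((dataset_dir / shard_subdir).glob("*.tar")):
        with tarfile.open(shard) as tar:
            for member in tar.getmembers():
                target = (dest_root / member.name).resolve()
                # Skip directories, links, and anything escaping the root.
                if not member.isfile() or dest_root not in target.parents:
                    continue
                target.parent.mkdir(parents=True, exist_ok=True)
                src = tar.extractfile(member)
                with src, target.open("wb") as out:
                    shutil.copyfileobj(src, out)
                extracted += 1
    return extracted
```

Calling `extract_shards(Path("."))` from the dataset root would then write the files under whatever directory names the shard members carry.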
Provider: Chrisyichuan



