Chrisyichuan/screenshot-training-natural-filtered-v2
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Chrisyichuan/screenshot-training-natural-filtered-v2
Resource description:
---
license: mit
task_categories:
- image-retrieval
- question-answering
language:
- en
pretty_name: screenshot-training
size_categories:
- 100K<n<1M
---
# Chrisyichuan/screenshot-training-natural-filtered-v2
Wikipedia screenshot retrieval training dataset exported from local hard-negative mining.
## Contents
- `train.jsonl` / `train_hn.jsonl`
- `eval.jsonl` / `eval_hn.jsonl`
- `test.jsonl` / `test_hn.jsonl`
- `train_hn_with_answer.jsonl` / `eval_hn_with_answer.jsonl` / `test_hn_with_answer.jsonl`
- `lite-query-v2-full-filtered-hn-with-answer.jsonl`
- `images/` (materialized locally from the tar shards under `image_shards/`; see Image Storage)
Each metadata row has the form:
```json
{
"query": "...",
"chunk_path": "images/shard_123/shard_00001/123456.png.tiles/chunk_0000_00.png",
"neg_chunk_paths": [
"images/shard_234/shard_00002/234567.png.tiles/chunk_0000_01.png"
],
"split": "train"
}
```
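As a minimal sketch of how these rows might be consumed, the following reads one of the JSONL files and resolves each relative image path against the dataset root. The names `Example` and `iter_examples` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of any library.

```python
import json
from pathlib import Path
from typing import Iterator, List, NamedTuple, Union


class Example(NamedTuple):
    """One retrieval example: a query, its positive tile, and hard negatives."""
    query: str
    positive: Path
    negatives: List[Path]
    split: str


def iter_examples(
    jsonl_path: Union[str, Path],
    dataset_root: Union[str, Path] = ".",
) -> Iterator[Example]:
    """Yield one Example per non-empty line of a metadata JSONL file.

    Image paths in the file are relative to the dataset root, so they are
    joined onto `dataset_root` here.
    """
    root = Path(dataset_root)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            yield Example(
                query=row["query"],
                positive=root / row["chunk_path"],
                negatives=[root / p for p in row["neg_chunk_paths"]],
                split=row["split"],
            )
```

A typical call would be `for ex in iter_examples("train_hn.jsonl", dataset_root="."): ...`, after the images have been extracted.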
The answer-enriched metadata adds one more field:
```json
{
"query": "...",
"chunk_path": "images/shard_123/shard_00001/123456.png.tiles/chunk_0000_00.png",
"neg_chunk_paths": [
"images/shard_234/shard_00002/234567.png.tiles/chunk_0000_01.png"
],
"answer": "...",
"split": "train"
}
```
## Split sizes
- train: 104033
- eval: 5779
- test: 5781
## Notes
- Image paths are stored relative to the dataset root.
- The source images were deduplicated before export, so an image that recurs as a hard negative is uploaded only once.
- The primary split files are query-filtered hard-negative metadata without answers.
- The additional `*_with_answer.jsonl` files were produced by joining the split files back to the original
`lite-query-v2-full-filtered.jsonl` source on the key `(query, chunk_path)`, with a
`100.0%` match rate for this cleaned subset.
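The join described in the last note can be sketched as follows. This is an illustration, not the export code used for the dataset: `join_answers` is a hypothetical helper, and it assumes the source rows carry `query`, `chunk_path`, and `answer` fields under exactly those names.

```python
from typing import Dict, List, Tuple


def join_answers(
    hn_rows: List[dict],
    source_rows: List[dict],
) -> Tuple[List[dict], float]:
    """Attach `answer` to each hard-negative row by (query, chunk_path).

    Returns the joined rows and the match rate in percent.
    """
    # Index the source rows by the composite key (assumed field names).
    answers: Dict[Tuple[str, str], str] = {
        (r["query"], r["chunk_path"]): r["answer"] for r in source_rows
    }
    joined = []
    for row in hn_rows:
        key = (row["query"], row["chunk_path"])
        if key in answers:
            joined.append({**row, "answer": answers[key]})
    rate = 100.0 * len(joined) / len(hn_rows) if hn_rows else 0.0
    return joined, rate
```

A match rate below `100.0` would indicate rows whose `(query, chunk_path)` key is absent from the source file; those rows are dropped from the joined output in this sketch.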
## Image Storage
The images are stored as `1000` tar shards under `image_shards/` to keep
the repository file count low and make uploads/downloads more reliable.
To materialize the images locally after download:
```bash
python extract_hf_image_shards.py --dataset-dir .
```
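If the script is not at hand, the same step can be approximated with the standard library. This is a sketch under stated assumptions, not a reimplementation of `extract_hf_image_shards.py`: it assumes each shard is a plain `*.tar` file under `image_shards/` and that member paths inside the archives are already relative to the dataset root (for example `images/...`). The function name `extract_shards` is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Union


def extract_shards(
    dataset_dir: Union[str, Path] = ".",
    shard_dir: str = "image_shards",
) -> int:
    """Extract every *.tar shard into the dataset root; return the shard count."""
    root = Path(dataset_dir)
    count = 0
    for shard in sorted((root / shard_dir).glob("*.tar")):
        with tarfile.open(shard) as tar:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction on Python versions that support filters.
                tar.extractall(root, filter="data")
            else:
                tar.extractall(root)
        count += 1
    return count
```

After it runs, the paths stored in the metadata files should resolve relative to `dataset_dir`, provided the layout assumption above holds.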
Provider:
Chrisyichuan



