UPShf/FlowTalk-V1.1_ImageNet-1k-captions_captions-only
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/UPShf/FlowTalk-V1.1_ImageNet-1k-captions_captions-only
Resource description:
---
license: apache-2.0
pretty_name: UPShf/FlowTalk (planned) V1.1 - ImageNet-1k captions (captions only)
tags:
- image-captioning
- imagenet
- qwen
task_categories:
- image-to-text
language:
- en
size_categories:
- 100K<n<1M
---

# UPShf/FlowTalk (planned) V1.1 - ImageNet-1k captions (captions only)
This repository is intended to be the **captions-only** release used for *UPShf/FlowTalk (planned) V1.1* training.
It **does not** ship any ImageNet images (public redistribution is typically not permitted). Users must link these captions to **their own local copy** of ImageNet-1k images (for example the 256x256 variant from `benjamin-paine/imagenet-1k-256x256`).
Base images: `benjamin-paine/imagenet-1k-256x256`.
Caption generation pipeline: Sigma-Captioner (SGLANG branch):
https://github.com/uninterruptedpowersupply3-NEW/Sigma-Captioner/tree/SGLANG
## What is in the dataset
This repo can contain two **captions-only** JSONL formats. Both are sharded and may also be provided as a ZIP archive created in WinRAR (store-level or "Best" compression).
### A) Path-mapped captions (recommended)
Files:
- `mapped-*.jsonl` (keyed by ImageNet parquet `image.path`)
Each JSONL line looks like:
```json
{"imagenet_path":"...","split":"train|validation|test","label":123,"global_idx":456789,"caption":"...","caption_source":"..."}
```
Notes:
- `imagenet_path` is the stable identifier from the parquet column `image.path` in `benjamin-paine/imagenet-1k-256x256`.
- Some rows are intentionally missing (low-quality filtering and evaluation removals), so expect only a partial match against the full ImageNet-1k set.
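
As a minimal sketch of consuming this format, the shards can be read into a dictionary keyed by `imagenet_path`. The function name `load_mapped_captions` and the `shard_dir` argument are illustrative and not part of this repository; the sketch only assumes the field names shown in the example line above.

```python
import json
from pathlib import Path


def load_mapped_captions(shard_dir):
    """Read every mapped-*.jsonl shard into a dict keyed by imagenet_path.

    `shard_dir` is a hypothetical local folder holding the downloaded shards.
    """
    captions = {}
    for shard in sorted(Path(shard_dir).glob("mapped-*.jsonl")):
        with open(shard, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                captions[record["imagenet_path"]] = record
    return captions
```

Because some rows are intentionally missing, a lookup for an arbitrary image should use `captions.get(path)` and handle `None`.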
### B) pHash captions (robust to renames / slight re-encodes)
Files:
- `captions-*.jsonl` (keyed by a perceptual hash stored in the `sha256` field for historical reasons)
Each JSONL line looks like:
```json
{"sha256":"d8fa381417597327","hash_mode":"perceptual_hash_phash","caption":"...","caption_source":"...","image_filename":"..."}
```
Notes:
- `hash_mode=perceptual_hash_phash` means `sha256` is **NOT** a real SHA-256; it is a pHash string (16 hex chars).
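
A 16-character hex key corresponds to a 64-bit hash, so two keys can be compared bit by bit. The sketch below is an optional extension rather than something this repository prescribes: it parses a key strictly and computes the Hamming distance between two keys, which may help when an exact string match misses a slightly re-encoded image. The function names are illustrative.

```python
import re

_PHASH_RE = re.compile(r"[0-9a-fA-F]{16}")


def phash_to_int(hex_key):
    """Parse a 16-hex-char pHash key (the `sha256` field) into a 64-bit int."""
    if not _PHASH_RE.fullmatch(hex_key):
        raise ValueError(f"not a 16-hex-char pHash key: {hex_key!r}")
    return int(hex_key, 16)


def hamming_distance(key_a, key_b):
    """Count the bits that differ between two pHash keys (0 means identical)."""
    return bin(phash_to_int(key_a) ^ phash_to_int(key_b)).count("1")
```

Any distance threshold used for near-matching is a judgment call and should be validated on a sample before being trusted.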
### Build stats (current release)
- Captions exported / mapped: **891,549**
- JSONs with no caption (filtered out): **2,298**
- Full ImageNet-1k rows scanned (parquet): **1,431,167**
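
To see how the partial coverage is distributed, the mapped shards can be counted per split. This is a rough sketch: `split_coverage` and `shard_dir` are illustrative names, and the split sizes are the standard ImageNet-1k counts, which are assumed (not stated in this card) to apply to the parquet files; they do sum to the 1,431,167 rows reported above.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed standard ImageNet-1k split sizes (sum: 1,431,167 rows).
SPLIT_SIZES = {"train": 1_281_167, "validation": 50_000, "test": 100_000}


def split_coverage(shard_dir):
    """Return {split: (captioned_rows, fraction_of_split)} for mapped shards."""
    counts = Counter()
    for shard in sorted(Path(shard_dir).glob("mapped-*.jsonl")):
        with open(shard, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    counts[json.loads(line)["split"]] += 1
    return {
        split: (counts[split], counts[split] / total)
        for split, total in SPLIT_SIZES.items()
    }
```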
## Linking captions to your local images (order does not matter)
If your images are extracted in a different order (e.g. "image 12 is image 31"), you can still link captions by filename:
- Find the local file that matches `imagenet_path`.
- Use `split` to know whether it is train/validation/test.
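
The filename-based linking above could look roughly like this. It is a sketch under two assumptions that this card does not confirm: that `imagenet_path` ends in a filename, and that your local files kept those filenames. `index_local_images`, `link_by_filename`, and `IMAGE_SUFFIXES` are illustrative names.

```python
from pathlib import Path, PurePosixPath

# Assumed image extensions for the local copy; adjust to your extraction.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def index_local_images(image_root):
    """Map each image filename under image_root to the paths that carry it."""
    index = {}
    for path in Path(image_root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            index.setdefault(path.name, []).append(path)
    return index


def link_by_filename(records, image_root):
    """Pair caption records with local files by the filename in imagenet_path."""
    index = index_local_images(image_root)
    linked, unmatched = [], []
    for record in records:
        name = PurePosixPath(record["imagenet_path"].replace("\\", "/")).name
        candidates = index.get(name, [])
        if len(candidates) == 1:
            linked.append((candidates[0], record))
        else:
            unmatched.append(record)  # missing locally, or ambiguous filename
    return linked, unmatched
```

Records whose filename appears more than once locally are left unmatched on purpose; the `split` field can then be used to decide between the candidates.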
If your local copy renamed files, use the pHash JSONL:
- Compute the same pHash (`imagehash.phash`) over your local images
- Join on the 16-hex string in the `sha256` field
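
A minimal sketch of that join follows. It assumes the keys were produced with `imagehash.phash` at its default settings (the default 8x8 hash yields 16 hex characters, which is consistent with the keys shown above, but the exact parameters are not confirmed here). `default_phash` and `link_by_phash` are illustrative names, and the hash function is injectable so that a different configuration can be swapped in.

```python
from pathlib import Path


def default_phash(path):
    """Hash one image with imagehash.phash (assumed default parameters)."""
    # Imported lazily so the rest of the module works without these packages.
    import imagehash
    from PIL import Image

    with Image.open(path) as img:
        return str(imagehash.phash(img))


def link_by_phash(records, image_paths, hash_fn=default_phash):
    """Return {local_path: record} for images whose pHash matches a record."""
    by_hash = {
        r["sha256"]: r
        for r in records
        if r.get("hash_mode") == "perceptual_hash_phash"
    }
    linked = {}
    for path in image_paths:
        record = by_hash.get(hash_fn(path))
        if record is not None:
            linked[Path(path)] = record
    return linked
```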
Warning:
- pHash is designed for similarity, not cryptographic uniqueness. If you want extra safety, do a second-stage verification (e.g., also compare byte-sha256 after a candidate match).
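
One way to act on this warning is to look for local files that share a pHash but differ in content, since those keys cannot be joined safely. The sketch below assumes you already have a mapping from local path to pHash string; `file_sha256` and `ambiguous_phashes` are illustrative names, and this is only one possible second-stage check.

```python
import hashlib
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ambiguous_phashes(path_to_phash):
    """Return {phash: [paths]} where byte-different files share one pHash."""
    groups = defaultdict(list)
    for path, key in path_to_phash.items():
        groups[key].append(path)
    ambiguous = {}
    for key, paths in groups.items():
        # Exact duplicates (same bytes) are harmless; differing bytes are not.
        if len(paths) > 1 and len({file_sha256(p) for p in paths}) > 1:
            ambiguous[key] = sorted(paths)
    return ambiguous
```

Captions joined through a key reported here are worth reviewing by hand or excluding.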
## How to reproduce locally (Windows)
Run the command below from the root of this workspace. It:
- uses both caption sources (`CaptinedIMGNET` and `extracted_images`)
- reads only parquet metadata (`image.path`, `label`) and caption JSONs
- does not load image bytes
```powershell
python .\codexGPT5.2HIGH.py build-mapped-index `
--parquet_dir .\imagenet-1k-256x256\data `
--out_dir .\flowtalk_mapped_shards `
--out_zip .\flowtalk_v1.1_imagenet1k_captions_mapped_store.zip `
--batch_size 50000 --queue 50000 --shard_size 50000 --workers 8
```
To be explicit (no auto-detect), add:
- `--pairs_dir .\imagenet-1k-256x256\CaptinedIMGNET`
- `--pairs_dir .\imagenet-1k-256x256\extracted_images`
To generate the pHash JSONL:
```powershell
python .\codexGPT5.2HIGH.py export-captions `
--caption_dir .\imagenet-1k-256x256\CaptinedIMGNET `
--caption_dir .\imagenet-1k-256x256\extracted_images `
--out_dir .\flowtalk_phash_shards `
--hash_mode phash `
--shard_size 50000 --queue 5000 --workers 8 `
--skip_missing_images
```
### Performance notes
- `--pairs_lookup ram` (default) scans the caption directories once and does O(1) in-RAM lookups instead of millions of per-row `exists()` calls.
- Optional (faster JSON): `pip install orjson` (the script uses it automatically when available).
- Optional (progress bar): `pip install tqdm`.
- Optional (perceptual hashing for `export-captions --hash_mode phash`): `pip install ImageHash` (imports as `imagehash`).
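
The idea behind the in-RAM lookup can be illustrated as follows. This is not the script's actual implementation; it assumes, purely for illustration, that each caption is a flat `.json` file named after its image stem. One directory scan builds a set, and each later membership test is a constant-time set lookup instead of a filesystem call.

```python
import os


def scan_caption_stems(*caption_dirs):
    """Scan each caption directory once and collect the stems of .json files."""
    stems = set()
    for directory in caption_dirs:
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(".json"):
                    stems.add(os.path.splitext(entry.name)[0])
    return stems
```

With the set in hand, `stem in stems` replaces a per-row `os.path.exists()` check.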
## Licensing & Dataset Usage
*Disclaimer: This section is provided for informational purposes only and does not constitute legal advice.*
- **Captions & Metadata:** The text captions and JSON metadata generated in this repository are released under the **Apache-2.0** license.
- **Underlying Images:** The original ImageNet images are **NOT** redistributed in this repository. Users must obtain the images independently and comply with the official ImageNet terms of access, as well as any upstream dataset terms (such as `benjamin-paine/imagenet-1k-256x256`).
- **Copyright:** This dataset provides derivative text descriptions. Users are responsible for ensuring their use of the combined text and image data complies with all applicable licenses.
## Known Limitations & Bugs
- **Language & Vocabulary Constraints:** This dataset is intended to be entirely in English. However, because the captions and tags were generated by automated AI models, there are a few edge cases to be aware of:
  - **Hallucinations:** Rare instances of non-English characters or words may occur due to standard Vision-Language Model hallucinations.
  - **Loanwords & Entities:** Tag-based captions may include widely accepted loanwords (e.g., "taco", "sushi"), proper nouns, or domain-specific terminology that some strict language filters might flag as non-English.
If you are training a strict English-only model, you may want to apply a basic vocabulary filter to the text before training to catch any edge cases.
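
As one possible starting point for such a filter, captions containing non-ASCII letters can be separated out for review. This is a deliberately crude sketch with illustrative function names: it catches stray non-Latin script but not romanized non-English words, and it would also flag legitimate accented spellings.

```python
import unicodedata


def non_ascii_letters(caption):
    """Return the sorted set of non-ASCII letter characters in a caption."""
    return sorted(
        {
            ch
            for ch in caption
            if not ch.isascii() and unicodedata.category(ch).startswith("L")
        }
    )


def split_by_script(captions):
    """Split captions into (clean, flagged) by presence of non-ASCII letters."""
    clean, flagged = [], []
    for caption in captions:
        (flagged if non_ascii_letters(caption) else clean).append(caption)
    return clean, flagged
```

Loanwords such as "taco" pass this check because they are plain ASCII; punctuation and symbols are ignored since only letter categories are tested.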
## Model credits (captioning)
Captions were produced using a multi-model pipeline, including:
- BLIP captions: `Salesforce/blip-image-captioning-large` (stored as `blip.caption`)
- Tag-style captions (previously labeled `wd_tagger` in JSON): `Qwen/Qwen3-VL-Embedding-2B` using `UPShf/Vocabulary-Qwen3-VL-Embedding-2B` (stored as `wd_tagger.caption`)
- Sigma-Captioner (SGLANG) QA captions: `Qwen/Qwen3.5-2B` (stored under `sglang.qa_pairs` in the raw JSON)
Captioner mix (from a metadata scan over **893,847** caption JSON files; selection order `sglang > blip > wd_tagger`):
| Captioner | Count | % |
|---|---:|---:|
| `Salesforce/blip-image-captioning-large` | 496,043 | 55.50% |
| `Qwen/Qwen3-VL-Embedding-2B` (+ `UPShf/Vocabulary-Qwen3-VL-Embedding-2B`) | 376,544 | 42.13% |
| `Qwen/Qwen3.5-2B` (Sigma-Captioner / SGLANG) | 21,260 | 2.38% |
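
The selection order `sglang > blip > wd_tagger` can be sketched as a simple priority lookup. The layout of the raw JSON is an assumption here: the dotted names above are read as nested objects (`raw["blip"]["caption"]`, `raw["sglang"]["qa_pairs"]`), and the sketch only reports which source would win; how QA pairs are turned into a final caption is not described in this card, so it is left out.

```python
# Priority order taken from the card; the nested layout is an assumption.
PRIORITY = ("sglang", "blip", "wd_tagger")


def select_source(raw):
    """Return the first source in PRIORITY with non-empty content, else None."""
    for source in PRIORITY:
        block = raw.get(source)
        if not isinstance(block, dict):
            continue  # source absent or stored in an unexpected shape
        field = "qa_pairs" if source == "sglang" else "caption"
        if block.get(field):
            return source
    return None
```

Under this reading, a file counts toward `Qwen/Qwen3.5-2B` only when it has QA pairs, which would explain that model's small share of the mix.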
Provider:
UPShf



