
UPShf/FlowTalk-V1.1_ImageNet-1k-captions_captions-only

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/UPShf/FlowTalk-V1.1_ImageNet-1k-captions_captions-only
Resource description:
---
license: apache-2.0
pretty_name: UPShf/FlowTalk (planned) V1.1 - ImageNet-1k captions (captions only)
tags:
- image-captioning
- imagenet
- qwen
task_categories:
- image-to-text
language:
- en
size_categories:
- 100K<n<1M
---

![image](https://cdn-uploads.huggingface.co/production/uploads/69c678dd8ad32029cdb3231d/3LvGgyRIhIGja5kE0qYAT.png)

# UPShf/FlowTalk (planned) V1.1 - ImageNet-1k captions (captions only)

This repository is intended to be the **captions-only** release used for *UPShf/FlowTalk (planned) V1.1* training.

It **does not** ship any ImageNet images (public redistribution is typically not permitted). Users must link these captions to **their own local copy** of ImageNet-1k images (for example, the 256x256 variant from `benjamin-paine/imagenet-1k-256x256`).

Base images: `benjamin-paine/imagenet-1k-256x256`.

Caption generation pipeline: Sigma-Captioner (SGLANG branch): https://github.com/uninterruptedpowersupply3-NEW/Sigma-Captioner/tree/SGLANG

## What is in the dataset

This repo can contain two **captions-only** JSONL formats. Both are sharded, and may also be provided as a ZIP archive (built in WinRAR with "Best" compression).

### A) Path-mapped captions (recommended)

Files:

- `mapped-*.jsonl` (keyed by ImageNet parquet `image.path`)

Each JSONL line looks like:

```json
{"imagenet_path":"...","split":"train|validation|test","label":123,"global_idx":456789,"caption":"...","caption_source":"..."}
```

Notes:

- `imagenet_path` is the stable identifier from the parquet column `image.path` in `benjamin-paine/imagenet-1k-256x256`.
- Some rows are intentionally missing (low-quality filtering / evaluation removals), so expect a partial match against the full ImageNet-1k set.
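As a rough illustration of the path-mapped format, the sketch below reads JSONL lines into a lookup keyed by `imagenet_path`. The function name `load_mapped_captions` and the sample rows are hypothetical; only the field names come from the format described above.

```python
import json
from typing import Dict, Iterable


def load_mapped_captions(lines: Iterable[str]) -> Dict[str, dict]:
    """Build {imagenet_path: record} from path-mapped JSONL lines."""
    index: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        index[record["imagenet_path"]] = record
    return index


# Hypothetical sample rows; real shards are files named mapped-*.jsonl.
sample = [
    '{"imagenet_path":"a.jpg","split":"train","label":1,'
    '"global_idx":0,"caption":"a cat","caption_source":"blip"}',
    '{"imagenet_path":"b.jpg","split":"validation","label":2,'
    '"global_idx":1,"caption":"a dog","caption_source":"blip"}',
]
captions = load_mapped_captions(sample)

# Rows can be missing by design, so use .get() rather than indexing.
for path in ["a.jpg", "missing.jpg"]:
    hit = captions.get(path)
    print(path, "->", hit["caption"] if hit else None)
```

With real shards, the same function can be fed the lines of each file, for example `load_mapped_captions(open(shard_path, encoding="utf-8"))`.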
### B) pHash captions (robust to renames / slight re-encodes)

Files:

- `captions-*.jsonl` (keyed by a perceptual hash stored in the `sha256` field for historical reasons)

Each JSONL line looks like:

```json
{"sha256":"d8fa381417597327","hash_mode":"perceptual_hash_phash","caption":"...","caption_source":"...","image_filename":"..."}
```

Notes:

- `hash_mode=perceptual_hash_phash` means `sha256` is **NOT** a real SHA-256; it is a pHash string (16 hex chars).

### Build stats (current release)

- Captions exported / mapped: **891,549**
- JSONs with no caption (filtered out): **2,298**
- Full ImageNet-1k rows scanned (parquet): **1,431,167**

## Linking captions to your local images (order does not matter)

If your images are extracted in a different order (e.g. "image 12 is image 31"), you can still link captions by filename:

- Find the local file that matches `imagenet_path`.
- Use `split` to know whether it is train/validation/test.

If your local copy renamed files, use the pHash JSONL:

- Compute the same pHash (`imagehash.phash`) over your local images.
- Join on the 16-hex string in the `sha256` field.

Warning:

- pHash is designed for similarity, not cryptographic uniqueness. If you want extra safety, do a second-stage verification (e.g. also compare byte-level SHA-256 after a candidate match).
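The pHash join and the collision warning above can be sketched as follows. This is a minimal illustration, not the release tooling: `link_by_phash` and the sample data are hypothetical, and it assumes the local hashes have already been computed, for instance with `str(imagehash.phash(Image.open(path)))` at the library's default hash size (whether that matches the exact settings used for this release is an assumption).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def link_by_phash(
    caption_records: Iterable[dict],
    local_hashes: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Return ({local_path: caption}, ambiguous_hashes)."""
    # Group local files by hash so collisions are visible.
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for path, phash in local_hashes.items():
        by_hash[phash].append(path)

    linked: Dict[str, str] = {}
    ambiguous: List[str] = []
    for record in caption_records:
        key = record["sha256"]  # a 16-hex pHash, not a real SHA-256
        paths = by_hash.get(key, [])
        if len(paths) == 1:
            linked[paths[0]] = record["caption"]
        elif len(paths) > 1:
            # Several local files share this hash: needs a second check.
            ambiguous.append(key)
    return linked, ambiguous


# Hypothetical data: two local files collide on the same pHash.
records = [
    {"sha256": "d8fa381417597327", "caption": "a tabby cat on a sofa"},
    {"sha256": "ffff000011112222", "caption": "a sailboat at sunset"},
    {"sha256": "0000000000000000", "caption": "no local match"},
]
local = {
    "img/x.jpg": "d8fa381417597327",
    "img/y.jpg": "ffff000011112222",
    "img/z.jpg": "ffff000011112222",
}
linked, ambiguous = link_by_phash(records, local)
print(linked)
print(ambiguous)
```

Hashes that land in `ambiguous` are the candidates for the second-stage verification mentioned in the warning; unmatched records are simply skipped.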
## How to reproduce locally (Windows)

From the root of this workspace, the command below:

- uses both caption sources (`CaptinedIMGNET` and `extracted_images`)
- reads only parquet metadata (`image.path`, `label`) and caption JSONs
- does not load image bytes

```powershell
python .\codexGPT5.2HIGH.py build-mapped-index `
  --parquet_dir .\imagenet-1k-256x256\data `
  --out_dir .\flowtalk_mapped_shards `
  --out_zip .\flowtalk_v1.1_imagenet1k_captions_mapped_store.zip `
  --batch_size 50000 --queue 50000 --shard_size 50000 --workers 8
```

To be explicit (no auto-detect), add:

- `--pairs_dir .\imagenet-1k-256x256\CaptinedIMGNET`
- `--pairs_dir .\imagenet-1k-256x256\extracted_images`

To generate the pHash JSONL:

```powershell
python .\codexGPT5.2HIGH.py export-captions `
  --caption_dir .\imagenet-1k-256x256\CaptinedIMGNET `
  --caption_dir .\imagenet-1k-256x256\extracted_images `
  --out_dir .\flowtalk_phash_shards `
  --hash_mode phash `
  --shard_size 50000 --queue 5000 --workers 8 `
  --skip_missing_images
```

### Performance notes

- `--pairs_lookup ram` (default) scans the caption directories once and does O(1) in-RAM lookups instead of millions of per-row `exists()` calls.
- Optional (faster JSON): `pip install orjson` (the script uses it automatically when available).
- Optional (progress bar): `pip install tqdm`.
- Optional (perceptual hashing for `export-captions --hash_mode phash`): `pip install ImageHash` (imports as `imagehash`).

## Licensing & Dataset Usage

*Disclaimer: This section is provided for informational purposes only and does not constitute legal advice.*

- **Captions & Metadata:** The text captions and JSON metadata generated in this repository are released under the **Apache-2.0** license.
- **Underlying Images:** The original ImageNet images are **NOT** redistributed in this repository. Users must obtain the images independently and comply with the official ImageNet terms of access, as well as any upstream dataset terms (such as `benjamin-paine/imagenet-1k-256x256`).
- **Copyright:** This dataset provides derivative text descriptions. Users are responsible for ensuring their use of the combined text and image data complies with all applicable licenses.

## Known Limitations & Bugs

- **Language & Vocabulary Constraints:** This dataset is intended to be entirely in English. However, because the captions and tags were generated by automated AI models, there are a few edge cases to be aware of:
  - **Hallucinations:** Rare instances of non-English characters or words may occur due to standard Vision-Language Model hallucinations.
  - **Loanwords & Entities:** Tag-based captions may include widely accepted loanwords (e.g. "taco", "sushi"), proper nouns, or domain-specific terminology that some strict language filters might flag as non-English.

If you are training a strict English-only model, you may want to apply a basic vocabulary filter to the text before training to catch any edge cases.

## Model credits (captioning)

Captions were produced using a multi-model pipeline, including:

- BLIP captions: `Salesforce/blip-image-captioning-large` (stored as `blip.caption`)
- Tag-style captions (previously labeled `wd_tagger` in JSON): `Qwen/Qwen3-VL-Embedding-2B` using `UPShf/Vocabulary-Qwen3-VL-Embedding-2B` (stored as `wd_tagger.caption`)
- Sigma-Captioner (SGLANG) QA captions: `Qwen/Qwen3.5-2B` (stored under `sglang.qa_pairs` in the raw JSON)

Captioner mix (from a metadata scan over **893,847** caption JSON files; selection order `sglang > blip > wd_tagger`):

| Captioner | Count | % |
|---|---:|---:|
| `Salesforce/blip-image-captioning-large` | 496,043 | 55.50% |
| `Qwen/Qwen3-VL-Embedding-2B` (+ `UPShf/Vocabulary-Qwen3-VL-Embedding-2B`) | 376,544 | 42.13% |
| `Qwen/Qwen3.5-2B` (Sigma-Captioner / SGLANG) | 21,260 | 2.38% |
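The selection order `sglang > blip > wd_tagger` noted above might be applied to a raw caption JSON roughly as sketched here. This is not the pipeline's actual code: `pick_caption` is a hypothetical helper, the dotted names (`blip.caption`, `wd_tagger.caption`, `sglang.qa_pairs`) are assumed to be nested keys, and both the shape of `qa_pairs` (a list of objects with an `answer` field) and the way answers are joined into one caption are assumptions made for illustration.

```python
from typing import Optional, Tuple


def pick_caption(raw: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (caption, source) using the order sglang > blip > wd_tagger."""
    # Assumption: qa_pairs is a list of {"question": ..., "answer": ...}
    # dicts, and a caption is formed by joining the non-empty answers.
    qa_pairs = (raw.get("sglang") or {}).get("qa_pairs") or []
    answers = [
        (pair.get("answer") or "").strip()
        for pair in qa_pairs
        if isinstance(pair, dict)
    ]
    answers = [a for a in answers if a]
    if answers:
        return " ".join(answers), "sglang"

    # Fall back to the single-string captions, in priority order.
    for source in ("blip", "wd_tagger"):
        caption = ((raw.get(source) or {}).get("caption") or "").strip()
        if caption:
            return caption, source

    # No usable caption in this JSON.
    return None, None


# Hypothetical raw JSON with only BLIP and tag-style captions present.
raw_example = {
    "blip": {"caption": "a red car parked on a street"},
    "wd_tagger": {"caption": "car, red, street"},
}
print(pick_caption(raw_example))
```

A JSON for which the helper returns `(None, None)` would correspond to the "no caption" case that the build stats report as filtered out.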
Provider:
UPShf