undefined443/cc12m-wds-coco-recaptioned

Name: undefined443/cc12m-wds-coco-recaptioned
Creator: undefined443
Published: 2026-04-13 19:36:36
License: (no description provided)

Hugging Face · updated 2026-04-13 · indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/undefined443/cc12m-wds-coco-recaptioned


Description:

---
license: cc-by-4.0
task_ids:
- image-captioning
tags:
- vision-language
- cc12m
- coco-style
- image-text
- webdataset
- nemotron
pretty_name: CC12M WebDataset with COCO-style Recaptions
size_categories:
- 1M<n<10M
language:
- en
---

# CC12M WebDataset with COCO-style Recaptions

A large-scale image-text dataset containing 3 million images from Conceptual Captions 12M (CC12M) with COCO-style factual descriptions generated using NVIDIA Nemotron Nano 12B v2 VL.

## Dataset Overview

- **Base Dataset**: [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds) - Conceptual Captions 12M (CC12M)
- **Images**: 3,000,000+ high-quality internet images
- **Recaption Model**: NVIDIA Nemotron Nano 12B v2 VL
- **Recaption Style**: COCO-style factual descriptions (20 words average)
- **Success Rate**: ~99.99% (2,986,571 successful captions)
- **Format**: WebDataset (TAR archives)
- **Total Size**: ~330 GB

## Features

- ✓ High-quality COCO-style image descriptions
- ✓ Concise, factual captions (3-25 words)
- ✓ No speculative language ("might", "appears", "suggests", etc.)
- ✓ Consistent caption quality validated by COCO standards
- ✓ Optimized for vision-language model training

## Data Format

Each shard contains image-text pairs in WebDataset format:

```
shard-00000.tar
├── 000000004.jpg    # Image file
├── 000000004.json   # Metadata (url, key, status, recaption, etc.)
├── 000000008.jpg
├── 000000008.json
└── ...
```

### JSON Structure

```json
{
  "url": "https://example.com/image.jpg",
  "key": "000000004",
  "status": "success",
  "error_message": null,
  "width": 768,
  "height": 512,
  "exif": "{}",
  "original_width": 930,
  "original_height": 620,
  "recaption": "Camera gear, including lenses, batteries, and a drone controller, is meticulously arranged on a wooden floor."
}
```

## Usage

### Loading with WebDataset

```python
import webdataset as wds

# Brace notation expands to the 598 shard files, read from the local directory.
dataset = wds.WebDataset(
    'pipe:cat cc12m-coco-{00000..00597}.tar'
).decode('pil').to_tuple('jpg', 'json')

# Each sample is a (PIL image, metadata dict) pair.
for img, meta in dataset:
    caption = meta['recaption']
    print(caption)
```

### Loading with Hugging Face Datasets

```python
from datasets import load_dataset

dataset = load_dataset('undefined443/cc12m-wds-coco-recaptioned')
```

## Recaption Generation

Captions were generated using:

- **Model**: NVIDIA Nemotron Nano 12B v2 VL (12B parameters)
- **Prompt**: "Write a single factual sentence of no more than 20 words describing the main subject and action in this image. Start directly with the subject. Do not start with 'The image', 'The photo', or 'This image'. Be concise and objective."
- **API**: NVIDIA NIM API (nvidia/nemotron-nano-12b-v2-vl)
- **Validation**: Captions are validated against COCO-style quality standards:
  - Minimum 3 words, maximum 25 words
  - No "The/This image/photo/picture/screenshot" prefix
  - No speculative language (might, appears, suggests, possibly, etc.)
  - No markdown formatting or line breaks

## Statistics

- **Total Images**: 3,000,000+
- **Successful Captions**: 2,986,571
- **Failed/Skipped**: 1
- **Success Rate**: 99.9999%
- **Average Caption Length**: ~15 words
- **Min Caption Length**: 3 words
- **Max Caption Length**: 25 words

## File Organization

The dataset is distributed across 598 shard files:

- `cc12m-coco-00000.tar` to `cc12m-coco-00597.tar`
- Each shard ~550-570 MB
- Total uncompressed size: ~330 GB

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{cc12m-wds-coco-recaptioned,
  title={CC12M WebDataset with COCO-style Recaptions},
  author={Xiao Li},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/undefined443/cc12m-wds-coco-recaptioned}}
}
```

Also cite the original CC12M dataset:

```bibtex
@inproceedings{changpinyo2021conceptual,
  title={Conceptual 12M: Pushing web-scale image-text pre-training by disentangling visual and language representations},
  author={Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={3558--3568},
  year={2021}
}
```

## License

The recaptions are provided under the same license as the original CC12M dataset. Please respect the original image licenses and usage rights.

## Disclaimer

This is a derived dataset. The original images and captions are from CC12M, and the recaptions were automatically generated using a vision-language model. While care has been taken to ensure quality, some captions may be imperfect or inaccurate. Users should verify captions for critical applications.

## Contact

For issues, questions, or feedback about this dataset, please open an issue on the Hugging Face repository.
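The validation rules listed under Recaption Generation can be sketched as a small checker. This is an illustrative reconstruction, not the author's actual validation code: the function name `caption_problems`, the problem labels, and the regular expressions are assumptions, and the speculative-word list contains only the four words the card names (the card's "etc." is left unfilled).

```python
import re

# Only the words the card names explicitly; the real list is longer ("etc.").
SPECULATIVE_WORDS = {"might", "appears", "suggests", "possibly"}

# Rule: no "The/This image/photo/picture/screenshot" prefix.
BANNED_PREFIX = re.compile(
    r"^(the|this)\s+(image|photo|picture|screenshot)\b", re.IGNORECASE
)

# Assumption: a rough notion of "markdown formatting" (emphasis marks,
# backticks, leading '#', inline links). The card does not define it.
MARKDOWN_MARKS = re.compile(r"[*`]|^#|\[[^\]]*\]\([^)]*\)")


def caption_problems(caption: str) -> list[str]:
    """Return the labels of the rules a caption violates (empty if it passes)."""
    problems = []
    words = caption.split()

    # Rule: minimum 3 words, maximum 25 words.
    if len(words) < 3:
        problems.append("too_short")
    if len(words) > 25:
        problems.append("too_long")

    if BANNED_PREFIX.match(caption.strip()):
        problems.append("banned_prefix")

    # Compare lowercase words with punctuation stripped against the word list.
    normalized = {re.sub(r"[^a-z]", "", w.lower()) for w in words}
    if normalized & SPECULATIVE_WORDS:
        problems.append("speculative")

    # Rule: no markdown formatting or line breaks.
    if "\n" in caption or "\r" in caption or MARKDOWN_MARKS.search(caption):
        problems.append("formatting")

    return problems
```

Applied to the `recaption` field of each JSON record, a check like this could be used to re-verify captions after download; the sample caption shown under JSON Structure passes all five rules.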


Provider:

undefined443
