
undefined443/cc12m-wds-recaption

Source: Hugging Face. Updated 2026-03-31; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/undefined443/cc12m-wds-recaption
Description:
---
title: CC12M with Enhanced Captions
license: other
license_name: cc12m
license_link: https://github.com/google-research-datasets/conceptual-12m/blob/main/LICENSE
language:
  - en
tags:
  - image-text
  - captions
  - multimodal
  - vision-language
  - qwen-vl
  - recaption
task_categories:
  - image-to-text
  - text-to-image
task_ids:
  - image-captioning
pretty_name: CC12M Enhanced Captions
size_categories:
  - 1M<n<10M
configs:
  - config_name: default
    data_files:
      - split: train
        path: data.parquet
---

# CC12M with Enhanced Captions

This dataset contains 1.3 million image-text pairs from the CC12M dataset, each paired with a model-generated caption.

## Dataset Details

- **Total Samples**: 1,306,239
- **Source**: [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds)
- **Captioning Model**: Qwen/Qwen3-VL-8B-Instruct
- **Format**: Parquet

## Filtering Criteria

Samples were kept only if they met all of the following quality thresholds:

- **Aesthetic Score**: >= 5.5 (using LAION aesthetic classifier)
- **Resolution**: >= 512 pixels (width or height)
- **Aspect Ratio**: <= 2.0

## Dataset Schema

| Column | Type | Description |
|--------|------|-------------|
| `key` | string | Original sample identifier |
| `width` | int32 | Image width in pixels |
| `height` | int32 | Image height in pixels |
| `aesthetic_score` | float32 | LAION aesthetic quality score |
| `caption` | string | Model-generated image description |

## Usage

```python
import pandas as pd
from datasets import load_dataset

# Load from a local copy of the Parquet file
# (the dataset config above names it data.parquet)
df = pd.read_parquet('data.parquet')
print(df.head())

# Or load through the Hugging Face datasets library
dataset = load_dataset('undefined443/cc12m-wds-recaption')
```

## Citation

If you use this dataset, please cite the original CC12M paper:

```bibtex
@article{changpinyo2021cc12m,
  title={Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author={Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  journal={arXiv preprint arXiv:2102.08981},
  year={2021}
}
```

## License

This dataset inherits the license from the original CC12M dataset. Please refer to the [CC12M license terms](https://github.com/google-research-datasets/conceptual-12m/blob/main/LICENSE) for usage restrictions.
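
## Example: Re-checking the Filtering Criteria

The three thresholds listed under Filtering Criteria can be expressed as a small predicate over the schema columns `width`, `height`, and `aesthetic_score`. The sketch below is illustrative only and is not the pipeline that produced the dataset. It assumes that "width or height" means at least one side reaches 512 pixels, and that aspect ratio is the longer side divided by the shorter side; the card does not spell out either definition. The `Sample` class and `passes_filters` function are hypothetical names.

```python
from dataclasses import dataclass

# Thresholds taken from the "Filtering Criteria" list in the dataset card.
MIN_AESTHETIC_SCORE = 5.5
MIN_RESOLUTION_PX = 512
MAX_ASPECT_RATIO = 2.0


@dataclass
class Sample:
    """Hypothetical row holding a subset of the schema columns."""
    key: str
    width: int
    height: int
    aesthetic_score: float


def passes_filters(sample: Sample) -> bool:
    """Return True if the sample meets all three criteria as interpreted here."""
    longer = max(sample.width, sample.height)
    shorter = min(sample.width, sample.height)
    if shorter <= 0:
        return False  # guard against malformed dimensions
    # Assumption: "width or height" means at least one side is >= 512 px.
    resolution_ok = longer >= MIN_RESOLUTION_PX
    # Assumption: aspect ratio is longer side divided by shorter side.
    aspect_ok = longer / shorter <= MAX_ASPECT_RATIO
    score_ok = sample.aesthetic_score >= MIN_AESTHETIC_SCORE
    return score_ok and resolution_ok and aspect_ok


samples = [
    Sample("a", 1024, 768, 6.1),  # meets all three criteria
    Sample("b", 400, 300, 7.0),   # both sides under 512 px
    Sample("c", 1600, 512, 6.0),  # aspect ratio above 2.0
    Sample("d", 800, 600, 5.0),   # aesthetic score below 5.5
]
kept = [s.key for s in samples if passes_filters(s)]
print(kept)
```

Since the published rows have already been filtered, a check like this is mainly useful for validating a download or for applying stricter thresholds to a subset.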
Provided by:
undefined443