undefined443/cc12m-wds-recaption
Hugging Face | Updated 2026-03-31 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/undefined443/cc12m-wds-recaption
Description:
---
title: CC12M with Enhanced Captions
license: other
license_name: cc12m
license_link: https://github.com/google-research-datasets/conceptual-12m/blob/main/LICENSE
language:
- en
tags:
- image-text
- captions
- multimodal
- vision-language
- qwen-vl
- recaption
task_categories:
- image-to-text
- text-to-image
task_ids:
- image-captioning
pretty_name: CC12M Enhanced Captions
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data.parquet
---
# CC12M with Enhanced Captions
This dataset contains 1.3 million image-text pairs from the CC12M dataset with model-generated captions.
## Dataset Details
- **Total Samples**: 1,306,239
- **Source**: [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds)
- **Captioning Model**: Qwen/Qwen3-VL-8B-Instruct
- **Format**: Parquet
## Filtering Criteria
Samples were filtered based on the following quality metrics:
- **Aesthetic Score**: >= 5.5 (using LAION aesthetic classifier)
- **Resolution**: >= 512 pixels (width or height)
- **Aspect Ratio**: <= 2.0
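
The three criteria above can be written as a single predicate. This is a minimal sketch, not the pipeline that produced the dataset: the function name `passes_filters` is hypothetical, the resolution rule is read literally as "at least one side is 512 pixels or more", and the aspect ratio is assumed to be the longer side divided by the shorter side.

```python
# Hypothetical helper; thresholds are taken from the Filtering Criteria list.
MIN_AESTHETIC_SCORE = 5.5
MIN_SIDE_PIXELS = 512
MAX_ASPECT_RATIO = 2.0


def passes_filters(width: int, height: int, aesthetic_score: float) -> bool:
    """Return True if a sample meets all three thresholds."""
    if width <= 0 or height <= 0:
        return False
    # Assumption: "width or height" means at least one side reaches 512 px.
    large_enough = max(width, height) >= MIN_SIDE_PIXELS
    # Assumption: aspect ratio is the longer side divided by the shorter side.
    aspect_ratio = max(width, height) / min(width, height)
    return (
        aesthetic_score >= MIN_AESTHETIC_SCORE
        and large_enough
        and aspect_ratio <= MAX_ASPECT_RATIO
    )


print(passes_filters(1024, 768, 6.1))  # True
print(passes_filters(1600, 400, 6.1))  # False: aspect ratio is 4.0
```

If the original filter required both sides to reach 512 pixels, replacing `max` with `min` in the `large_enough` line would give that stricter reading.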
## Dataset Schema
| Column | Type | Description |
|--------|------|-------------|
| `key` | string | Original sample identifier |
| `width` | int32 | Image width in pixels |
| `height` | int32 | Image height in pixels |
| `aesthetic_score` | float32 | LAION aesthetic quality score |
| `caption` | string | Model-generated image description |
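
For code that handles rows one at a time, the schema above can be mirrored in a small record type. The sketch below is illustrative only: `CaptionRecord` and `from_row` are hypothetical names, and the example row is made up rather than taken from the dataset.

```python
from dataclasses import dataclass

INT32_MAX = 2**31 - 1


@dataclass(frozen=True)
class CaptionRecord:
    """One row, with fields named after the columns in the schema table."""

    key: str
    width: int
    height: int
    aesthetic_score: float
    caption: str

    @classmethod
    def from_row(cls, row: dict) -> "CaptionRecord":
        """Build a record from a dict keyed by column name, checking basic ranges."""
        width, height = int(row["width"]), int(row["height"])
        for name, value in (("width", width), ("height", height)):
            # width and height are int32 columns and should be positive.
            if not 0 < value <= INT32_MAX:
                raise ValueError(f"{name} out of range: {value}")
        return cls(
            key=str(row["key"]),
            width=width,
            height=height,
            aesthetic_score=float(row["aesthetic_score"]),
            caption=str(row["caption"]),
        )


# Made-up example row, not an actual sample from the dataset.
example = CaptionRecord.from_row(
    {
        "key": "000000001",
        "width": 800,
        "height": 600,
        "aesthetic_score": 5.9,
        "caption": "A placeholder caption.",
    }
)
print(example.width, example.height)
```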
## Usage
```python
import pandas as pd
from datasets import load_dataset

# Option 1: read the Parquet file directly after downloading it locally.
# The dataset config names the file data.parquet.
df = pd.read_parquet('data.parquet')
print(df.head())

# Option 2: load from the Hugging Face Hub with the datasets library.
dataset = load_dataset('undefined443/cc12m-wds-recaption')
```
## Citation
If you use this dataset, please cite the original CC12M paper:
```bibtex
@article{changpinyo2021cc12m,
  title={Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author={Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  journal={arXiv preprint arXiv:2102.08981},
  year={2021}
}
```
## License
This dataset inherits the license from the original CC12M dataset. Please refer to the [CC12M license terms](https://github.com/google-research-datasets/conceptual-12m/blob/main/LICENSE) for usage restrictions.
Provider:
undefined443



