QinboZhang/cc12m-1mp_plus-realistic

Name: QinboZhang/cc12m-1mp_plus-realistic
Creator: QinboZhang
Published: 2026-04-06 03:02:58
License: No description available

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/QinboZhang/cc12m-1mp_plus-realistic

Description:

---
language:
- en
size_categories:
- 100K<n<1M
---

# cc12m-1mp_plus-realistic

A filtered-down subset of the full CC12M dataset, selected to have the following characteristics:

1. At least 1024x1024 pixels in size.
2. "Realistic": no paintings, digital art, monochrome, or surreal images. Multi-image composites are also discarded as much as possible.
3. Ideally, no signed or watermarked images (but there will certainly be some left).

## Captions

The caption types available are a bit different from those in some of our other datasets. Currently available are:

* **caption_llava** (long llava32b)
* **caption_llava_short** (distilled shorter version of the above)
* **objects** (distilled version that pulls out JUST OBJECTS. It's kind of like wd14 tagging. Kind of.)

## Dataset composition

Unlike our other datasets, this one includes "all decent images at least 1mp in size from CC12M". Certain other datasets of ours group images by size, so our "2mp" set does not include images that are 4mp and larger.

Disclaimer: some of the images may not be great. Presume at least 1% are of poor quality, as opposed to the general case of cc12m, where the share is much, much higher.

## Why create this when we already have the 2mp+ dataset?

The image count in our 2mp+ datasets is not enough for my current needs, so I had to dig deeper.

# Creation and quality

This dataset is NOT HAND-FILTERED. I took a human-augmented batch approach, roughly as follows:

1. Attempted to download the full CC12M dataset, after filtering out some known bad sites. The images actually available came out to around 5 million.
2. Threw out everything smaller than 1mp. That left around 770k.
3. Threw out some more watermarked sites I discovered.
4. Filtered out certain magic keywords like "digital artwork", and many more.

This currently leaves around 630k images.

Hope this is useful to people.

# See also

It is worth noting that some people seem to have uploaded the actual images to Hugging Face, e.g.:

https://huggingface.co/datasets/ooutlierr/cc12m-recaptioned

The issues there are:

1. The indexing scheme is not easily compatible with what I use here.
2. I'm not sure what the most efficient way is to actually download, extract, and use those images.
3. The copyright usage there is dubious. I would imagine that copying those images and putting them up on HF publicly is most likely a copyright violation for some percentage of them.
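The batch filtering described under "Creation and quality" could look roughly like the sketch below. This is not the author's actual tooling. The `Record` type, the blocked-domain list, and the reading of "1mp" as a total pixel count of at least 1024 x 1024 are all assumptions made for illustration. The only keyword the card names is "digital artwork".

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumption: "1mp" is read as total pixel count >= 1024 * 1024.
MIN_PIXELS = 1024 * 1024

# Hypothetical placeholder; the real list of bad/watermarked sites is not published.
BLOCKED_DOMAINS = {"watermarked.example.com"}

# "digital artwork" is the only keyword named in the card; the full list is unknown.
BLOCKED_KEYWORDS = ("digital artwork",)


@dataclass
class Record:
    """Hypothetical per-image metadata row (URL, original caption, dimensions)."""
    url: str
    caption: str
    width: int
    height: int


def keep(record: Record) -> bool:
    """Return True if the record survives the site, size, and keyword filters."""
    host = urlparse(record.url).hostname or ""
    if host in BLOCKED_DOMAINS:
        return False
    if record.width * record.height < MIN_PIXELS:
        return False
    text = record.caption.lower()
    return not any(keyword in text for keyword in BLOCKED_KEYWORDS)


def filter_records(records):
    """Apply all filters and preserve the input order."""
    return [r for r in records if keep(r)]


sample = [
    Record("https://img.example.org/a.jpg", "a photo of a dog", 2048, 1536),
    Record("https://img.example.org/b.jpg", "Digital artwork of a dragon", 2048, 2048),
    Record("https://img.example.org/c.jpg", "a city street", 800, 600),
    Record("https://watermarked.example.com/d.jpg", "a beach", 4000, 3000),
]
kept = filter_records(sample)
```

Here only the first sample record passes: the second is dropped by keyword, the third by size, and the fourth by domain. A pass like this only narrows the pool. The card is explicit that the result is not hand-filtered, so some poor images will remain.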

Provided by:

QinboZhang
