QinboZhang/cc12m-1mp_plus-realistic
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/QinboZhang/cc12m-1mp_plus-realistic
Description:
---
language:
- en
size_categories:
- 100K<n<1M
---
# cc12m-1mp_plus-realistic
A filtered subset of the full CC12M dataset, selected for the following characteristics:
1. At least 1024x1024 pixels in size.
2. "Realistic": no paintings, digital art, monochrome images, or surreal content. Multi-image composites are also discarded as far as possible.
3. Ideally, no signed or watermarked images (although some will certainly remain).
## Captions
The caption types available are a bit different from those in some of our other datasets. Currently available are:
* **caption_llava** (long llava32b)
* **caption_llava_short** (distilled shorter version of above)
* **objects** (distilled version that pulls out JUST OBJECTS. It's kinda like wd14 tagging. Kinda.)
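
As a minimal sketch of how the three caption fields might be consumed, the helper below picks a preferred caption for one record and falls back to the others when it is missing or empty. The field names come from the list above. The record being a plain `dict`, the `pick_caption` helper, and the fallback order are assumptions for illustration, not part of this dataset's tooling.

```python
# Caption field names as listed in this card.
CAPTION_FIELDS = ("caption_llava", "caption_llava_short", "objects")


def pick_caption(record, prefer="caption_llava_short"):
    """Return the preferred caption, falling back to the other fields.

    `record` is assumed to be a plain dict for one image (hypothetical layout).
    """
    if prefer not in CAPTION_FIELDS:
        raise ValueError(f"unknown caption field: {prefer!r}")
    # Try the preferred field first, then the rest in their listed order.
    order = (prefer,) + tuple(f for f in CAPTION_FIELDS if f != prefer)
    for field in order:
        text = (record.get(field) or "").strip()
        if text:
            return text
    return None


# Made-up example record, not taken from the dataset.
example = {
    "caption_llava": "A long, detailed description of a street scene.",
    "caption_llava_short": "A street scene at dusk.",
    "objects": "street, car, lamp post",
}
print(pick_caption(example))
print(pick_caption(example, prefer="objects"))
```

Putting the preferred field first keeps the choice of caption style a single argument, which is convenient when comparing training runs on long versus short captions.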
## Dataset composition
Unlike some of our other datasets, this one includes "all decent images at least 1mp in size from CC12M". Certain other datasets of ours group images by size, so our "2mp" set does not include images of 4mp and larger.
Disclaimer: some of the images may not be great. Presume at least 1% are poor quality, compared with the unfiltered CC12M, where the proportion is much, much higher.
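
The difference between a size threshold (this dataset) and a size bucket (the "2mp" style) can be sketched as two predicates. This is only an illustration: it assumes "1mp" means 1024x1024 pixels counted as width times height, and that the "2mp" bucket starts at 2mp. Only the 4mp upper bound is stated above, and the function names are hypothetical.

```python
MEGAPIXEL = 1024 * 1024  # assumption: "1mp" means 1024x1024 pixels


def in_threshold_set(width, height, min_mp=1):
    """Threshold rule: keep anything at or above min_mp megapixels."""
    return width * height >= min_mp * MEGAPIXEL


def in_bucket(width, height, low_mp=2, high_mp=4):
    """Bucket rule: keep only sizes in [low_mp, high_mp) megapixels."""
    return low_mp * MEGAPIXEL <= width * height < high_mp * MEGAPIXEL


# Example sizes: exactly 1mp, exactly 2mp, and 8mp.
sizes = [(1024, 1024), (2048, 1024), (4096, 2048)]
for w, h in sizes:
    print((w, h), in_threshold_set(w, h), in_bucket(w, h))
```

Under these assumptions the 8mp image passes the threshold rule but falls outside the 2mp bucket, which is why a thresholded set ends up with more images.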
## Why create this when we already have the 2mp+ dataset?
I find that the image count in our 2mp+ datasets is not enough for my current needs, so I had to dig deeper.
# Creation and quality
This dataset is NOT HAND-FILTERED.
I took a human-augmented batch approach, roughly as follows:
1. Attempted to download the full CC12M dataset, after filtering out some known bad sites. The images actually available came to around 5 million.
2. Threw out everything smaller than 1mp. That left around 770k.
3. Threw out images from some more watermarked sites I discovered.
4. Filtered out certain magic keywords such as "digital artwork", and many more.

This currently leaves around 630k images. Hope this is useful to people.
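
The automated part of the steps above can be sketched roughly as a single predicate over per-image records. This is not the actual script used. The record layout (`url`, `width`, `height`, `caption`), the blocked domain, the idea that keywords are matched against the caption text, and the reading of "1mp" as 1024x1024 pixels are all assumptions for illustration. The real site and keyword lists are not given in this card.

```python
from urllib.parse import urlparse

# Hypothetical placeholder lists; the real ones are not published here.
BLOCKED_DOMAINS = {"watermarked-stock.example"}
BLOCKED_KEYWORDS = ("digital artwork",)
MIN_PIXELS = 1024 * 1024  # assumption: "1mp" means 1024x1024 pixels


def keep(record):
    """Return True if a record passes the site, size, and keyword filters."""
    host = (urlparse(record["url"]).hostname or "").lower()
    # Drop images hosted on a blocked domain or any of its subdomains.
    if host in BLOCKED_DOMAINS or any(host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    # Drop images below the minimum pixel count.
    if record["width"] * record["height"] < MIN_PIXELS:
        return False
    # Drop images whose caption mentions a blocked keyword (case-insensitive).
    caption = record["caption"].lower()
    return not any(keyword in caption for keyword in BLOCKED_KEYWORDS)


# Made-up records, not taken from CC12M.
records = [
    {"url": "https://photos.example/a.jpg", "width": 2048, "height": 1536,
     "caption": "A dog on a beach"},
    {"url": "https://cdn.watermarked-stock.example/b.jpg", "width": 2048,
     "height": 1536, "caption": "A dog on a beach"},
    {"url": "https://photos.example/c.jpg", "width": 800, "height": 600,
     "caption": "A cat on a sofa"},
    {"url": "https://photos.example/d.jpg", "width": 2048, "height": 2048,
     "caption": "Digital artwork of a dragon"},
]
kept = [r for r in records if keep(r)]
print([r["url"] for r in kept])
```

In practice the size check would need the image dimensions from the downloaded file, and the "human-augmented" part (spotting new watermarked sites and keywords between batches) is not captured by a static filter like this.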
# See also
It is worth noting that some people seem to have uploaded the actual images to Hugging Face, e.g.:
https://huggingface.co/datasets/ooutlierr/cc12m-recaptioned
The issues there are:
1. The indexing scheme is not easily compatible with what I use here
2. I'm not sure what the most efficient way is to actually download, extract, and use those images.
3. The copyright situation there is dubious. I would imagine that copying those images and putting them up on HF publicly is most likely a copyright violation for some percentage of them.
Provider: QinboZhang



