cc12m-a_woman
Source: ModelScope community (魔搭社区). Updated 2025-12-04; indexed 2024-12-07.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/cc12m-a_woman
Resource summary:
# Description
This dataset is a convenience subset of https://huggingface.co/datasets/opendiffusionai/cc12m-cleaned/

I did a quick grep for "A woman", and then HAND-CURATED the results. That means I threw out anything with watermarks, site branding, or pretty much anything else I deemed would get in the way of ML training. I also chose only images with clear, sharp camera focus on the main subject, so these are high-quality images.

At present, I have only done a few thousand. I hope to work my way up to 100,000 images.

To acquire the images, download both the ".gz" file and the "crawl.sh" file. Then, after installing the img2dataset utility with "pip install img2dataset", run the crawl.sh script. You will probably want to edit the script to fit your needs, though.
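The contents of crawl.sh are not reproduced here, so the following is only a rough sketch of the kind of img2dataset invocation such a script might wrap. It assumes the ".gz" file is gzipped JSON Lines with `url` and `caption` fields; the file name `metadata.jsonl.gz`, the output folder name, and both column names are placeholders, not taken from the dataset. Check the real crawl.sh and the actual file before relying on any of them.

```python
import shlex


def build_img2dataset_command(url_list, output_folder,
                              input_format="jsonl.gz",
                              url_col="url",
                              caption_col="caption"):
    """Return an img2dataset command line as a list of arguments.

    All default values are assumptions for illustration; adjust them to
    match the real metadata file shipped with the dataset.
    """
    return [
        "img2dataset",
        "--url_list", url_list,          # path to the downloaded .gz file
        "--input_format", input_format,  # assumed: gzipped JSON Lines
        "--url_col", url_col,            # assumed column holding image URLs
        "--caption_col", caption_col,    # assumed column holding captions
        "--output_format", "files",      # write plain image files to disk
        "--output_folder", output_folder,
    ]


if __name__ == "__main__":
    # Hypothetical file and folder names, shown only to print the command.
    cmd = build_img2dataset_command("metadata.jsonl.gz", "cc12m-a_woman-images")
    print(shlex.join(cmd))
```

The function only builds the command and prints it, so nothing is downloaded until you run the printed line yourself (or pass the list to `subprocess.run`).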
## See also
For those who would prefer the information in parquet format, there is an interesting Hugging Face hack: the Hub auto-compiles a parquet file with the data. See
https://huggingface.co/datasets/opendiffusionai/cc12m-a_woman/tree/refs%2Fconvert%2Fparquet/default/train
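The link above points at a special `refs/convert/parquet` ref whose slashes are percent-encoded in the URL. As a small illustration of how that address is put together, here is a sketch that rebuilds it from the repository id. The helper name `parquet_tree_url` is made up for this example, and the `default`/`train` path segments are simply copied from the link above.

```python
from urllib.parse import quote

# Ref name under which the auto-converted parquet files are published.
PARQUET_REF = "refs/convert/parquet"


def parquet_tree_url(repo_id, config="default", split="train"):
    """Build the browser URL of the auto-converted parquet folder."""
    # safe="" forces "/" inside the ref name to be encoded as %2F.
    ref = quote(PARQUET_REF, safe="")
    return f"https://huggingface.co/datasets/{repo_id}/tree/{ref}/{config}/{split}"


if __name__ == "__main__":
    print(parquet_tree_url("opendiffusionai/cc12m-a_woman"))
```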
Provider:
maas
Created:
2024-12-04



