pixelprose/pixelprose-shards
Source: Hugging Face. Updated: 2025-12-14. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/pixelprose/pixelprose-shards
Description:
PixelProse-Shards is a large-scale multimodal dataset of roughly 16.8 million image-text-JSON triplets, split into three subsets: commonpool (6.52M), cc12m (9.06M), and redcaps (1.28M). Each tar file (500-600 MB) contains image files, raw text captions (.txt), and JSON files with the complete metadata. The data was generated with the Gemini-1.0-Pro model; some entries are duplicated and some URLs are broken. Early versions may need captions shorter than 50 characters filtered out. The dataset is suited to multimodal learning and text-image association tasks.
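The short-caption filter mentioned above could be sketched roughly as follows, using only the Python standard library. The layout inside each tar is an assumption here: the sketch supposes that the files of one sample share a basename (for example `000123.jpg`, `000123.txt`, `000123.json`) and that text files are UTF-8 encoded. The function name `iter_samples` and the shard file name in the usage note are hypothetical.

```python
import json
import os
import tarfile

MIN_CAPTION_CHARS = 50  # threshold mentioned in the dataset description


def iter_samples(tar_path, min_chars=MIN_CAPTION_CHARS):
    """Yield (key, caption, metadata) for samples whose caption is long enough.

    Assumption (not confirmed by the listing): members of one sample share
    a basename stem, e.g. 000123.jpg / 000123.txt / 000123.json.
    """
    captions, metadata = {}, {}
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            stem, ext = os.path.splitext(member.name)
            ext = ext.lower()
            if ext not in (".txt", ".json"):
                continue  # image bytes are skipped in this sketch
            data = tar.extractfile(member).read().decode("utf-8")
            if ext == ".txt":
                captions[stem] = data.strip()
            else:
                metadata[stem] = json.loads(data)
    # Keep only samples whose caption meets the length threshold.
    for key, caption in captions.items():
        if len(caption) >= min_chars:
            yield key, caption, metadata.get(key, {})
```

A call such as `for key, caption, meta in iter_samples("shard-00000.tar"): ...` would then visit only the samples whose caption has at least 50 characters. Duplicate entries and broken URLs are not handled here and would need a separate pass.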
Provider:
pixelprose



