Kaichengalex/RealSyn100M
Source: Hugging Face. Updated: 2025-07-14. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/Kaichengalex/RealSyn100M
Overview:
RealSyn is a multimodal dataset that combines real-world and synthetic texts to improve vision-language representation learning. A real-world data extraction pipeline collects high-quality images and texts, a hierarchical retrieval method associates each image with multiple semantically relevant real-world texts, and an image semantic augmented generation module produces the synthetic texts. The dataset is available in three sizes: 15M, 30M, and 100M. Experiments show that RealSyn improves vision-language representation learning and scales well.
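The hierarchical retrieval step mentioned above can be illustrated with a small coarse-to-fine sketch. This is not the dataset authors' implementation: the listing does not say what the hierarchy levels are, so the cluster-then-rank structure, the function names `cosine` and `hierarchical_retrieve`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hierarchical_retrieve(image_vec, clusters, top_k=2):
    """Hypothetical coarse-to-fine retrieval.

    clusters: list of (centroid, [(text, text_vec), ...]) pairs.
    Returns the top_k texts from the cluster closest to the image.
    """
    # Coarse step: choose the cluster whose centroid best matches the image.
    _, members = max(clusters, key=lambda c: cosine(image_vec, c[0]))
    # Fine step: rank only the texts inside that cluster.
    ranked = sorted(members, key=lambda m: cosine(image_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy embeddings (assumed values, not taken from the dataset).
clusters = [
    ([1.0, 0.0], [("a dog on grass", [0.9, 0.1]),
                  ("a puppy running", [0.8, 0.3]),
                  ("a wolf", [0.6, 0.5])]),
    ([0.0, 1.0], [("a red car", [0.1, 0.9]),
                  ("a city street", [0.2, 0.8])]),
]

result = hierarchical_retrieve([1.0, 0.05], clusters, top_k=2)
print(result)
```

Restricting the fine-grained ranking to one cluster keeps the per-image cost proportional to the cluster size rather than to the whole text corpus, which is the usual motivation for a hierarchical design at this scale.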
Provided by:
Kaichengalex



