Jyuhamdik/RealSyn15M

Name: Jyuhamdik/RealSyn15M
Creator: Jyuhamdik
Published: 2025-12-18 01:33:17
License: MIT (per the dataset card metadata below)

Source: Hugging Face · Updated: 2025-12-18 · Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/Jyuhamdik/RealSyn15M


Description:

---
dataset_info:
  features:
  - name: raw_image_url
    dtype: string
  - name: text1
    dtype: string
  - name: text2
    dtype: string
  - name: text3
    dtype: string
  - name: syn_text
    dtype: string
  splits:
  - name: train
    num_bytes: 8729944000
    num_examples: 10000
  download_size: 8729944000
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
---

<img src="Figure/logo_crop.png" width="15%">

# [ACM MM25] *RealSyn*: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

<a href="https://github.com/GaryGuTC">Tiancheng Gu</a>, <a href="https://kaicheng-yang0828.github.io">Kaicheng Yang</a>, Chaoyi Zhang, Yin Xie, <a href="https://github.com/anxiangsir">Xiang An</a>, Ziyong Feng, <a href="https://scholar.google.com/citations?user=JZzb8XUAAAAJ&hl=zh-CN">Dongnan Liu</a>, <a href="https://weidong-tom-cai.github.io/">Weidong Cai</a>, <a href="https://jiankangdeng.github.io">Jiankang Deng</a>

[![Static Badge](https://img.shields.io/badge/github-RealSyn_Dataset-blue?style=social)](https://github.com/deepglint/RealSyn) [![Static Badge](https://img.shields.io/badge/arxiv-2502.12513-blue)](https://arxiv.org/pdf/2502.12513)

## 💡 Introduction

<img src="Figure/motivation.jpg" width="45%">

Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.

<img src="Figure/data_filter.jpg" width="75%">

To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts.

<img src="Figure/framework.jpg" width="50%">

Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct *RealSyn*, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that *RealSyn* effectively advances vision-language representation learning and exhibits strong scalability.

## 💻 Dataset Information

### Topic Assessment

<img src="Figure/tsne.jpg" width="75%">

We ran LDA with 30 topics on 1M randomly sampled image-realistic text pairs. The figure above presents the proportions and examples for six topics: animal, food, airplane, flower, automotive, and landmark.

### Richness Assessment

<img src="Figure/Richness.png" width="50%">

We present the image-text similarity and text token distributions of 15M samples from YFCC15, LAION, *RealSyn*-R1 (the most relevant retrieved realistic text), and *RealSyn*-S1 (the semantic augmented synthetic text based on *RealSyn*-R1).

### Diversity Assessment

<img src="Figure/diversity_analysis.png" width="50%">

We randomly select 0.2M samples and count the number of unique entities in their captions to assess the data diversity of different datasets.
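
The diversity measurement described above can be approximated with a rough sketch. This is not the authors' implementation: the card does not say which entity extractor was used, so the code below counts distinct non-stop-word tokens as a crude stand-in for entities. The helper names (`unique_terms`, `diversity_score`), the stop-word list, and the toy records are all hypothetical; only the field names `raw_image_url`, `text1`, and `syn_text` come from the dataset metadata.

```python
import random
import re

# Hypothetical, deliberately tiny stop-word list. A real analysis would
# use a proper entity extractor instead of plain token counting.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "with", "at", "to", "is"}


def unique_terms(captions):
    """Return the set of distinct lowercase word tokens, excluding stop words."""
    terms = set()
    for caption in captions:
        for token in re.findall(r"[a-z]+", caption.lower()):
            if token not in STOP_WORDS:
                terms.add(token)
    return terms


def diversity_score(records, field, sample_size, seed=0):
    """Sample up to `sample_size` records and count unique terms in `field`."""
    rng = random.Random(seed)
    if len(records) <= sample_size:
        sample = records
    else:
        sample = rng.sample(records, sample_size)
    captions = [r[field] for r in sample if r.get(field)]
    return len(unique_terms(captions))


# Toy records that mimic the schema in the card metadata (values are made up).
records = [
    {
        "raw_image_url": "https://example.com/1.jpg",
        "text1": "A dog on the beach",
        "syn_text": "A brown dog running on a sandy beach",
    },
    {
        "raw_image_url": "https://example.com/2.jpg",
        "text1": "The dog and a ball",
        "syn_text": "A dog playing with a red ball",
    },
]

print("text1 unique terms:", diversity_score(records, "text1", sample_size=2))
print("syn_text unique terms:", diversity_score(records, "syn_text", sample_size=2))
```

Running the same function over equally sized samples from each dataset gives comparable counts; swapping `unique_terms` for a real entity extractor would bring the sketch closer to the assessment shown in the figure.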

## 📃 Performance Comparison

### Linear probe

<img src="Figure/linearprobe.jpg" width="85%">

### Zero-shot Transfer

<img src="Figure/transfer.jpg" width="85%">

### Zero-shot Retrieval

<img src="Figure/retrieval.jpg" width="75%">

## Dataset Contributors

This project would not have been possible without the invaluable contributions of the following individuals, who have been instrumental in data scraping and collection:

| Contributor | Email |
|------------------|----------|
| **Bin Qin** | skyqin@gmail.com |
| **Lan Wu** | bah-wl@hotmail.com |

## Citation

If you find this repository useful, please use the following BibTeX entry for citation.

```latex
@misc{gu2025realsyn,
  title={RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm},
  author={Tiancheng Gu and Kaicheng Yang and Chaoyi Zhang and Yin Xie and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
  year={2025},
  eprint={2502.12513},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
