Amshaker/Mobile-O-Pre-Train
Source: Hugging Face · Updated 2026-02-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Amshaker/Mobile-O-Pre-Train
---
license: cc-by-nc-4.0
task_categories:
- text-to-image
- image-to-text
tags:
- mobile-o
- multimodal
- pretraining
- cross-modal-alignment
pretty_name: Mobile-O Pre-Training Data
size_categories:
- 1M<n<10M
---
<div align="center">
<h1>
<img src="https://github.com/Amshaker/Mobile-O/blob/main/assets/mobile-o-logo.png?raw=true" width="30" /> Mobile-O Pre-Training Data
</h1>
**Cross-Modal Alignment · 9M Text-Image Pairs**
<p>
<a href="https://arxiv.org/abs/2602.20161"><img src="https://img.shields.io/badge/arXiv-2602.20161-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/Amshaker/Mobile-O"><img src="https://img.shields.io/badge/GitHub-Code-black.svg" alt="Code"></a>
<a href="https://amshaker.github.io/Mobile-O/"><img src="https://img.shields.io/badge/🌐-Project_Page-2563eb.svg" alt="Project Page"></a>
<a href="https://huggingface.co/collections/Amshaker/mobile-o-models"><img src="https://img.shields.io/badge/🤗-Models-yellow.svg" alt="Models"></a>
<a href="https://mobileo.cvmbzuai.com/"><img src="https://img.shields.io/badge/🚀-Live_Demo-10b981.svg" alt="Live Demo"></a>
</p>
</div>
## 📌 Overview
This dataset is used for **Stage 1: Cross-Modal Alignment** pre-training of [Mobile-O](https://github.com/Amshaker/Mobile-O), a unified multimodal model for on-device understanding and generation.
The goal of this stage is to align the DiT diffusion decoder and Mobile Conditioning Projector (MCP) with the frozen VLM backbone using large-scale text-image pairs.
## 📊 Dataset Composition
| Source | Samples | Description |
|--------|:-------:|-------------|
| [JourneyDB](https://journeydb.github.io/) | 4M | High-quality AI-generated images with captions |
| [BLIP3o-Pretrain-Short-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Short-Caption) | 5M | Each image paired with a short caption generated by Qwen/Qwen2.5-VL-7B-Instruct |
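As a quick check on the composition above, the two sources sum to the 9M pairs stated in the header, with JourneyDB contributing roughly 44% and BLIP3o-Pretrain-Short-Caption roughly 56%. The sketch below computes these shares. The counts are the rounded figures from the table, and reading the shares as sampling weights is an assumption for illustration, not something this card specifies.

```python
# Approximate sample counts, taken from the composition table (rounded to millions).
SOURCES = {
    "JourneyDB": 4_000_000,
    "BLIP3o-Pretrain-Short-Caption": 5_000_000,
}


def mixture_shares(counts):
    """Return each source's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL = sum(SOURCES.values())
SHARES = mixture_shares(SOURCES)

if __name__ == "__main__":
    print(f"total pairs: {TOTAL:,}")
    for name, share in SHARES.items():
        print(f"{name}: {share:.1%}")
```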
## 🏋️ Training Details
- **Stage:** 1 — Cross-Modal Alignment (Pre-training)
- **Trainable components:** DiT + Mobile Conditioning Projector (MCP)
- **Frozen components:** Visual encoders, LLM backbone, VAE
- **Script:** `pretrain.sh`
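The trainable/frozen split above can be expressed as a simple rule over parameter names. The following is a minimal sketch, not the logic of `pretrain.sh`: the prefixes `dit.`, `mcp.`, `vision_encoder.`, `llm.`, and `vae.` are hypothetical names chosen for illustration, and the actual module names in the Mobile-O code may differ.

```python
# Hypothetical module-name prefixes; the real Mobile-O names may differ.
TRAINABLE_PREFIXES = ("dit.", "mcp.")  # DiT + Mobile Conditioning Projector
FROZEN_PREFIXES = ("vision_encoder.", "llm.", "vae.")  # visual encoders, LLM backbone, VAE


def stage1_trainable(param_name):
    """Return True if a parameter should receive gradients in Stage 1."""
    if param_name.startswith(TRAINABLE_PREFIXES):
        return True
    if param_name.startswith(FROZEN_PREFIXES):
        return False
    # Fail loudly on unknown parameters rather than silently training or freezing them.
    raise ValueError(f"unrecognized parameter: {param_name}")


def split_params(param_names):
    """Partition parameter names into (trainable, frozen) lists, preserving order."""
    trainable = [n for n in param_names if stage1_trainable(n)]
    frozen = [n for n in param_names if not stage1_trainable(n)]
    return trainable, frozen
```

In a PyTorch training loop, a rule like this would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()` before building the optimizer.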
## 🔗 Related Resources
| Resource | Link |
|----------|------|
| 📄 Paper | [arXiv](https://arxiv.org/abs/2602.20161) |
| 💻 Code | [GitHub](https://github.com/Amshaker/Mobile-O) |
| 🤗 SFT Data | [Mobile-O-SFT](https://huggingface.co/datasets/Amshaker/Mobile-O-SFT) |
| 🤗 Post-Training Data | [Mobile-O-Post-Train](https://huggingface.co/datasets/Amshaker/Mobile-O-Post-Train) |
| 🤗 Model (0.5B) | [Mobile-O-0.5B](https://huggingface.co/Amshaker/Mobile-O-0.5B) |
| 🤗 Model (1.5B) | [Mobile-O-1.5B](https://huggingface.co/Amshaker/Mobile-O-1.5B) |
## 📄 Citation
```bibtex
@article{shaker2026mobileo,
title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
journal={arXiv preprint arXiv:2602.20161},
year={2026}
}
```
## 🙏 Acknowledgments
We gratefully acknowledge the following datasets used in constructing this pre-training corpus:
- [JourneyDB](https://journeydb.github.io/)
- [BLIP3o-Pretrain-Short-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Short-Caption)
Provider: Amshaker



