Puffin-4M
Source: ModelScope community (updated 2025-12-05, indexed 2025-11-03)
Download link: https://modelscope.cn/datasets/AI-ModelScope/Puffin-4M
# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
<p align="center">
📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a> | 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> | 🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a> | 📑 <a href="https://arxiv.org/abs/2510.08673">Paper</a>
</p>
## Dataset Details
Datasets and benchmarks that span vision, language, and camera modalities remain scarce in the domain of spatial multimodal intelligence.
To address this gap, we introduce **Puffin-4M**, a large-scale, high-quality dataset comprising 4 million vision-language-camera triplets.
Puffin-4M includes single-view images with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations across diverse indoor and outdoor scenarios.
Beyond single views, it also incorporates cross-view and aesthetic images, making it a versatile benchmark for both understanding and generation tasks.
<p align="center">
<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset.png?raw=true" alt="Puffin-4M" width="100%">
</p>
| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Dataset type** | Camera-centric understanding and generation |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
---
## Dataset Samples
The figure below shows samples from **Puffin-4M** for each task: camera-centric generation and understanding, world exploration, spatial imagination, and photographic guidance.
<p align="center">
<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset_samples.png?raw=true" alt="Puffin-4M-samples" width="100%">
</p>
### Directory Structure
```
DATA_PATH/
├─ training data/
│ ├─ cap_folder/ # captions, including scene descriptions and camera parameters
│ │ ├─ 000000.tar.gz
│ │ └─ ...
│ ├─ cap_folder_cot/ # captions with thinking, including spatial reasoning descriptions and camera parameters
│ │ ├─ 000000.tar.gz
│ │ └─ ...
│ ├─ local_folder/ # images
│ │ ├─ 000000.tar.gz
│ │ └─ ...
│ ├─ summary.json
│ ├─ cross_view/ # instruction tuning data for world exploration and spatial imagination
│ │ ├─ cap_folder/ # captions, including text descriptions and camera parameters
│ │ │ ├─ 000000.tar.gz
│ │ │ └─ ...
│ │ ├─ cap_folder_cam/ # captions, only including camera parameters
│ │ │ ├─ 000000.tar.gz
│ │ │ └─ ...
│ │ ├─ cap_folder_scene/ # captions, only including scene descriptions
│ │ │ ├─ 000000.tar.gz
│ │ │ └─ ...
│ │ ├─ local_folder/ # target views
│ │ │ ├─ 000000.tar.gz
│ │ │ └─ ...
│ │ ├─ local_folder_init/ # initial views
│ │ │ ├─ 000000.tar.gz
│ │ │ └─ ...
│ │ ├─ summary.json
│ ├─ photography/ # instruction tuning data for photographic guidance
│ │ ├─ cap_folder/ # captions, only including camera parameters
│ │ │ ├─ 000000.tar.gz
│ │ ├─ local_folder/ # images
│ │ │ ├─ 000000.tar.gz
│ │ ├─ summary.json
├─ benchmark/
│ ├─ Puffin-Und/
│ │ ├─ images/
│ │ │ ├─ 0000001.jpg
│ │ │ ├─ ...
│ │ ├─ cameras.csv
│ ├─ Puffin-Gen/
│ │ ├─ caption/
│ │ │ ├─ caption_src/
│ │ │ │ ├─ 0000001.json
│ │ │ │ ├─ ...
│ │ │ ├─ caption_degree/
│ │ │ │ ├─ 0000001.json
│ │ │ │ ├─ ...
│ │ │ ├─ caption_photographic_term/
│ │ │ │ ├─ 0000001.json
│ │ │ │ ├─ ...
│ │ ├─ camera/
│ │ │ ├─ 0000001.pt
│ │ │ ├─ ...
│ │ ├─ cameras.csv
└─ README.md
```
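The tree above stores captions and images in identically numbered shards (for example `cap_folder/000000.tar.gz` and `local_folder/000000.tar.gz`). The following is a minimal sketch for pairing such shards and peeking inside one of them. It assumes that shards sharing a file name belong together, which the tree suggests but does not state; the helper names `pair_shards` and `list_shard_members` are hypothetical, and the internal layout of each archive is not documented here, so the sketch only lists member names.

```python
import tarfile
from pathlib import Path
from typing import List, Tuple


def pair_shards(caption_dir: Path, image_dir: Path) -> List[Tuple[Path, Path]]:
    """Match caption and image shards that share a file name (e.g. 000000.tar.gz).

    Assumption: equal file names mean the two shards describe the same samples.
    """
    pairs = []
    for cap_shard in sorted(caption_dir.glob("*.tar.gz")):
        img_shard = image_dir / cap_shard.name
        if img_shard.exists():
            pairs.append((cap_shard, img_shard))
    return pairs


def list_shard_members(shard: Path, limit: int = 5) -> List[str]:
    """Return the names of the first `limit` regular files stored in a shard."""
    names: List[str] = []
    with tarfile.open(shard, "r:gz") as archive:
        for member in archive:
            if member.isfile():
                names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For instance, `pair_shards(Path(DATA_PATH) / "training data" / "cap_folder", Path(DATA_PATH) / "training data" / "local_folder")` would return the matched shard pairs, and `list_shard_members` on either element shows what a shard contains before it is fully extracted.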
### Dataset Download
You can download the entire Puffin-4M dataset using the following command:
```bash
hf download KangLiao/Puffin-4M --repo-type dataset
```
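The same download can also be scripted from Python, which is convenient when only part of the repository is needed. The sketch below uses `snapshot_download` from the `huggingface_hub` package with its `allow_patterns` filter. It assumes the repository's top-level folders match the directory tree shown earlier (for example `benchmark/`); the helper names `subset_patterns` and `download_subset` are hypothetical.

```python
from typing import List, Optional

REPO_ID = "KangLiao/Puffin-4M"  # repository name taken from the command above


def subset_patterns(top_level_dirs: List[str]) -> List[str]:
    """Build glob patterns that select everything under the given folders."""
    return [f"{name.strip('/')}/*" for name in top_level_dirs]


def download_subset(local_dir: str, top_level_dirs: Optional[List[str]] = None) -> str:
    """Download the whole dataset, or only the listed folders, into local_dir."""
    # Imported lazily so subset_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    patterns = subset_patterns(top_level_dirs) if top_level_dirs else None
    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns,  # None means "download everything"
    )
```

Calling `download_subset("./Puffin-4M", ["benchmark"])` would fetch only the benchmark files under that assumption, while omitting the second argument downloads the full repository.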
The whole dataset (training data and benchmark) is approximately **449 GB**. The camera maps are omitted from the uploaded training data because of their total size (~3 MB each, ~11.4 TB in total).
They can be regenerated with the script `scripts/camera/cam_dataset.py`, available in our [GitHub repository](https://github.com/KangLiao929/Puffin).
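Before regenerating the camera maps locally, it can help to estimate how much disk space they will need. The sketch below is a back-of-the-envelope calculation in decimal units (1 TB = 1,000,000 MB), assuming one map of about 3 MB per sample; the function names are hypothetical and the per-map size is only the approximate figure quoted above.

```python
import shutil


def camera_map_storage_tb(num_maps: int, mb_per_map: float = 3.0) -> float:
    """Estimate the total camera-map footprint in terabytes (decimal units)."""
    return num_maps * mb_per_map / 1_000_000


def has_free_space(path: str, required_gb: float) -> bool:
    """Check whether the filesystem holding `path` has at least required_gb free."""
    return shutil.disk_usage(path).free >= required_gb * 1e9


if __name__ == "__main__":
    # Roughly 4 million samples at ~3 MB per map.
    print(f"Estimated camera maps: {camera_map_storage_tb(4_000_000):.1f} TB")
    # The uploaded dataset itself is about 449 GB.
    print("Enough space for the download:", has_free_space(".", 449))
```

With 4 million maps at exactly 3 MB each the estimate is 12 TB, the same order as the ~11.4 TB quoted above; the difference is within what the approximate per-map size allows.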
### Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
```bibtex
@article{liao2025puffin,
title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
journal={arXiv preprint arXiv:2510.08673},
year={2025}
}
```
### License
This project is licensed under [NTU S-Lab License 1.0](LICENSE).
Provider: maas
Created: 2025-10-14



