Puffin-4M

Name: Puffin-4M
Creator: maas
Published: 2025-12-05 16:54:45
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/AI-ModelScope/Puffin-4M


Resource description:

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**

📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a> | 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> | 🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a> | 📑 <a href="https://arxiv.org/abs/2510.08673">Paper</a>

## Dataset Details

Datasets and benchmarks that span vision, language, and camera modalities remain scarce in the domain of spatial multimodal intelligence. To address this gap, we introduce **Puffin-4M**, a large-scale, high-quality dataset comprising 4 million vision-language-camera triplets. Puffin-4M includes single-view images with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations across diverse indoor and outdoor scenarios. Beyond single views, it also incorporates cross-view and aesthetic images, making it a versatile benchmark for both understanding and generation tasks.

<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset.png?raw=true" alt="Puffin-4M" width="100%">

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Dataset type** | Camera-centric understanding and generation |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |

---

## Dataset Samples

Samples from **Puffin-4M** for each task (camera-centric generation and understanding, world exploration, spatial imagination, and photographic guidance) are shown below.

<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset_samples.png?raw=true" alt="Puffin-4M-samples" width="100%">

### Directory Structure

```
DATA_PATH/
├─ training data/
│  ├─ cap_folder/                 # captions, including scene descriptions and camera parameters
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ cap_folder_cot/             # captions with thinking, including spatial reasoning descriptions and camera parameters
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ local_folder/               # images
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ summary.json
│  ├─ cross_view/                 # instruction tuning data for world exploration and spatial imagination
│  │  ├─ cap_folder/              # captions, including text descriptions and camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ cap_folder_cam/          # captions, only including camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ cap_folder_scene/        # captions, only including scene descriptions
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ local_folder/            # target views
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ local_folder_init/       # initial views
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ summary.json
│  ├─ photography/                # instruction tuning data for photographic guidance
│  │  ├─ cap_folder/              # captions, only including camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  ├─ local_folder/            # images
│  │  │  ├─ 000000.tar.gz
│  │  ├─ summary.json
├─ benchmark/
│  ├─ Puffin-Und/
│  │  ├─ images/
│  │  │  ├─ 0000001.jpg
│  │  │  ├─ ...
│  │  ├─ cameras.csv
│  ├─ Puffin-Gen/
│  │  ├─ caption/
│  │  │  ├─ caption_src/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  │  ├─ caption_degree/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  │  ├─ caption_photographic_term/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  ├─ camera/
│  │  │  ├─ 0000001.pt
│  │  │  ├─ ...
│  │  ├─ cameras.csv
└─ README.md
```

### Dataset Download

You can download the entire Puffin-4M dataset using the following command:

```bash
hf download KangLiao/Puffin-4M --repo-type dataset
```

The whole dataset (training data and benchmark) is approximately **449 GB** in size. Note that we omit the camera maps from the uploaded training data due to their large total size (~3 MB each, amounting to ~11.4 TB in total). However, these maps can be easily generated using the script `scripts/camera/cam_dataset.py`, available in our [GitHub repository](https://github.com/KangLiao929/Puffin).

### Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

```bibtex
@article{liao2025puffin,
  title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
  author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
  journal={arXiv preprint arXiv:2510.08673},
  year={2025}
}
```

### License

This project is licensed under [NTU S-Lab License 1.0](LICENSE).
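
### Reading the Shards (illustrative sketch)

The `.tar.gz` shards listed under Directory Structure can be read without unpacking them to disk. The sketch below is a minimal illustration and not the authors' loader. It assumes that a caption shard and an image shard with the same index (for example `cap_folder/000000.tar.gz` and `local_folder/000000.tar.gz`) contain files whose base names match; the dataset card does not state this, so check a real shard before relying on it. The helper names `iter_shard_members` and `pair_by_stem` are hypothetical.

```python
import tarfile
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_shard_members(shard_path: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, raw bytes) for every regular file in one .tar.gz shard."""
    with tarfile.open(shard_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            handle = archive.extractfile(member)
            if handle is not None:
                yield member.name, handle.read()


def pair_by_stem(caption_shard: Path, image_shard: Path) -> Dict[str, Tuple[bytes, bytes]]:
    """Pair caption and image payloads whose file names share a stem.

    ASSUMPTION (not confirmed by the dataset card): a caption and its image
    use the same base name and differ only in extension.
    """
    captions = {Path(name).stem: data for name, data in iter_shard_members(caption_shard)}
    pairs: Dict[str, Tuple[bytes, bytes]] = {}
    for name, data in iter_shard_members(image_shard):
        stem = Path(name).stem
        if stem in captions:
            pairs[stem] = (captions[stem], data)  # (caption bytes, image bytes)
    return pairs
```

Calling `pair_by_stem` on one caption shard and the image shard with the same index returns a dictionary keyed by base name. Entries that appear in only one of the two shards are silently dropped, which is a deliberate choice for a quick inspection tool; a training pipeline would more likely log or raise on such mismatches.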


Provider: maas

Created: 2025-10-14
