
Puffin-4M

ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Puffin-4M
Description:
# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**

<p align="center">
&nbsp;&nbsp;📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a>&nbsp;&nbsp;|&nbsp;&nbsp;🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2510.08673">Paper</a>&nbsp;&nbsp;
<br>
</p>

## Dataset Details

Datasets and benchmarks that span vision, language, and camera modalities remain scarce in the domain of spatial multimodal intelligence. To address this gap, we introduce **Puffin-4M**, a large-scale, high-quality dataset comprising 4 million vision-language-camera triplets. Puffin-4M includes single-view images with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations across diverse indoor and outdoor scenarios. Beyond single views, it also incorporates cross-view and aesthetic images, making it a versatile benchmark for both understanding and generation tasks.

<p align="center">
<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset.png?raw=true" alt="Puffin-4M" width="100%">
</p>

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Dataset type** | Camera-centric understanding and generation |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |

---

## Dataset Samples

We show the samples of our **Puffin-4M** for each task (camera-centric generation and understanding, world exploration, spatial imagination, and photographic guidance) as follows.

<p align="center">
<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/dataset_samples.png?raw=true" alt="Puffin-4M-samples" width="100%">
</p>

### Directory Structure

```
DATA_PATH/
├─ training data/
│  ├─ cap_folder/                  # captions, including scene descriptions and camera parameters
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ cap_folder_cot/              # captions with thinking, including spatial reasoning descriptions and camera parameters
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ local_folder/                # images
│  │  ├─ 000000.tar.gz
│  │  └─ ...
│  ├─ summary.json
│  ├─ cross_view/                  # instruction tuning data for world exploration and spatial imagination
│  │  ├─ cap_folder/               # captions, including text descriptions and camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ cap_folder_cam/           # captions, only including camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ cap_folder_scene/         # captions, only including scene descriptions
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ local_folder/             # target views
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ local_folder_init/        # initial views
│  │  │  ├─ 000000.tar.gz
│  │  │  └─ ...
│  │  ├─ summary.json
│  ├─ photography/                 # instruction tuning data for photographic guidance
│  │  ├─ cap_folder/               # captions, only including camera parameters
│  │  │  ├─ 000000.tar.gz
│  │  ├─ local_folder/             # images
│  │  │  ├─ 000000.tar.gz
│  │  ├─ summary.json
├─ benchmark/
│  ├─ Puffin-Und/
│  │  ├─ images/
│  │  │  ├─ 0000001.jpg
│  │  │  ├─ ...
│  │  ├─ cameras.csv
│  ├─ Puffin-Gen/
│  │  ├─ caption/
│  │  │  ├─ caption_src/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  │  ├─ caption_degree/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  │  ├─ caption_photographic_term/
│  │  │  │  ├─ 0000001.json
│  │  │  │  ├─ ...
│  │  ├─ camera/
│  │  │  ├─ 0000001.pt
│  │  │  ├─ ...
│  │  ├─ cameras.csv
└─ README.md
```

### Dataset Download

You can download the entire Puffin-4M dataset using the following command:

```bash
hf download KangLiao/Puffin-4M --repo-type dataset
```

The whole dataset (training data and benchmark) is approximately **449GB** in size. Note that we omit the camera maps from the uploaded training data due to their large total size (~3 MB each, amounting to ~11.4 TB in total). However, these maps can be easily generated using the provided script `scripts/camera/cam_dataset.py` available on our [GitHub repository](https://github.com/KangLiao929/Puffin).

### Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

```bibtex
@article{liao2025puffin,
  title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
  author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
  journal={arXiv preprint arXiv:2510.08673},
  year={2025}
}
```

### License

This project is licensed under [NTU S-Lab License 1.0](LICENSE).
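### Example: Pairing Caption and Image Shards (Sketch)

The directory structure above stores captions and images as identically numbered `.tar.gz` shards in sibling folders (`cap_folder/` and `local_folder/`). The following is a minimal sketch of how one might walk that layout after downloading. It rests on assumptions that the dataset card does not confirm: that a caption shard and an image shard with the same file name describe the same samples, and that shards are ordinary gzip-compressed tar archives. The helper names (`list_shards`, `paired_shards`, `member_names`) are hypothetical and are not part of the Puffin codebase. The sketch only lists archive members and makes no claim about the file format inside each shard.

```python
import tarfile
from pathlib import Path
from typing import Iterator, List, Tuple


def list_shards(folder: Path) -> List[str]:
    """Return the sorted shard file names (e.g. '000000.tar.gz') directly inside `folder`."""
    return sorted(p.name for p in folder.glob("*.tar.gz"))


def paired_shards(
    root: Path,
    caption_dir: str = "cap_folder",
    image_dir: str = "local_folder",
) -> Iterator[Tuple[Path, Path]]:
    """Yield (caption_shard, image_shard) paths for shard names present in both folders.

    ASSUMPTION: shards with the same file name correspond to the same samples.
    """
    captions = set(list_shards(root / caption_dir))
    images = set(list_shards(root / image_dir))
    for name in sorted(captions & images):
        yield root / caption_dir / name, root / image_dir / name


def member_names(shard: Path) -> List[str]:
    """List the regular files stored in one gzip-compressed tar shard."""
    with tarfile.open(shard, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]
```

A possible use, with `DATA_PATH` replaced by the local download location: call `paired_shards(Path("DATA_PATH") / "training data")` and pass each returned path to `member_names` to inspect what a shard contains before writing a loader. For the `cross_view/` subset, the same function could be called with `image_dir="local_folder_init"` to pair captions with initial views, again under the same-name assumption.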

Provider: maas
Created: 2025-10-14