VL3-Syn7M

Name: VL3-Syn7M
Creator: maas
Published: 2025-12-05 16:22:49
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/DAMO-NLP-SG/VL3-Syn7M

Resource overview:

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/0Xqwn1fhUByfjm-PmSyyW.png" width="150" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center">The re-caption dataset used in <a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</a></h3>

<h5 align="center">If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">GitHub</a> for the latest updates.</h5>

## 🌟 Introduction

This dataset contains the re-captioned data used to train VideoLLaMA3. It consists of 7 million diverse, high-quality images, each paired with a short caption and a detailed caption. The images come from [COYO-700M](https://github.com/kakaobrain/coyo-dataset), [MS-COCO 2017](https://cocodataset.org/#home), [CC-3M](https://ai.google.com/research/ConceptualCaptions/), and [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain), and their captions were re-annotated with [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl20-667d3961ab5eb12c7ed1463e). For more information about VL3-Syn7M, please read our [paper](https://arxiv.org/abs/2501.13106).

## 🤖 Quick Start

All information about the dataset is provided in `data.jsonl`. Each record describes one image: the `url` key holds the image link, the `data_source` key names the dataset the image came from, and the `original_id` key holds the image's ID in that original dataset. The `detailed_caption` and `short_caption` keys hold the detailed and short annotations of the image, respectively.

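The record layout described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes `data.jsonl` has already been downloaded to a local path and that each non-empty line is one JSON object with the five keys listed in the Quick Start. The helper names `iter_records`, `preview`, and `count_sources` are hypothetical and are not part of the dataset or of VideoLLaMA3.

```python
import json
from collections import Counter
from itertools import islice
from typing import Iterator, List

# Keys named in the Quick Start section of this card.
EXPECTED_KEYS = (
    "url",
    "data_source",
    "original_id",
    "detailed_caption",
    "short_caption",
)


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file.

    Streaming line by line avoids loading all records into memory at once.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def preview(path: str, n: int = 3) -> List[dict]:
    """Return the first `n` records, e.g. to inspect the fields by eye."""
    return list(islice(iter_records(path), n))


def count_sources(path: str) -> Counter:
    """Count how many records carry each `data_source` value."""
    return Counter(rec.get("data_source") for rec in iter_records(path))
```

Calling `preview("data.jsonl")` returns the first few records so the fields can be checked against `EXPECTED_KEYS`, and `count_sources("data.jsonl")` gives a per-source tally. The images themselves are not stored in the file; each one would still need to be fetched from its `url`.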
## Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}
```

Provider:

maas

Created:

2025-02-08

Summary

Dataset Introduction