
VL3-Syn7M

ModelScope community · Updated: 2025-12-05 · Indexed: 2025-02-15
Download link:
https://modelscope.cn/datasets/DAMO-NLP-SG/VL3-Syn7M
Resource description:
<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/0Xqwn1fhUByfjm-PmSyyW.png" width="150" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center">The re-caption dataset used in <a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</a></h3>

<h5 align="center">If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">GitHub</a> for the latest updates.</h5>

## 🌟 Introduction

This dataset contains the re-captioned data used to train VideoLLaMA3. It consists of 7 million diverse, high-quality images, each accompanied by a short caption and a detailed caption.

The images originate from [COYO-700M](https://github.com/kakaobrain/coyo-dataset), [MS-COCO 2017](https://cocodataset.org/#home), [CC-3M](https://ai.google.com/research/ConceptualCaptions/), and [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain), with captions re-annotated using [InternVL2](https://huggingface.co/collections/OpenGVLab/internvl20-667d3961ab5eb12c7ed1463e).

For more information about VL3-Syn7M, please read our [paper](https://arxiv.org/abs/2501.13106).

## 🤖 Quick Start

All information about the dataset is provided in `data.jsonl`. Each image record has the following keys:

- `url`: the image link.
- `data_source`: the dataset the image comes from.
- `original_id`: the image's ID in the original dataset.
- `detailed_caption`: the detailed annotation of the image.
- `short_caption`: the short annotation of the image.
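As a minimal sketch of how the records described in the Quick Start might be read, the snippet below streams `data.jsonl` line by line and checks that each record carries the five documented keys. It assumes the file is standard JSON Lines (one JSON object per line); the helper names `iter_records` and `count_by_source` are hypothetical and not part of any official tooling, and the value types of the keys (for example whether `original_id` is a string or a number) are not specified in this card.

```python
import json
from collections import Counter
from typing import Dict, Iterator

# Keys documented in the Quick Start section of this card.
EXPECTED_KEYS = (
    "url",
    "data_source",
    "original_id",
    "detailed_caption",
    "short_caption",
)


def iter_records(path: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per non-empty line of a JSON Lines file.

    Hypothetical helper: assumes one JSON object per line.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = [key for key in EXPECTED_KEYS if key not in record]
            if missing:
                raise KeyError(f"line {line_number}: missing keys {missing}")
            yield record


def count_by_source(path: str) -> Counter:
    """Count records per `data_source` value without loading the file into memory."""
    return Counter(str(record["data_source"]) for record in iter_records(path))
```

With the file downloaded locally, `for record in iter_records("data.jsonl"): print(record["url"], record["short_caption"])` would print each image link next to its short caption. Streaming is used here because a 7-million-line file may not fit comfortably in memory.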
## Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}
```

Provider: maas
Created: 2025-02-08
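Dataset introduction
Background overview
VL3-Syn7M is a dataset of 7 million high-quality images, each paired with a short caption and a detailed caption, used to train VideoLLaMA3. The images come from several public datasets, and the captions were re-annotated with InternVL2.
The summary above was compiled and generated by 遇见数据集.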