Multi-Source-Video-Captioning
Source: ModelScope community. Updated 2025-11-07; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/DAMO-NLP-SG/Multi-Source-Video-Captioning
Resource description:
# Multi-source Video Captioning (MSVC) Dataset Card
## Dataset details
**Dataset type:**
MSVC is a curated collection of video captioning data, constructed to provide a robust and thorough evaluation of the video-captioning capabilities of Video-LLMs.
**Dataset detail:**
MSVC is introduced to address limitations in existing video captioning benchmarks. It samples a total of 1,500 videos with human-annotated captions from [MSVD](https://www.aclweb.org/anthology/P11-1020/), [MSRVTT](http://openaccess.thecvf.com/content_cvpr_2016/papers/Xu_MSR-VTT_A_Large_CVPR_2016_paper.pdf), and [VATEX](http://openaccess.thecvf.com/content_ICCV_2019/papers/Wang_VaTeX_A_Large-Scale_High-Quality_Multilingual_Dataset_for_Video-and-Language_Research_ICCV_2019_paper.pdf), ensuring diverse scenarios and domains.
Traditional evaluation metrics rely on exact-match statistics between generated and ground-truth captions, which are limited in capturing the richness of video content. Thus, we use a ChatGPT-assisted evaluation similar to [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/quantitative_evaluation/README.md). Generated and human-annotated captions are passed to GPT-3.5-turbo (0613), which scores Correctness of Information and Detail Orientation.
It is worth noting that each video in the MSVC benchmark is annotated with multiple human-written captions, covering different aspects of the video. This comprehensive annotation ensures a robust and thorough evaluation of Video-LLMs.
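As a rough illustration of how the two judge scores might be reported, the sketch below averages per-video scores for each dimension. The record layout and the `summarize_scores` helper are assumptions made for this example; they are not taken from the VideoLLaMA2 code.

```python
from statistics import mean


def summarize_scores(records):
    """Average per-video judge scores for each evaluation dimension.

    `records` is a hypothetical list of dicts, one per video, e.g.
    {"video": "video9921.mp4", "correctness": 4, "detail": 3}.
    """
    if not records:
        raise ValueError("no records to summarize")
    return {
        "correctness": mean(r["correctness"] for r in records),
        "detail": mean(r["detail"] for r in records),
        "num_videos": len(records),
    }


# Made-up scores, used only to show the call.
example = [
    {"video": "video9921.mp4", "correctness": 4, "detail": 3},
    {"video": "9giWHf6Pf24.mp4", "correctness": 2, "detail": 3},
]
summary = summarize_scores(example)
print(summary)
```

Reporting the two dimensions separately keeps a model that writes accurate but terse captions distinguishable from one that writes detailed but partly wrong captions.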
**Data instructions:**
Please download the raw videos from their official websites and arrange them in the following structure:
```bash
VideoLLaMA2
├── eval
│   ├── MSVC
│   │   ├── msvd/
│   │   │   ├── lw7pTwpx0K0_38_48.avi
│   │   │   └── ...
│   │   ├── msrvtt/
│   │   │   ├── video9921.mp4
│   │   │   └── ...
│   │   ├── vatex/
│   │   │   ├── 9giWHf6Pf24.mp4
│   │   │   └── ...
```
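Before running an evaluation, it can help to confirm that the videos landed in the expected folders. The following is a minimal sketch, assuming the layout shown above; the `check_msvc_layout` helper and the list of accepted file extensions are illustrative assumptions inferred from the sample file names.

```python
from pathlib import Path

# Sub-folder names come from the tree above. The extensions are an assumption
# based on the sample file names shown there.
EXPECTED_SUBDIRS = ("msvd", "msrvtt", "vatex")
VIDEO_SUFFIXES = {".avi", ".mp4"}


def check_msvc_layout(repo_root):
    """Return {subdir: number of video files found}; raise if a folder is missing."""
    msvc_dir = Path(repo_root) / "eval" / "MSVC"
    counts = {}
    for name in EXPECTED_SUBDIRS:
        sub = msvc_dir / name
        if not sub.is_dir():
            raise FileNotFoundError(f"missing directory: {sub}")
        # Count only files whose extension looks like a video.
        counts[name] = sum(
            1 for p in sub.iterdir() if p.suffix.lower() in VIDEO_SUFFIXES
        )
    return counts
```

Calling `check_msvc_layout("VideoLLaMA2")` from the parent directory returns the per-source counts, which can then be compared against the number of videos you expect from each source.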
**GPT-3.5 evaluation prompts:**
```python
# Correctness evaluation:
{
"role": "system",
"content":
"You are an intelligent chatbot designed for evaluating the factual accuracy of generative outputs for video-based question-answer pairs. "
"Your task is to compare the predicted answer with these correct answers and determine if they are factually consistent. Here's how you can accomplish the task:"
"------"
"##INSTRUCTIONS: "
"- Focus on the factual consistency between the predicted answer and the correct answer. The predicted answer should not contain any misinterpretations or misinformation.\n"
"- The predicted answer must be factually accurate and align with the video content.\n"
"- Consider synonyms or paraphrases as valid matches.\n"
"- Evaluate the factual accuracy of the prediction compared to the answer."
},
{
"role": "user",
"content":
"Please evaluate the following video-based question-answer pair:\n\n"
f"Question: {question}\n"
f"Correct Answers: {answer}\n"
f"Predicted Answer: {pred}\n\n"
"Provide your evaluation only as a factual accuracy score where the factual accuracy score is an integer value between 0 and 5, with 5 indicating the highest level of factual consistency. "
"Please generate the response in the form of a Python dictionary string with keys 'score', where its value is the factual accuracy score in INTEGER, not STRING. "
"DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
"For example, your response should look like this: {'score': 4}."
}
```
```python
# Detailedness evaluation:
{
"role": "system",
"content": "You are an intelligent chatbot designed for evaluating the detail orientation of generative outputs for video-based question-answer pairs. "
"Your task is to compare the predicted answer with these correct answers and determine its level of detail, considering both completeness and specificity. Here's how you can accomplish the task:"
"------"
"##INSTRUCTIONS: "
"- Check if the predicted answer covers all major points from the video. The response should not leave out any key aspects.\n"
"- Evaluate whether the predicted answer includes specific details rather than just generic points. It should provide comprehensive information that is tied to specific elements of the video.\n"
"- Consider synonyms or paraphrases as valid matches.\n"
"- Provide a single evaluation score that reflects the level of detail orientation of the prediction, considering both completeness and specificity.",
},
{
"role": "user",
"content": "Please evaluate the following video-based question-answer pair:\n\n"
f"Question: {question}\n"
f"Correct Answers: {answer}\n"
f"Predicted Answer: {pred}\n\n"
"Provide your evaluation only as a detail orientation score where the detail orientation score is an integer value between 0 and 5, with 5 indicating the highest level of detail orientation. "
"Please generate the response in the form of a Python dictionary string with keys 'score', where its value is the detail orientation score in INTEGER, not STRING. "
"DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
"For example, your response should look like this: {'score': 4}.",
}
```
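Both prompts ask the judge to reply with a Python dictionary string, so the reply has to be parsed before it can be scored. The sketch below shows one way this could be done with the standard library. The `parse_score` helper, and the choice to round non-integer values and reject out-of-range ones, are assumptions for illustration rather than the behavior of the official evaluation script.

```python
import ast


def parse_score(reply):
    """Extract an integer score in [0, 5] from a reply such as "{'score': 4}".

    Returns None when the reply cannot be parsed, so callers can retry or skip.
    """
    try:
        # literal_eval only evaluates Python literals, never arbitrary code.
        parsed = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None
    if not isinstance(parsed, dict) or "score" not in parsed:
        return None
    score = parsed["score"]
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    score = int(round(score))
    return score if 0 <= score <= 5 else None
```

Returning `None` instead of raising lets the calling loop decide whether to re-query the judge or drop the sample, which matters because a chat model does not always follow the requested output format.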
**Dataset date:**
MSVC was released in June 2024.
**Paper or resources for more information:**
https://github.com/DAMO-NLP-SG/VideoLLaMA2
**Where to send questions or comments about the dataset:**
https://github.com/DAMO-NLP-SG/VideoLLaMA2/issues
## Citation
If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{damonlpsg2024videollama2,
title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
journal={arXiv preprint arXiv:2406.07476},
year={2024},
url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
author = {Zhang, Hang and Li, Xin and Bing, Lidong},
journal = {arXiv preprint arXiv:2306.02858},
year = {2023},
url = {https://arxiv.org/abs/2306.02858}
}
```
## Intended use
**Primary intended uses:**
The primary use of MSVC is research on Video-LLMs.
**Primary intended users:**
The primary intended users of the dataset are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
Provider:
maas
Created:
2025-01-20
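Summary:
MSVC is a video captioning dataset of 1,500 diverse videos, built to comprehensively evaluate the caption generation ability of video-language models, with GPT-3.5-turbo used for assisted evaluation.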