SM-MrHiSum and SM-VideoXum Datasets for Script-driven Multimodal Video Summarization

Name: SM-MrHiSum and SM-VideoXum Datasets for Script-driven Multimodal Video Summarization
Creator: Zenodo
Published: 2026-05-06 10:53:44
License: No description available

DataCite Commons | Updated 2026-05-06 | Indexed 2026-05-07

Download link:

https://zenodo.org/doi/10.5281/zenodo.17294444

Resource description:

SM-MrHiSum and SM-VideoXum are two large-scale datasets suitable for training and evaluating methods for script-driven multimodal video summarization.

The original MrHiSum dataset (Sul et al., 2024) was constructed from a curated subset of YouTube-8M videos, where highlight annotations were derived from YouTube's "Most Replayed" statistics. These replay statistics, aggregated from at least 50 unique viewers per video, serve as a reliable indicator of audience engagement. Each video was annotated at the frame level with importance scores representing highlight intensity. Ground-truth video summaries were generated from a predefined temporal segmentation of each video by solving the Knapsack problem under a given time budget for the summary duration, ensuring that the resulting summaries are concise while covering the key highlights. In total, the dataset contains 31,892 videos and the associated ground-truth annotations, supporting the training and evaluation of methods for video highlight detection and summarization.

To make MrHiSum suitable for script-driven multimodal video summarization, we extended it by producing textual descriptions of the human-annotated summaries and extracting audio transcripts, forming the SM-MrHiSum dataset. For this, the visual content of each ground-truth video summary (sampled at 1 fps) was described by Qwen3-VL-8B-Instruct, which was prompted to "describe the scenery and the main persons and activities shown in the video". Audio transcripts were extracted through a two-step pipeline: first, speech was isolated from background noise using a pretrained Silero model for voice activity detection; then, speech-to-text was performed using a pretrained Whisper model, which outputs a series of timestamped transcripts. The resulting SM-MrHiSum dataset contains 29,917 videos, where each video is associated with: a) a ground-truth summary, b) a textual description of this summary, and c) a set of timestamped audio transcripts.
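The knapsack-based summary selection mentioned above for MrHiSum can be illustrated with a minimal sketch. It is not the dataset authors' code: the function names `segment_scores` and `knapsack_select`, the use of the mean frame score as the segment score, and the integer frame budget are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def segment_scores(
    frame_scores: Sequence[float],
    boundaries: Sequence[Tuple[int, int]],
) -> List[float]:
    """Score each segment as the mean of its frame-level importance scores.

    Assumption: boundaries are half-open (start, end) frame index pairs.
    """
    return [sum(frame_scores[a:b]) / (b - a) for a, b in boundaries]


def knapsack_select(
    scores: Sequence[float],
    durations: Sequence[int],
    budget: int,
) -> List[int]:
    """Return the indices of the segments that maximize the total score
    while their total duration stays within the budget (0/1 knapsack)."""
    n = len(scores)
    # best[i][c]: best total score using the first i segments with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d, s = durations[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if d <= c and best[i - 1][c - d] + s > best[i][c]:
                best[i][c] = best[i - 1][c - d] + s
    # Backtrack: a changed value means segment i - 1 was taken
    selected = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= durations[i - 1]
    return sorted(selected)


# Toy example: 10 frames, five 2-frame segments, budget of 4 frames
frames = [0.1, 0.1, 0.9, 0.9, 0.2, 0.2, 0.8, 0.8, 0.1, 0.1]
bounds = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
scores = segment_scores(frames, bounds)
durations = [b - a for a, b in bounds]
print(knapsack_select(scores, durations, budget=4))  # -> [1, 3]
```

In the toy example the two highest-scoring segments fit the budget exactly, so they form the summary. In practice the budget would be derived from the target summary duration.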
The SM-VideoXum dataset is an extension of the VideoXum dataset for cross-modal video summarization, which makes it suitable for training and evaluating methods for script-driven multimodal video summarization. The multiple ground-truth summaries available for each VideoXum video were associated with textual descriptions of their visual content, generated by prompting Qwen3-VL-8B-Instruct to "describe the scenery and the main persons and activities shown in the video". Moreover, audio transcripts were extracted from the full-length videos following the approach described above for the videos of the SM-MrHiSum dataset. The resulting SM-VideoXum dataset contains 11,908 videos, where each video is associated with: a) 10 ground-truth summaries, b) 10 textual descriptions of its summaries (one description per summary), and c) a set of timestamped audio transcripts.

In our implementations and experiments, all the visual, textual, and transcript data of the SM-MrHiSum and SM-VideoXum datasets have been represented using CLIP-based embeddings. The details of the scripts, embeddings, and all other data that we release as part of this repository are reported in SD-MVSum_Datasets_readme.md. More information on the released datasets, along with technical details of the SD-MVSum script-driven multimodal video summarization method that we developed, can be found in the following preprint: https://arxiv.org/abs/2510.05652
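To make the per-video structure of SM-VideoXum concrete, the following is a hypothetical in-memory record, not the released file format (which is documented in SD-MVSum_Datasets_readme.md). The class names, field names, and the representation of a summary as a list of (start, end) spans in seconds are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical representation: a summary is a list of (start_sec, end_sec) spans
Span = Tuple[float, float]


@dataclass
class TranscriptSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed speech for this segment


@dataclass
class VideoXumEntry:
    video_id: str
    summaries: List[List[Span]]   # 10 ground-truth summaries per video
    descriptions: List[str]       # one textual description per summary
    transcripts: List[TranscriptSegment] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The dataset pairs every summary with exactly one description
        if len(self.summaries) != len(self.descriptions):
            raise ValueError("each summary needs exactly one description")

    def pairs(self) -> List[Tuple[List[Span], str]]:
        """Return (summary, description) pairs in order."""
        return list(zip(self.summaries, self.descriptions))
```

Keeping summaries and descriptions as parallel lists mirrors the one-description-per-summary pairing, and the check in `__post_init__` catches mismatched records early.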

Provider:

Zenodo

Created:

2025-10-08
