YTSEG
Source: arXiv | Updated: 2024-02-27 | Indexed: 2024-06-21
Download link:
https://huggingface.co/datasets/retkowski/ytseg
Description:
YTSEG is a benchmark dataset for structuring video transcripts, developed at the Karlsruhe Institute of Technology. It contains 19,299 English-language YouTube videos together with their transcripts and chapters, and is intended for evaluating text segmentation systems on less structured, more diverse content. The videos, transcripts, and chapters were collected with yt-dlp and preprocessed for the text segmentation task. YTSEG supports unimodal analysis and also paves the way for multimodal approaches. Its videos span domains including science, lifestyle, politics, health, economics, and technology, and the dataset targets text segmentation challenges that arise in real-world applications, particularly with unstructured content.
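To make the segmentation task concrete, the following is a minimal sketch of one common way to frame chaptered transcripts for a text segmentation system: flatten the chapters into a single sentence sequence and mark the first sentence of each chapter as a boundary. The function name `boundary_labels` and the nested-list input layout are assumptions made for illustration; they are not taken from the dataset's actual schema.

```python
from typing import List, Tuple


def boundary_labels(chapters: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Flatten chapters into sentences plus 0/1 boundary labels.

    `chapters` is a hypothetical layout: one list of sentences per chapter.
    A label of 1 marks the first sentence of a chapter, 0 any other sentence.
    """
    sentences: List[str] = []
    labels: List[int] = []
    for chapter in chapters:
        for position, sentence in enumerate(chapter):
            sentences.append(sentence)
            labels.append(1 if position == 0 else 0)
    return sentences, labels


# Toy transcript with three chapters (invented example text).
example = [
    ["Welcome to the video.", "Today we cover two topics."],
    ["First, the setup."],
    ["Next, the results.", "They look promising.", "Thanks for watching."],
]
sentences, labels = boundary_labels(example)
```

A segmentation model would then predict one label per sentence, and its predictions can be compared against `labels`.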
Provided by:
Karlsruhe Institute of Technology
Created:
2024-02-27



