
gigant/tib

Hugging Face | Updated 2024-07-18 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/gigant/tib
Overview:
TIB is an English dataset for abstractive summarization of multimodal presentations. It contains 9,103 videoconference records extracted from the archive of the German National Library of Science and Technology (TIB), each with its metadata, an abstract, an automatically processed transcript, and key frames. The dataset is split into training, validation, and test sets of 7,282, 910, and 911 samples, respectively. Each data point represents one videoconference record and combines textual and visual modalities (the transcript and key frames) with an abstract that serves as the target summary. The records were collected by crawling the TIB-AV portal and filtering out those that did not meet the selection criteria. The dataset was created by Théo Gigant et al. and presented at the CBMI conference in 2023.
Provider:
gigant
Summary of original information

Dataset overview

Name: TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records

Language: English

Task: abstractive summarization

Dataset size: 7,282 training samples, 910 validation samples, and 911 test samples
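
As a quick arithmetic check on the sizes above, the three splits add up to the 9,103 records reported for the dataset and correspond to roughly an 80/10/10 partition. A minimal sketch (the variable names are illustrative only):

```python
# Split sizes as listed above.
split_sizes = {"train": 7282, "valid": 910, "test": 911}

total = sum(split_sizes.values())

# Share of each split as a percentage, rounded to one decimal place.
split_share = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)        # 9103
print(split_share)  # {'train': 80.0, 'valid': 10.0, 'test': 10.0}
```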

Dataset structure:

  • Features:
    • doi: string
    • title: string
    • url: string
    • video_url: string
    • license: string
    • subject: string
    • genre: string
    • release_year: string
    • author: string
    • contributors: string
    • abstract: string
    • transcript: string
    • transcript_segments: sequence containing id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob
    • keyframes: sequence containing slide, frames, timestamp
    • language: string
  • Data splits:
    • train: 7,282 samples
    • valid: 910 samples
    • test: 911 samples
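
The structure above suggests how a record might be read for summarization: the `transcript` field as the input text and the `abstract` field as the target. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository ID `gigant/tib` (taken from the download link) resolves on the Hub. The helper names `load_tib_split` and `summarization_pair` are hypothetical and not part of any published API.

```python
from typing import Any, Mapping, Tuple

# Split names as listed in the dataset structure.
SPLITS = ("train", "valid", "test")


def load_tib_split(split: str = "train"):
    """Load one split from the Hugging Face Hub (requires network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so summarization_pair works without the library installed.
    from datasets import load_dataset

    return load_dataset("gigant/tib", split=split)


def summarization_pair(record: Mapping[str, Any]) -> Tuple[str, str]:
    """Return (source text, target summary) for a single record."""
    return record["transcript"], record["abstract"]
```

With network access, `load_tib_split("valid")` should return the validation split, and passing any of its records to `summarization_pair` gives the transcript and abstract as a pair. The visual modality (`keyframes`) is left out of this sketch on purpose.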

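Dataset source: archive of the German National Library of Science and Technology (TIB)

Dataset creation:

  • Initial data collection: video records were collected by crawling the TIB-AV portal
  • Data filtering: records with a non-English abstract or transcript were removed, as were records with duplicate abstracts

Dataset maintainers: Théo Gigant, Frédéric Dufaux, Camille Guinaudeau, Marc Decombas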
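
The two filtering rules can be sketched as follows. This is not the authors' pipeline: the listing does not say how language was detected, so the sketch takes a caller-supplied `is_english` predicate (a hypothetical stand-in for a real language identifier). It also assumes that abstracts count as duplicates when they match exactly after trimming whitespace, and that the first occurrence is the one kept.

```python
from typing import Callable, Dict, Iterable, List


def filter_records(
    records: Iterable[Dict[str, str]],
    is_english: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Keep records with an English abstract and transcript, dropping repeated abstracts."""
    seen_abstracts = set()
    kept = []
    for record in records:
        abstract = record["abstract"].strip()
        if not (is_english(abstract) and is_english(record["transcript"])):
            continue  # non-English abstract or transcript
        if abstract in seen_abstracts:
            continue  # duplicate abstract (assumption: first occurrence wins)
        seen_abstracts.add(abstract)
        kept.append(record)
    return kept


# Toy predicate for the demo only; a real pipeline would use a language identifier.
def toy_is_english(text: str) -> bool:
    return text.isascii()


sample = [
    {"abstract": "A talk on graphs.", "transcript": "Welcome everyone."},
    {"abstract": "A talk on graphs.", "transcript": "Hello again."},
    {"abstract": "Ein Vortrag über Graphen.", "transcript": "Willkommen."},
]
filtered = filter_records(sample, toy_is_english)
print(len(filtered))  # 1
```

Passing the predicate in as an argument keeps the deduplication logic independent of whichever language detector is eventually chosen.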

Citation:

@inproceedings{gigant:hal-04168911,
  TITLE = {{TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records}},
  AUTHOR = {GIGANT, Théo and Dufaux, Frédéric and Guinaudeau, Camille and Decombas, Marc},
  URL = {https://hal.science/hal-04168911},
  BOOKTITLE = {{Proc. 20th International Conference on Content-based Multimedia Indexing (CBMI 2023)}},
  ADDRESS = {Orléans, France},
  ORGANIZATION = {{ACM}},
  YEAR = {2023},
  MONTH = Sep,
  KEYWORDS = {multimedia dataset, multimodal documents, automatic summarization},
  HAL_ID = {hal-04168911},
  HAL_VERSION = {v1},
}
