gigant/tib
Dataset overview
Name: TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records
Language: English
Task: Abstractive summarization
Dataset size: 7,282 training samples, 910 validation samples, 911 test samples
Dataset structure:
- Features:
  - doi: string
  - title: string
  - url: string
  - video_url: string
  - license: string
  - subject: string
  - genre: string
  - release_year: string
  - author: string
  - contributors: string
  - abstract: string
  - transcript: string
  - transcript_segments: sequence of id, seek, start, end, text, tokens, temperature, avg_logprob, compression_ratio, no_speech_prob
  - keyframes: sequence of slide, frames, timestamp
  - language: string
- Data splits:
  - train: 7,282 samples
  - valid: 910 samples
  - test: 911 samples
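To make the nested transcript_segments field more concrete, here is a minimal Python sketch of how its per-segment values might be regrouped and queried by time. It rests on several assumptions that the card does not state: that the sequence arrives in columnar form (one list per sub-field), that start and end are expressed in seconds, and that no_speech_prob is a probability in [0, 1]. The helper names and the 0.5 threshold are hypothetical, and the toy values are made up.

```python
from typing import Any, Dict, List


def segments_to_rows(segments: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a columnar sequence ({"start": [...], "text": [...]}) into one dict per segment."""
    keys = list(segments)
    return [dict(zip(keys, values)) for values in zip(*(segments[k] for k in keys))]


def text_between(rows: List[Dict[str, Any]], start: float, end: float,
                 max_no_speech_prob: float = 0.5) -> str:
    """Join the text of segments overlapping [start, end), skipping likely non-speech.

    Assumes start/end are in seconds; the default threshold is an arbitrary choice.
    """
    kept = [
        row["text"].strip()
        for row in rows
        if row["start"] < end and row["end"] > start
        and row["no_speech_prob"] <= max_no_speech_prob
    ]
    return " ".join(kept)


# Toy data using a subset of the listed sub-fields; the values are invented.
toy_segments = {
    "start": [0.0, 4.0, 9.5],
    "end": [4.0, 9.5, 12.0],
    "text": [" Welcome to the talk.", " [music]", " Today we discuss summarization."],
    "no_speech_prob": [0.02, 0.91, 0.05],
}
rows = segments_to_rows(toy_segments)
print(text_between(rows, 0.0, 12.0))
```

If the segments are already available as a list of per-segment dictionaries, the regrouping step can be skipped and text_between applied directly.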
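Dataset source: archive of the German National Library of Science and Technology (TIB)
Dataset creation:
- Initial data collection: video records were collected by crawling the TIB AV-Portal
- Data filtering: records with non-English abstracts or transcripts were removed, as were records with duplicate abstracts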
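The filtering step above can be illustrated with a short Python sketch. This is not the maintainers' pipeline: the card does not say how language was detected or how duplicates were matched. The sketch assumes duplicates are compared after lowercasing and whitespace normalization, and it takes the language check as a caller-supplied function. The stopword heuristic below is a hypothetical stand-in for a real language detector, and the toy records are invented.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in vocabulary; a real pipeline would use a proper language detector.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "is", "we", "a", "in"}


def looks_english(text: str) -> bool:
    """Crude heuristic: at least 20% of the words are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(word in ENGLISH_STOPWORDS for word in words)
    return hits / len(words) >= 0.2


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def filter_records(records: Iterable[Dict[str, str]],
                   is_english: Callable[[str], bool]) -> List[Dict[str, str]]:
    """Keep records whose abstract and transcript pass is_english and whose abstract is new."""
    seen = set()
    kept = []
    for record in records:
        if not (is_english(record["abstract"]) and is_english(record["transcript"])):
            continue  # non-English abstract or transcript
        key = normalize(record["abstract"])
        if key in seen:
            continue  # duplicate abstract
        seen.add(key)
        kept.append(record)
    return kept


# Invented toy records: one valid, one non-English, one duplicate abstract.
toy_records = [
    {"doi": "toy-1", "abstract": "We present a dataset of talks.",
     "transcript": "Welcome to the talk and thanks."},
    {"doi": "toy-2", "abstract": "Wir stellen einen Datensatz vor.",
     "transcript": "Willkommen zum Vortrag."},
    {"doi": "toy-3", "abstract": "We present a  dataset of TALKS.",
     "transcript": "This is the second copy of the talk."},
]
kept = filter_records(toy_records, looks_english)
print([record["doi"] for record in kept])
```

Passing the language check in as a function keeps the deduplication logic independent of whichever detector is actually used.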
Dataset maintainers: Théo Gigant, Frédéric Dufaux, Camille Guinaudeau, Marc Decombas
Citation information:
@inproceedings{gigant:hal-04168911,
  TITLE = {{TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records}},
  AUTHOR = {GIGANT, Théo and Dufaux, Frédéric and Guinaudeau, Camille and Decombas, Marc},
  URL = {https://hal.science/hal-04168911},
  BOOKTITLE = {{Proc. 20th International Conference on Content-based Multimedia Indexing (CBMI 2023)}},
  ADDRESS = {Orléans, France},
  ORGANIZATION = {{ACM}},
  YEAR = {2023},
  MONTH = Sep,
  KEYWORDS = {multimedia dataset, multimodal documents, automatic summarization},
  HAL_ID = {hal-04168911},
  HAL_VERSION = {v1},
}




