PS5k
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/atharsefid/extractive_research_slide_generation_using_windowed_labeling_ranking
Resource description:
This dataset contains more than 5,000 paper-slide pairs collected from the websites of various conference proceedings, intended for training scientific document summarization models. On average, each presentation has 35 slides, with 8 lines of text per slide. The pairs are split into a training set (4,500 pairs), a validation set (250 pairs), and a test set (250 pairs).
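To make the split sizes concrete, here is a minimal sketch of carving 5,000 pair IDs into 4,500/250/250 subsets. The function name `split_pairs`, the integer IDs, and the seeded shuffle are illustrative assumptions; the description does not say how the published split was produced, so this will not reproduce it.

```python
import random


def split_pairs(pair_ids, n_val=250, n_test=250, seed=0):
    """Shuffle the IDs, carve off validation and test sets, keep the rest for training.

    Hypothetical helper: the dataset ships with its own split, which should be
    preferred whenever results need to be comparable.
    """
    ids = sorted(pair_ids)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)    # seeded shuffle, independent of global state
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


# Integer IDs stand in for the paper-slide pairs (an assumption for this demo).
train, val, test = split_pairs(range(5000))
print(len(train), len(val), len(test))  # 4500 250 250
```

The three subsets are disjoint by construction, since each is a separate slice of the same shuffled list.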
Provided by:
Manually curated
Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
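This dataset targets the task of extractive research slide generation and is built from academic papers and their corresponding slides. It contains files of sentences extracted from the papers (e.g., *.sents.txt) and windowed label files (e.g., *.windowed_summarunner_scores.txt), which are used to train and evaluate machine learning models. Training, validation, and test splits are supported.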
The content above was collected and summarized by 遇见数据集.
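As an illustration of how the sentence files and windowed label files mentioned in the background overview might be used together, the sketch below pairs each sentence with a score and keeps the highest-scoring ones. The file layout is an assumption, not something the description confirms: it supposes one sentence per line in `*.sents.txt` and one numeric score per sentence, in the same order, in `*.windowed_summarunner_scores.txt`. The helper names and the demo file contents are hypothetical.

```python
from pathlib import Path
import tempfile


def load_labeled_sentences(sents_path, scores_path):
    """Pair each sentence with its score.

    ASSUMPTION: one sentence per line, and whitespace-separated numeric scores
    in the same order. Check the real files before relying on this.
    """
    sentences = Path(sents_path).read_text(encoding="utf-8").splitlines()
    scores = [float(tok) for tok in Path(scores_path).read_text(encoding="utf-8").split()]
    if len(sentences) != len(scores):
        raise ValueError(f"{len(sentences)} sentences but {len(scores)} scores")
    return list(zip(sentences, scores))


def top_k_sentences(labeled, k):
    """Return the k highest-scoring sentences, kept in document order."""
    best = sorted(range(len(labeled)), key=lambda i: labeled[i][1], reverse=True)[:k]
    return [labeled[i][0] for i in sorted(best)]


# Demo on made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sents = Path(tmp) / "demo.sents.txt"
    scores = Path(tmp) / "demo.windowed_summarunner_scores.txt"
    sents.write_text("We study slide generation.\nOur model ranks sentences.\nResults improve.\n",
                     encoding="utf-8")
    scores.write_text("0.1\n0.9\n0.5\n", encoding="utf-8")
    labeled = load_labeled_sentences(sents, scores)
    selected = top_k_sentences(labeled, k=2)

print(selected)  # ['Our model ranks sentences.', 'Results improve.']
```

Returning the selected sentences in document order, rather than score order, keeps the extract readable as slide content. That is a design choice of this sketch.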



