PS5k

Name: PS5k
Creator: Manually curated
Published: 2025-09-30T13:37:52+08:00

Indexed from arXiv on 2025-09-30

Text summarization

Document processing

Data link:

https://github.com/atharsefid/extractive_research_slide_generation_using_windowed_labeling_ranking

Resource description:

This dataset contains over 5,000 paper-slide pairs collected from the proceedings websites of various conferences. It is intended for training models that summarize scientific documents. On average, each presentation has 35 slides, and each slide has 8 lines of text. The data is split into a training set (4,500 pairs), a validation set (250 pairs), and a test set (250 pairs).
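The split sizes stated above can be illustrated with a minimal sketch. The listing does not say how the pairs were assigned to each set, so the seeded random shuffle, the function name `split_pairs`, and the placeholder pair IDs are all assumptions made for illustration, not the dataset authors' procedure.

```python
import random


def split_pairs(pairs, n_val=250, n_test=250, seed=0):
    """Shuffle paper-slide pairs and cut them into train/val/test lists.

    Hypothetical helper: the real PS5k split procedure is not documented
    in this listing, so a seeded random shuffle is assumed here.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) - n_val - n_test
    if n_train <= 0:
        raise ValueError("not enough pairs for the requested split sizes")
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder IDs stand in for real (paper, slides) pairs.
pairs = [f"pair_{i:04d}" for i in range(5000)]
train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))  # 4500 250 250
```

With exactly 5,000 pairs, holding out 250 for validation and 250 for testing leaves 4,500 for training, which matches the sizes listed in the description.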

Provided by:

Manually curated

Collection summary

Dataset introduction