Perseus
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/learnitboy/perseus
Resource description:
Perseus is the first cross-lingual summarization (CLS) dataset for long documents. It contains approximately 94,000 pairs of Chinese academic documents and their corresponding English summaries. The documents average more than 2,000 words in length, and the compression ratios of the training, validation, and test sets are 14.3, 14.4, and 14.3, respectively. The dataset is split into 82,000 training samples, 6,000 validation samples, and 6,000 test samples.
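
As a rough illustration of the figures above, the following Python sketch computes a compression ratio for tokenized document/summary pairs and checks that the stated split sizes add up to the stated total. It rests on assumptions not confirmed by the description: that the compression ratio is the document length divided by the summary length, that lengths are counted in already-tokenized words, and that a split's ratio is the mean over its pairs. The function names and the `SPLIT_SIZES` constant are hypothetical and not part of the dataset's tooling.

```python
from typing import Iterable, Sequence, Tuple

# Split sizes as stated in the description (approximate counts).
SPLIT_SIZES = {"train": 82_000, "validation": 6_000, "test": 6_000}


def compression_ratio(doc_tokens: Sequence[str], summary_tokens: Sequence[str]) -> float:
    """Assumed definition: source length divided by summary length, in tokens."""
    if not summary_tokens:
        raise ValueError("summary must contain at least one token")
    return len(doc_tokens) / len(summary_tokens)


def mean_compression_ratio(
    pairs: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> float:
    """Average the per-pair ratios over a split (assumed aggregation)."""
    ratios = [compression_ratio(doc, summary) for doc, summary in pairs]
    if not ratios:
        raise ValueError("no pairs given")
    return sum(ratios) / len(ratios)


def total_samples(split_sizes: dict) -> int:
    """Sum the split sizes to compare against the stated dataset size."""
    return sum(split_sizes.values())
```

Under these assumptions, a 2,860-token document paired with a 200-token summary has a ratio of 14.3, and the three split sizes sum to 94,000, matching the stated total.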
Provided by:
Authors of the paper



