LCSTS

OpenDataLab · Updated 2026-04-12 · Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/LCSTS


Description:

Automatic text summarization is widely regarded as a highly difficult problem, partly because large-scale text summarization datasets are scarce. Given how hard it is to construct large-scale summaries of full texts, we introduce a large-scale Chinese short text summarization dataset built from the Chinese microblogging site Sina Weibo. The corpus contains over 2 million real Chinese short texts, each with a short summary written by the author of the text. We also manually annotated the relevance of 10,666 short summaries to their corresponding short texts. Building on the corpus, we applied recurrent neural networks (RNNs) to summary generation and obtained promising results. This both demonstrates the usefulness of the corpus for short text summarization research and provides a baseline for further work on the topic.

Provider:

OpenDataLab

Created:

2022-08-16

Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

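LCSTS is a large-scale Chinese short text summarization dataset. It contains over 2 million real Chinese short texts from Sina Weibo, each paired with a short summary written by its author, along with manual relevance annotations for 10,666 summaries. The dataset was released in 2015 by the Harbin Institute of Technology Shenzhen Graduate School to address the lack of large-scale Chinese data in automatic text summarization research, and it serves as a benchmark and resource for short text summary generation.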

The content above was collected and summarized by 遇见数据集.
