LCSTS

Name: LCSTS
Creator: maas
Published: 2026-05-13 20:56:52
License: No description available

ModelScope community: updated 2026-05-13, indexed 2025-06-14

Download link:

https://modelscope.cn/datasets/opencompass/LCSTS

Resource overview:

# LCSTS

### Introduction

LCSTS is a large-scale Chinese short text summarization dataset built from the Chinese microblogging website Sina Weibo for summary generation and collected by Harbin Institute of Technology. The corpus consists of over 2 million real Chinese short texts, each with a short summary written by the author of the text, as well as 10,666 manually labeled short summaries.

### Paper

[LCSTS: A Large Scale Chinese Short Text Summarization Dataset](https://www.aclweb.org/anthology/D15-1229.pdf). EMNLP 2015.

### Data Size

Training set: 2,400,591; Validation set: 8,685; Test set: 725.

### Data Format

Each instance is composed of a human-labeled summary quality score (human_label, an integer), an input text (text, a string), and an output summary (summary, a string).

### Example

```
{
  "human_label": 5,
  "summary": "林志颖公司疑涉虚假营销无厂房无研发",
  "text": "日前，方舟子发文直指林志颖旗下爱碧丽推销假保健品，引起哗然。调查发现，爱碧丽没有自己的生产加工厂。其胶原蛋白饮品无核心研发，全部代工生产。号称有“逆生长”功效的爱碧丽“梦幻奇迹限量组”售价高达1080元，实际成本仅为每瓶4元！"
}
```

- "human_label" (`int`): the human-labeled summary quality score. Only the validation set and the test set have this label, and the dataset includes only samples scored 3, 4, or 5; samples scored 1 or 2 are excluded.
- "text" (`str`): the input text.
- "summary" (`str`): the output summary.

### Evaluation Code

The prediction result needs to be consistent with the format expected by the evaluation code.

Dependency packages: rouge==1.0.0, jieba==0.42.1

```shell
python eval.py prediction_file test_private_file
```

The evaluation metrics are ROUGE-1, ROUGE-2, and ROUGE-L, and the output is in dictionary format.

```python
return {
    "rouge-1-f": _, "rouge-1-p": _, "rouge-1-r": _,
    "rouge-2-f": _, "rouge-2-p": _, "rouge-2-r": _,
    "rouge-l-f": _, "rouge-l-p": _, "rouge-l-r": _
}
```

### Author List

Baotian Hu, Qingcai Chen, Fangze Zhu

### Institutions

Harbin Institute of Technology

### Citation

```
@inproceedings{hu2015lcsts,
  title={LCSTS: A Large Scale Chinese Short Text Summarization Dataset},
  author={Hu, Baotian and Chen, Qingcai and Zhu, Fangze},
  booktitle={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  pages={1967--1972},
  year={2015}
}
```
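As an illustration of the data format and the evaluation output described in this card, the following is a minimal sketch, not the contents of `eval.py`. It assumes records are stored as one JSON object per line (the card does not state the file layout). The helper names `validate_record` and `flatten_rouge` and the `require_label` switch are hypothetical. The nested score shape is the one the `rouge` package returns from `Rouge().get_scores(..., avg=True)`; whether `eval.py` builds its dictionary this way is an assumption.

```python
import json

# The card states that only samples scored 3, 4, or 5 are included.
ALLOWED_LABELS = {3, 4, 5}


def validate_record(record: dict, require_label: bool = False) -> dict:
    """Check one LCSTS-style record and return it unchanged if it is valid.

    `require_label` is a hypothetical switch: enable it for validation and
    test records, which carry `human_label`; training records do not.
    """
    for field in ("text", "summary"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field!r} must be a string")
    label = record.get("human_label")
    if label is None:
        if require_label:
            raise ValueError("'human_label' is required for this split")
    elif isinstance(label, bool) or not isinstance(label, int):
        raise ValueError("'human_label' must be an integer")
    elif label not in ALLOWED_LABELS:
        raise ValueError("'human_label' must be 3, 4, or 5")
    return record


def flatten_rouge(nested: dict) -> dict:
    """Turn {"rouge-1": {"f": x, ...}, ...} into {"rouge-1-f": x, ...}."""
    return {
        f"{metric}-{part}": value
        for metric, parts in nested.items()
        for part, value in parts.items()
    }


# A shortened version of the example record from this card.
line = (
    '{"human_label": 5, '
    '"summary": "林志颖公司疑涉虚假营销无厂房无研发", '
    '"text": "日前，方舟子发文直指林志颖旗下爱碧丽推销假保健品，引起哗然。"}'
)
record = validate_record(json.loads(line), require_label=True)

# Placeholder zeros only, not real evaluation results.
nested_scores = {
    "rouge-1": {"f": 0.0, "p": 0.0, "r": 0.0},
    "rouge-2": {"f": 0.0, "p": 0.0, "r": 0.0},
    "rouge-l": {"f": 0.0, "p": 0.0, "r": 0.0},
}
flat = flatten_rouge(nested_scores)
print(sorted(flat))
```

The flattened keys match the nine names listed under Evaluation Code, so a prediction pipeline that already uses the `rouge` package could compare its own scores against that format.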


Provider:

maas

Created:

2024-07-02
