
GigaWord

ModelScope community, updated 2025-10-23, indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/OmniData/GigaWord
Resource description:
---
displayName: GigaWord
labelTypes:
  - Text
  - English Corpus
license:
  - MIT
mediaTypes:
  - Text
paperUrl: https://arxiv.org/abs/1709.05475
publishDate: "2015"
publishUrl: https://deepai.org/dataset/gigaword
tags:
  - Text
taskTypes:
  - Natural Language Generation
  - Text Summarization/Simplification
---

# Dataset Introduction

## Overview

Headline generation on a corpus of article pairs from Gigaword, consisting of around 4 million articles.

## Citation

```
@inproceedings{matsumaru-etal-2020-improving,
    title = "Improving Truthfulness of Headline Generation",
    author = "Matsumaru, Kazuki and Takase, Sho and Okazaki, Naoaki",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.123",
    doi = "10.18653/v1/2020.acl-main.123",
    pages = "1335--1346",
    abstract = "Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art encoder-decoder model, we show that the model sometimes generates untruthful headlines. We conjecture that one of the reasons lies in untruthful supervision data used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article entails its headline. After confirming quite a few untruthful instances in the datasets, this study hypothesizes that removing untruthful instances from the supervision data may remedy the problem of the untruthful behaviors of the model. Building a binary classifier that predicts an entailment relation between an article and its headline, we filter out untruthful instances from the supervision data. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the generated headlines.",
}
```

## Download dataset

:modelscope-code[]{type="git"}
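As an alternative to the Git download, the dataset can probably also be fetched through the ModelScope Python SDK. The following is a minimal sketch, not an official recipe: it assumes the `modelscope` package is installed and that network access is available. The namespace and dataset name are taken from the download link on this page. The helper `load_gigaword` and the `"train"` split name are illustrative assumptions, since the page does not list the available splits or field names.

```python
# Identifiers taken from the download link on this page:
# https://modelscope.cn/datasets/OmniData/GigaWord
NAMESPACE = "OmniData"
DATASET_NAME = "GigaWord"


def load_gigaword(split="train"):
    """Load one split of the dataset through the ModelScope SDK.

    Hypothetical helper: requires `pip install modelscope` and network
    access. The default split name is an assumption, not confirmed by
    the dataset card.
    """
    # Imported lazily so this module can be imported without the SDK installed.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load(DATASET_NAME, namespace=NAMESPACE, split=split)
```

Calling `load_gigaword()` and printing the first record is a reasonable way to check the actual field names before building a headline-generation pipeline on top of it.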

Provider:
maas
Created:
2024-07-04