Automatic Thai news summarization using deep learning

Name: Automatic Thai news summarization using deep learning
Creator: Thammasat University
Published: 2022-09-08 12:00:42
License: No description available

Source: DataCite Commons (updated 2022-09-08, indexed 2025-04-16)

Download link:

http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2021.556

Description:

Large amounts of textual data are available on the internet, and the volume grows every day. Much of this data is redundant, however, so manually finding the relevant information among similar documents is time-consuming and laborious. A better mechanism is needed to extract useful and significant information quickly and effectively, and text summarization is one method that addresses this problem.

This research proposes automatic Thai news summarization for single documents, based on a hybrid approach that combines extractive and abstractive summarization to improve model performance. In addition, we augmented the dataset in a simple way: original documents from the training dataset were reused as both the input documents and the summaries. The augmented economic dataset is unlabeled data. We added the unlabeled data to the training dataset so that the model could learn a language model and thereby improve the grammatical structure of the output summary. We also studied how document length and word position affect the performance of deep learning models.

According to the results, the proposed model obtained ROUGE-1 = 0.6456, ROUGE-2 = 0.4108, and ROUGE-L = 0.6372, and it can generate output summaries that are readable and grammatically correct. Regarding document length, we found that the deep learning models summarize short documents better than long ones. Regarding word position, the models work well on original documents in which the important words appear at the beginning.
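
For readers unfamiliar with the ROUGE scores quoted above, the following is a minimal sketch of how ROUGE-N and ROUGE-L can be computed from token lists. It is an illustration only, not the evaluation code used in this research: the function names are hypothetical, the F1 variant is assumed (the description does not say whether recall, precision, or F1 is reported), and the inputs are assumed to be already word-segmented, which for Thai requires a separate tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 (assumed variant) from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 (assumed variant) based on the longest common subsequence."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder tokens standing in for segmented words.
candidate = ["a", "b", "c", "d"]
reference = ["a", "b", "d"]
print(rouge_n(candidate, reference, 1))
print(rouge_n(candidate, reference, 2))
print(rouge_l(candidate, reference))
```

Scores computed this way depend heavily on the word segmentation, so values from different tokenizers or ROUGE variants are not directly comparable with the figures reported here.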

Provider:

Thammasat University

Created:

2022-09-08
