INDOSUM
arXiv. Updated 2019-03-20. Indexed 2024-06-21.
Download link:
https://github.com/kata-ai/indosum
Description:
INDOSUM is a large-scale benchmark dataset for Indonesian text summarization, created by the research team at Kata.ai. It contains approximately 19,000 news articles, each paired with a manually constructed summary, making it substantially larger than other Indonesian summarization datasets. The articles were collected from the online news aggregator Shortir, then preprocessed and annotated using natural language processing (NLP) techniques. INDOSUM is intended primarily for Indonesian text summarization research, where it aims to advance and standardize the field by providing a large, high-quality benchmark.
Provider:
Kata.ai, Jakarta, Indonesia
Created:
2018-10-12



