Chinese Paragraph-level Topic Structure corpus (CPTS)
Source: arXiv  Updated: 2024-03-26  Indexed: 2024-06-21
Download link:
https://github.com/fjiangAI/CPTS
Description:
CPTS is a Chinese paragraph-level topic structure dataset built by the Shenzhen Institute of Big Data, containing approximately 14,393 news documents. Each document is divided into multiple paragraphs according to its topics, with each paragraph centered on a specific topic. The corpus was built with a two-stage human-machine collaborative annotation method to ensure high quality: automatic extraction first proposes the topic boundaries and their content, and human verifiers then check the result to ensure the topic structure is correct. CPTS is mainly used for natural language processing tasks such as topic segmentation and outline generation, helping readers quickly understand and locate information in long documents.
Provider:
Shenzhen Institute of Big Data
Created:
2023-05-24



