Chinese-abbreviation-dataset
Source: arXiv | Updated: 2017-12-18 | Indexed: 2024-06-21
Download link:
https://github.com/lancopku/Chinese-abbreviation-dataset
Resource description:
Chinese-abbreviation-dataset was created by the Key Laboratory of Computational Linguistics (Ministry of Education) at Peking University for Chinese abbreviation prediction. It contains 10,786 full-form expressions, of which 8,015 are positive full forms and 2,661 are negative full forms, that is, full forms that have no valid abbreviation. The data was drawn from the People's Daily corpus and the SIGHAN word segmentation corpus, then preprocessed and annotated. The dataset is intended to support research on Chinese abbreviation prediction, in particular the handling of negative full forms, with the aim of improving performance on natural language processing tasks.
Provider:
Peking University
Created:
2017-12-18



