
Chinese-abbreviation-dataset

Source: arXiv. Updated: 2017-12-18. Indexed: 2024-06-21.
Download link:
https://github.com/lancopku/Chinese-abbreviation-dataset
Overview:
The Chinese-abbreviation-dataset was created by the Key Laboratory of Computational Linguistics (Ministry of Education) at Peking University for Chinese abbreviation prediction. It contains 10,786 full-form expressions, of which 8,015 are positive full forms and 2,661 are negative full forms. The data was drawn from the People's Daily corpus and the SIGHAN word segmentation corpus, then preprocessed and annotated for reliability and practical use. The dataset is intended to support research on Chinese abbreviation prediction, in particular the handling of negative full forms, that is, full-form expressions that have no valid abbreviation, with the aim of improving performance on natural language processing tasks.
Provider:
Peking University
Created:
2017-12-18