hkcancor
Source: OpenCSG | Updated: 2024-07-19 | Indexed: 2025-05-03
Download link:
https://www.opencsg.com/datasets/AIWizards/hkcancor
Resource description:
The Hong Kong Cantonese Corpus (HKCanCor) contains transcribed conversations recorded between March 1997 and August 1998, covering spontaneous speech and broadcast programs and totalling approximately 230,000 Chinese characters. The corpus is word-segmented, and each word is annotated with a part-of-speech (POS) tag and a Cantonese romanization. It comprises roughly 10,000 dialogue instances and is mainly used for tasks such as translation, text generation, and dialogue modeling. The texts are transcribed from the original recordings, annotated by experts, and released under the CC-BY 4.0 license. Each instance includes a dialogue ID, a speaker ID, a turn number, POS tags in both PRF and UD 2.0 formats, the Chinese characters, and a romanization in LSHK format.
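The per-instance fields listed above can be pictured as one record with parallel per-token lists. The following is a minimal Python sketch under stated assumptions: the class name `Utterance`, every field name, and the sample sentence are hypothetical and are not taken from the corpus or its published schema, and the PRF tag values are placeholders only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One dialogue turn. Field names are illustrative, not the dataset's real schema."""
    dialogue_id: str
    speaker_id: str
    turn_number: int
    tokens: List[str]        # word-segmented Chinese characters
    romanization: List[str]  # LSHK-format romanization, one entry per token
    pos_prf: List[str]       # PRF-format POS tags (placeholder values in the example)
    pos_ud: List[str]        # UD 2.0 POS tags

    def __post_init__(self) -> None:
        # Every annotation layer must align one-to-one with the tokens.
        n = len(self.tokens)
        if not (len(self.romanization) == len(self.pos_prf) == len(self.pos_ud) == n):
            raise ValueError("per-token annotation lists must all have the same length")

    def char_count(self) -> int:
        """Number of Chinese characters in this turn, summed over tokens."""
        return sum(len(token) for token in self.tokens)

    def tagged(self) -> List[str]:
        """Render each token as 'characters/UD tag/romanization'."""
        return [
            f"{tok}/{ud}/{rom}"
            for tok, ud, rom in zip(self.tokens, self.pos_ud, self.romanization)
        ]


# A made-up sentence for illustration only; it does not come from HKCanCor.
example = Utterance(
    dialogue_id="demo-001",
    speaker_id="A",
    turn_number=1,
    tokens=["我", "鍾意", "食", "飯"],
    romanization=["ngo5", "zung1ji3", "sik6", "faan6"],
    pos_prf=["r", "v", "v", "n"],  # placeholder tags, not verified against the corpus tagset
    pos_ud=["PRON", "VERB", "VERB", "NOUN"],
)

print(example.char_count())
print(" ".join(example.tagged()))
```

Keeping the annotation layers as parallel lists makes the one-tag-per-word alignment explicit, and the length check in `__post_init__` rejects a record whose layers have drifted out of step.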
Created:
2024-07-19
Collected summary
Dataset introduction

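Background and challenges
Background overview
The Hong Kong Cantonese Corpus (HKCanCor) contains Cantonese conversations recorded in 1997 and 1998, totalling about 230,000 Chinese characters and covering spontaneous speech and broadcast programs. It is word-segmented and annotated with part-of-speech tags and Cantonese romanization. The dataset is suited to tasks such as translation, text generation, and dialogue modeling, and is released under the CC-BY 4.0 license.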
The content above was collected and summarized by 遇见数据集.



