cantonese-chinese

Name: cantonese-chinese
Creator: maas
Published: 2025-11-17 16:42:31
License: No description provided

ModelScope community. Updated 2025-11-17; indexed 2025-03-15.

Download link:

https://modelscope.cn/datasets/pengzhendong/cantonese-chinese

Resource description:

# Cantonese-Mandarin-Traditional Chinese Parallel Corpus

This dataset provides a parallel corpus of Cantonese, Simplified Chinese, and Traditional Chinese text.

## Dataset Composition

The dataset is a combination of two existing datasets:

1. [botisan-ai/cantonese-mandarin-translations](https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations)
2. [raptorkwok/cantonese-chinese-dataset-gen2](https://huggingface.co/datasets/raptorkwok/cantonese-chinese-dataset-gen2)

**Train Set:** Merged from both source datasets

**Test and Validation Sets:** Derived from raptorkwok/cantonese-chinese-dataset-gen2

## Language Variants

The dataset contains three language variants:

1. **yue:** Spoken Cantonese (Yue) rendered into words
2. **zh:** Simplified Chinese
3. **zht:** Traditional Chinese as written in Hong Kong

## Conversion Process

The conversion from Simplified Chinese to Traditional Chinese was performed using [StarCC](https://github.com/StarCC0/spec).

## Data Format

The dataset is structured in JSON format, with each entry containing parallel text in the three language variants. Example entry:

```json
{
  "yue": "講唔到重點而且唔夠全面",
  "zh": "说不了重点而且不够全面",
  "zht": "說不了重點而且不夠全面"
}
```

## Use Cases

This dataset can be valuable for various natural language processing tasks, including:

- Machine translation between Cantonese, Simplified Chinese, and Traditional Chinese
- Comparative linguistic studies of Chinese language variants
- Development of multilingual Chinese language models

## Limitations

Users should be aware that automatic conversion between Simplified and Traditional Chinese, while generally reliable, may not always capture differences in vocabulary and idioms between regions using different writing systems.
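The entry format shown under Data Format can be read with the Python standard library alone. The following is a minimal sketch, assuming the entries are stored one JSON object per line; the card does not state the on-disk layout, so that is an assumption. The helper names `parse_entries` and `translation_pairs` are hypothetical and are not part of any published loader for this dataset.

```python
import json

# The three keys documented in the dataset card.
REQUIRED_KEYS = ("yue", "zh", "zht")


def parse_entries(lines):
    """Parse an iterable of JSON strings into validated entries.

    Assumes one JSON object per line (an assumption, not stated in the card).
    Blank lines are skipped; an entry missing a documented key raises ValueError.
    """
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        # Keep only the documented fields.
        entries.append({key: record[key] for key in REQUIRED_KEYS})
    return entries


def translation_pairs(entries, src, tgt):
    """Return (source, target) text pairs for one translation direction."""
    return [(entry[src], entry[tgt]) for entry in entries]


# The example entry from the card, serialized on a single line.
sample = [
    '{"yue": "講唔到重點而且唔夠全面", '
    '"zh": "说不了重点而且不够全面", '
    '"zht": "說不了重點而且不夠全面"}'
]

entries = parse_entries(sample)
pairs = translation_pairs(entries, "yue", "zh")
print(pairs[0])
```

Passing `"yue"` and `"zht"` instead would yield Cantonese to Traditional Chinese pairs, which may suit the machine translation use case listed above.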
## Licence and Acknowledgements

Contains data from:

- "botisan-ai/cantonese-mandarin-translations" by Liang, H., available at https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations, licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0).
- "raptorkwok/cantonese-chinese-dataset-gen2" by Raptor K, available at https://huggingface.co/datasets/raptorkwok/cantonese-chinese-dataset-gen2, licensed under CC0.

Provider:

maas

Created:

2025-03-12

Collection summary

Dataset introduction