
cantonese-chinese

ModelScope community (魔搭社区) · Updated 2025-11-17 · Indexed 2025-03-15
Download link:
https://modelscope.cn/datasets/pengzhendong/cantonese-chinese
Resource description:
# Cantonese-Mandarin-Traditional Chinese Parallel Corpus

This dataset provides a parallel corpus of Cantonese, Simplified Chinese, and Traditional Chinese text.

## Dataset Composition

The dataset is a combination of two existing datasets:

1. [botisan-ai/cantonese-mandarin-translations](https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations)
2. [raptorkwok/cantonese-chinese-dataset-gen2](https://huggingface.co/datasets/raptorkwok/cantonese-chinese-dataset-gen2)

**Train Set:** Merged from both source datasets

**Test and Validation Sets:** Derived from raptorkwok/cantonese-chinese-dataset-gen2

## Language Variants

The dataset contains three language variants:

1. **yue:** Spoken Cantonese (Yue) rendered into words
2. **zh:** Simplified Chinese
3. **zht:** Traditional Chinese as written in Hong Kong

## Conversion Process

The conversion from Simplified Chinese to Traditional Chinese was performed using [StarCC](https://github.com/StarCC0/spec).

## Data Format

The dataset is structured in JSON format, with each entry containing parallel text in the three language variants.

Example entry:

```json
{
  "yue": "講唔到重點而且唔夠全面",
  "zh": "说不了重点而且不够全面",
  "zht": "說不了重點而且不夠全面"
}
```

## Use Cases

This dataset can be valuable for various natural language processing tasks, including:

- Machine translation between Cantonese, Simplified Chinese, and Traditional Chinese
- Comparative linguistic studies of Chinese language variants
- Development of multilingual Chinese language models

## Limitations

Users should be aware that automatic conversion between Simplified and Traditional Chinese, while generally reliable, may not always capture differences in vocabulary and idioms between regions using different writing systems.
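As a rough illustration of how entries in the three-variant format described under Data Format might be consumed for machine translation, the sketch below validates entries and builds source/target pairs. It assumes the entries arrive as a JSON array; the actual file layout (single array, JSON Lines, or otherwise) is not stated here, and the helper names `parse_entries` and `translation_pairs` are hypothetical, not part of any dataset tooling.

```python
import json

# Keys documented for each entry: Cantonese, Simplified Chinese, Traditional Chinese.
REQUIRED_KEYS = ("yue", "zh", "zht")


def parse_entries(raw: str) -> list[dict[str, str]]:
    """Parse a JSON array and keep only entries with all three non-empty variants.

    Assumption: the input is a JSON array of objects. Adapt this if the
    real files use a different layout.
    """
    valid = []
    for entry in json.loads(raw):
        if not isinstance(entry, dict):
            continue
        if all(isinstance(entry.get(k), str) and entry[k].strip() for k in REQUIRED_KEYS):
            valid.append({k: entry[k] for k in REQUIRED_KEYS})
    return valid


def translation_pairs(entries: list[dict[str, str]], src: str, tgt: str) -> list[tuple[str, str]]:
    """Return (source, target) text pairs for one translation direction."""
    return [(e[src], e[tgt]) for e in entries]


# The first entry is the documented example; the second is a deliberately
# incomplete, hypothetical entry used only to exercise the validation.
sample = json.dumps(
    [
        {
            "yue": "講唔到重點而且唔夠全面",
            "zh": "说不了重点而且不够全面",
            "zht": "說不了重點而且不夠全面",
        },
        {"zh": "说不了重点而且不够全面"},
    ],
    ensure_ascii=False,
)

entries = parse_entries(sample)
pairs = translation_pairs(entries, "yue", "zh")
```

Dropping incomplete entries rather than raising an error is a design choice for this sketch; a stricter pipeline might prefer to fail loudly on malformed data.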
## Licence and Acknowledgements

Contains data from:

- "botisan-ai/cantonese-mandarin-translations" by Liang, H., https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations, licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0).
- "raptorkwok/cantonese-chinese-dataset-gen2" by Raptor K, available at https://huggingface.co/datasets/raptorkwok/cantonese-chinese-dataset-gen2, licensed under CC0.

Provider:
maas
Created:
2025-03-12
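Background overview
This dataset is a parallel corpus of Cantonese, Simplified Chinese, and Traditional Chinese. It contains aligned text in the three language variants, stored in JSON format. It was built by merging two existing datasets and is suited to natural language processing tasks such as machine translation and comparative language research. The dataset is released under the Apache 2.0 licence, but the source data carries other licences.
The summary above was collected and generated by 遇见数据集.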