cantonese-mandarin-translations

Name: cantonese-mandarin-translations
Creator: maas
Published: 2025-10-09 09:40:42
License: no description provided

ModelScope Community, updated 2025-10-09, indexed 2025-03-08

Download link:

https://modelscope.cn/datasets/pengzhendong/cantonese-mandarin-translations

Resource overview:

# Dataset Card for cantonese-mandarin-translations

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This is a machine-translated parallel corpus between Cantonese (a Chinese dialect spoken mainly in Guangdong province (China), Hong Kong, Macau, and parts of Malaysia) and written Chinese in Simplified Chinese.

### Supported Tasks and Leaderboards

N/A

### Languages

- Cantonese (`yue`)
- Simplified Chinese (`zh-CN`)

## Dataset Structure

The parallel corpus is stored as JSON Lines; each record has a `yue` field and a `zh` field.

### Data Instances

N/A

### Data Fields

- `yue`: the Cantonese text
- `zh`: the machine-translated Chinese text

### Data Splits

No data splitting has been done yet.
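Since the card describes JSON Lines records with `yue` and `zh` fields and ships no predefined splits, reading the file and holding out a dev set might look like the sketch below. This is only an illustration: the function names, the 10% dev fraction, the seed, and the file name in the usage comment are assumptions, not part of the dataset.

```python
import json
import random


def load_pairs(path):
    """Read a JSON Lines file and return a list of (yue, zh) tuples.

    Assumes every non-blank line is a JSON object with `yue` and `zh` keys,
    as described in the Data Fields section.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            pairs.append((record["yue"], record["zh"]))
    return pairs


def split_pairs(pairs, dev_fraction=0.1, seed=0):
    """Shuffle a copy with a fixed seed and hold out `dev_fraction` as a dev set.

    The fraction and seed are arbitrary illustrative choices.
    Returns (train, dev).
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Usage (the file name is hypothetical; use whatever the download provides):
#   pairs = load_pairs("cantonese-mandarin-translations.jsonl")
#   train, dev = split_pairs(pairs)
```

Fixing the seed keeps the split reproducible across runs, which matters here because the dataset itself defines no canonical split to compare against.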
## Dataset Creation

The dataset was produced as follows:

- Download the [HKCancor Cantonese Corpus](https://github.com/fcbond/hkcancor) and the [CommonVoice Cantonese (Hong Kong Chinese `yue`) text corpus](https://commonvoice.mozilla.org/en/datasets).
- Extract the text and merge the two datasets.
- Translate the text from `yue` to `zh-Hans` with [Microsoft's Translator API](https://learn.microsoft.com/en-us/azure/ai-services/translator/language-support).

### Curation Rationale

We could not find an existing corpus of this kind, so we generated a reasonable batch of samples using machine translation for research purposes.

### Source Data

- [HKCancor](https://github.com/fcbond/hkcancor)
- [CommonVoice 7.0 Chinese (Hong Kong)](https://commonvoice.mozilla.org/en/datasets)

#### Initial Data Collection and Normalization

Normalization scripts will be included soon.

#### Who are the source language producers?

- [HKCancor](https://github.com/fcbond/hkcancor)
- [CommonVoice 7.0 Chinese (Hong Kong)](https://commonvoice.mozilla.org/en/datasets)

### Annotations

#### Annotation process

We ran the Cantonese text corpus through Microsoft's Translator API.

#### Who are the annotators?

- [Microsoft's Translator API](https://learn.microsoft.com/en-us/azure/ai-services/translator/language-support)

### Personal and Sensitive Information

N/A

## Considerations for Using the Data

### Social Impact of Dataset

We would like to share this parallel corpus and welcome contributions to help preserve the Cantonese dialect.

### Discussion of Biases

N/A

### Other Known Limitations

This parallel corpus is machine-translated and is not 100% accurate.
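Because the translations are machine-generated, users may want a quick pass that flags obviously suspect pairs before training. The sketch below is one possible heuristic and is not something the dataset authors describe: the function name, the three checks, and the length-ratio threshold of 3.0 are all assumptions. A flagged pair is only a candidate for review, since short Cantonese and Mandarin sentences can legitimately be written identically.

```python
def flag_suspect_pairs(pairs, max_length_ratio=3.0):
    """Return (index, reason) tuples for pairs that look like translation failures.

    `pairs` is a sequence of (yue, zh) strings. The checks are crude,
    illustrative heuristics; the threshold is an arbitrary assumption.
    """
    flagged = []
    for i, (yue, zh) in enumerate(pairs):
        yue, zh = yue.strip(), zh.strip()
        if not yue or not zh:
            flagged.append((i, "empty side"))
        elif yue == zh:
            flagged.append((i, "identical text"))
        elif max(len(yue), len(zh)) / min(len(yue), len(zh)) > max_length_ratio:
            flagged.append((i, "length mismatch"))
    return flagged
```

Character counts are used for the length check because both sides are written in Chinese characters, so no tokenizer is needed for a rough comparison.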
## Additional Information

### Dataset Curators

- [Botisan AI](https://botisan.ai)
- [Haoran (Simon) Liang](https://github.com/lhr0909)

### Licensing Information

[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

### Citation Information

```
@misc {botisanAiCantoneseMandarinTranslationsDatasets,
    author = {Liang, H.},
    title = {Cantonese Mandarin Translations Dataset},
    year = {2021},
    url = {https://huggingface.co/datasets/botisan-ai/cantonese-mandarin-translations},
}
```

### Contributions

Thanks to [@lhr0909](https://github.com/lhr0909) for adding this dataset.


Provider: maas

Created: 2025-03-06
