Zeteng/cantonese-llm-data
收藏Hugging Face2025-12-17 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/Zeteng/cantonese-llm-data
下载链接
链接失效反馈官方服务:
资源简介:
Cantonese SFT Chat Dataset是一个高质量的粤语指令微调数据集,旨在训练或微调7B参数规模的大语言模型(如Qwen, DeepSeek, Yi等),以提升其粤语口语对话能力。数据集包含口语化语料,解决通用模型在粤语对话中书面语过重或翻译腔的问题。数据来源包括预训练/通用文本(如YueData、AlienKevin/LIHKG、Cantonese Wikipedia)和SFT/高质量标注数据(如HKCanCor、Common Voice (Yue)、PUD Cantonese、Custom Excel Data)。数据集经过清洗、过滤和格式化处理,支持多种微调框架,并提供了详细的数据价值评估和使用指南。
The Cantonese SFT Chat Dataset is a high-quality Cantonese instruction fine-tuning dataset designed to train or fine-tune 7B-parameter large language models (such as Qwen, DeepSeek, Yi, etc.) to enhance their Cantonese spoken dialogue capabilities. The dataset includes spoken language corpora to address the issue of overly formal or translated-sounding Cantonese in general models. Data sources include pre-training/general texts (e.g., YueData, AlienKevin/LIHKG, Cantonese Wikipedia) and SFT/high-quality annotated data (e.g., HKCanCor, Common Voice (Yue), PUD Cantonese, Custom Excel Data). The dataset has undergone cleaning, filtering, and formatting processes, supports multiple fine-tuning frameworks, and provides detailed data value assessments and usage guidelines.
提供机构:
Zeteng



