CMLI-NLP/CUTE-Datasets
Source: Hugging Face. Updated: 2025-06-19. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/CMLI-NLP/CUTE-Datasets
Description:
CUTE is a large-scale multilingual dataset covering four languages: Chinese, Uyghur, Tibetan, and English. It is designed to improve cross-lingual knowledge transfer for low-resource languages and contains both parallel and non-parallel corpora, with a total size of approximately 50GB. In the parallel corpora, the Chinese, English, Uyghur, and Tibetan portions are 2.62GB, 3.49GB, 7.37GB, and 11.22GB respectively. In the non-parallel corpora, they are 2.64GB, 3.49GB, 7.77GB, and 11.90GB respectively. The dataset was generated by machine translation and then evaluated by human reviewers, with average translation scores for the individual languages ranging from 8.5 to 9.1.
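The per-language sizes quoted in the description can be cross-checked against the stated total of about 50GB. The following is a minimal sketch of that arithmetic; the variable names are chosen here for illustration and are not part of the dataset.

```python
# Per-language corpus sizes in GB, copied from the description.
# The dictionary and variable names are illustrative assumptions.
parallel_gb = {
    "Chinese": 2.62,
    "English": 3.49,
    "Uyghur": 7.37,
    "Tibetan": 11.22,
}
non_parallel_gb = {
    "Chinese": 2.64,
    "English": 3.49,
    "Uyghur": 7.77,
    "Tibetan": 11.90,
}

# Round to two decimals to avoid floating-point noise in the sums.
parallel_total = round(sum(parallel_gb.values()), 2)
non_parallel_total = round(sum(non_parallel_gb.values()), 2)
grand_total = round(parallel_total + non_parallel_total, 2)

print(f"Parallel corpora:     {parallel_total:.2f} GB")
print(f"Non-parallel corpora: {non_parallel_total:.2f} GB")
print(f"Combined:             {grand_total:.2f} GB")
```

The parallel corpora sum to 24.70GB and the non-parallel corpora to 25.80GB, giving 50.50GB combined, which is consistent with the "approximately 50GB" figure.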
Provider:
CMLI-NLP



