XBMU-MC: A Multilingual Parallel Corpus
Source: DataCite Commons. Updated: 2025-07-25. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=5163c98e9eff48dbb65e3b40282beca2
Description:
The XBMU-MC Multilingual Parallel Corpus consists of 22,000 high-quality parallel samples covering three low-resource language pairs: Chinese-Tibetan, Chinese-Uyghur, and Chinese-Mongolian. Each sample pairs a source text in Chinese with its translation in Tibetan, Uyghur, or Mongolian. All samples share a uniform structure with three fields: instruction, input, and output. The instruction field describes the type or requirement of the translation task, the input field contains the original text in the source language, and the output field contains the translated text in the target language. To ensure quality and consistency, each translation pair is manually reviewed and automatically evaluated for both source-target alignment and translation accuracy. The dataset covers a wide range of domains, including culture, science and technology, and society, and the translated texts vary in expression and linguistic structure, which helps improve model robustness and generalization. The dataset is stored in standard JSON format, which is convenient for downstream processing, analysis, and model training.
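The three-field structure described above can be checked with a short script before training. The following is a minimal Python sketch, not the dataset's official tooling. It assumes the JSON file holds a top-level array of objects, which the description does not confirm; the function names and the sample instruction wording are hypothetical.

```python
import json
from pathlib import Path

# Field names taken from the dataset description.
REQUIRED_FIELDS = ("instruction", "input", "output")


def load_samples(path):
    """Load samples from a JSON file.

    Assumption: the file is a top-level JSON array of objects.
    """
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of samples")
    return data


def split_valid(samples):
    """Separate samples that have all three non-empty string fields from the rest."""
    valid, invalid = [], []
    for sample in samples:
        ok = isinstance(sample, dict) and all(
            isinstance(sample.get(field), str) and sample[field].strip()
            for field in REQUIRED_FIELDS
        )
        (valid if ok else invalid).append(sample)
    return valid, invalid


if __name__ == "__main__":
    # Hypothetical records; the output text is a placeholder, not real data.
    demo = [
        {
            "instruction": "Translate the following Chinese text into Tibetan.",
            "input": "你好",
            "output": "<Tibetan translation>",
        },
        {"instruction": "Translate into Uyghur.", "input": "你好"},  # missing output
    ]
    good, bad = split_valid(demo)
    print(f"{len(good)} valid, {len(bad)} invalid")
```

With a downloaded file, `split_valid(load_samples(path))` would apply the same check to the full corpus. Rejected samples are returned rather than silently dropped so they can be inspected.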
Provider:
Science Data Bank
Created:
2025-05-19



