Multilingual Collaborative Representation Dataset
Source: DataCite Commons. Updated: 2025-04-27. Indexed: 2025-05-18.
Download link:
https://www.scidb.cn/detail?dataSetId=2a9b8cd1f1be4e1fac58e287afc453b8
Description:
This study constructs a multilingual collaborative representation dataset that uses Chinese as a bridge language and covers low-resource languages such as Tibetan, Uyghur, and Mongolian, with the aim of improving their performance on cross-lingual tasks. The corpus is collected from multiple channels, including open government resources, ethnic cultural websites, and academic resources, to keep it diverse and representative. After collection, machine translation generates preliminary bilingual data, and sentence-pair alignment tools then refine the results into aligned bilingual data. With Chinese as the bridge, the study further builds a shared multilingual semantic space that incorporates Chinese semantic information such as synonyms, near-synonyms, and taxonomic relations to improve semantic consistency across languages. The semantic triples in this space support semantic alignment between the low-resource languages. The dataset went through quality control and manual verification during construction. It is intended to support cross-lingual tasks, information retrieval, and multilingual generation for low-resource languages.
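The bridge-language idea in the description can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the triple layout `(head, relation, tail)`, the relation names, the function names, and the placeholder tokens such as `bo_term_1` (which stand in for real Tibetan and Uyghur terms) are hypothetical and do not reflect the dataset's actual schema or files.

```python
# Hypothetical Chinese semantic triples: (head, relation, tail).
ZH_TRIPLES = [
    ("高兴", "synonym", "快乐"),
    ("狗", "is_a", "动物"),
]

# Hypothetical lexicons from Chinese terms to placeholder tokens.
# The tokens are NOT real Tibetan or Uyghur words.
ZH_TO_BO = {"高兴": "bo_term_1", "快乐": "bo_term_2", "狗": "bo_term_3"}
ZH_TO_UG = {"高兴": "ug_term_1", "狗": "ug_term_3", "动物": "ug_term_4"}


def project_triples(triples, lexicon):
    """Translate triples into a target language through the lexicon.

    A triple is kept only if both its head and its tail have an entry.
    """
    return [
        (lexicon[head], relation, lexicon[tail])
        for head, relation, tail in triples
        if head in lexicon and tail in lexicon
    ]


def pivot_pairs(lexicon_a, lexicon_b):
    """Pair terms of two languages that share the same Chinese pivot term."""
    return {
        zh: (lexicon_a[zh], lexicon_b[zh])
        for zh in lexicon_a
        if zh in lexicon_b
    }


bo_triples = project_triples(ZH_TRIPLES, ZH_TO_BO)
ug_triples = project_triples(ZH_TRIPLES, ZH_TO_UG)
bo_ug_pairs = pivot_pairs(ZH_TO_BO, ZH_TO_UG)
```

In this toy setup, two low-resource terms are linked only when both map to the same Chinese term, which is one simple way to read "Chinese as a bridge language"; the dataset's real alignment procedure may differ.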
Provider:
Science Data Bank
Created:
2025-04-17



