Multilingual Collaborative Representation Dataset

Source: DataCite Commons | Updated: 2025-04-27 | Indexed: 2025-05-18
Download link:
https://www.scidb.cn/detail?dataSetId=2a9b8cd1f1be4e1fac58e287afc453b8
Description:
This study constructs a Chinese-based multilingual collaborative representation dataset covering low-resource languages such as Tibetan, Uyghur, and Mongolian, with the aim of improving their performance in cross-lingual tasks. The corpus is collected through multiple channels, including open government resources, ethnic cultural websites, and academic resources, to make it diverse and representative. After collection, machine translation is used to generate preliminary bilingual data, and sentence-pair alignment tools are then applied to refine the translations into high-quality aligned bilingual data. Using Chinese as a bridge language, the study further builds a shared multilingual semantic space, which improves semantic consistency across languages by incorporating Chinese semantic information such as synonyms, near-synonyms, and taxonomic relations. The semantic triples in this space support semantic alignment between the low-resource languages. The dataset underwent strict quality control and manual verification during construction. It provides semantic support for cross-lingual tasks, information retrieval, and multilingual generation in low-resource languages.
Provider:
Science Data Bank
Created:
2025-04-17