adeshkin/khakas-russian-parallel-corpus
Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/adeshkin/khakas-russian-parallel-corpus
Description:
The Khakas-Russian Parallel Corpus is a dataset intended to support natural language processing (NLP) tools and machine translation for Khakas, a language that UNESCO classifies as "Definitely Endangered". It contains 159,213 parallel Khakas-Russian sentence pairs drawn from a range of domains and sources. The corpus was built in several stages, combining manual translation with AI-assisted translation, and was subject to rigorous quality control. The dataset card also documents detailed statistics, the translation process, data quality notes, background on the language, the list of contributors, the alphabet/character set, and citation instructions.
Provider:
adeshkin
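
Loading sketch:
The snippet below is a minimal, non-authoritative sketch of how the corpus might be loaded with the Hugging Face `datasets` library. The repository ID and mirror host are taken from the listing above; the helper names `dataset_url` and `load_corpus` are hypothetical, and the snippet makes no assumption about split or column names, which should be read from the loaded object. Setting `HF_ENDPOINT` to route downloads through the mirror is assumed to take effect only if it happens before `huggingface_hub` is first imported.

```python
import os
from typing import Optional

# Repository ID and mirror host as they appear in the listing.
REPO_ID = "adeshkin/khakas-russian-parallel-corpus"
MIRROR = "https://hf-mirror.com"


def dataset_url(repo_id: str = REPO_ID, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def load_corpus(repo_id: str = REPO_ID, endpoint: Optional[str] = None):
    """Load the corpus with `datasets.load_dataset` (hypothetical helper).

    If `endpoint` is given, HF_ENDPOINT is set before the lazy import so
    that the Hugging Face client can pick up the mirror. This is an
    assumption about import order, not a guarantee.
    """
    if endpoint is not None:
        os.environ["HF_ENDPOINT"] = endpoint
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


if __name__ == "__main__":
    print(dataset_url())
    corpus = load_corpus(endpoint=MIRROR)  # requires network access
    print(corpus)  # shows the available splits and column names
```

Printing the returned object is the safest way to discover the actual splits and field names before writing any code that indexes into the sentence pairs.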



