The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0

Source: DataCite Commons | Updated: 2025-11-21 | Indexed: 2024-07-13
Download link:
https://nrc-digital-repository.canada.ca/eng/view/object/?id=c7e34fa7-7629-43c2-bd6d-19b32bf64f60
Description:
The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This dataset is a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language, or an Indigenous language of the Americas, released to date. Accompanying the corpus is a subset of gold standard alignments for alignment evaluation purposes, and scripts to replicate the preprocessing used in our baseline machine translation experiments.
Provider:
National Research Council Canada
Created:
2020-01-22