Erzya and Moksha Extended Corpora (ERME)

Mendeley Data | Updated 2024-01-31 | Indexed 2024-06-28

Download link:

https://etsin.fairdata.fi/dataset/5aeaa4f5-3ba3-4932-a63d-73f2ed0bb8cd

Description:

ERME consists predominantly of Erzya and Moksha literature, drawn from several media publications from the 19th to the 20th century. The material was mapped in Saransk in 1997-2004 and has been mapped in Helsinki since 2004. The most basic format is XML with chapter-level granularity; the goal is to create corpora with word-level granularity. At sentence level, a contextual translation into English or Finnish is provided; at word level, each word carries a morphological encoding that corresponds to its context. Preliminary morphological analysis is carried out with HFST-based transducers developed in the Giellatekno infrastructure of the University of Tromsø. The grammatical analysis and labeling follow the practices developed in the same infrastructure, which are applied in the documentation of several Uralic languages. More than a million words have been processed so far, and this amount is to be increased. ERME will be made available at http://korp.csc.fi.
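
The step from chapter-level to word-level granularity can be illustrated with a minimal Python sketch. It rests on assumptions: the description does not give ERME's actual XML schema, so the element names `text` and `chapter`, the attribute `n`, the helper `chapters_to_words`, and the English placeholder text are all hypothetical. The naive sentence and word splitting stands in for the HFST-based analysis, which is not reproduced here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical chapter-level document. ERME's real element and attribute
# names are not stated in the description, so these are placeholders.
SAMPLE = """<text>
  <chapter n="1">First sentence here. Second one follows.</chapter>
  <chapter n="2">Another chapter.</chapter>
</text>"""


def chapters_to_words(xml_string):
    """Return one record per word: (chapter id, sentence index, word)."""
    root = ET.fromstring(xml_string)
    records = []
    for chapter in root.iter("chapter"):
        text = (chapter.text or "").strip()
        # Naive sentence split after terminal punctuation (an assumption,
        # not the segmentation used by the corpus builders).
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
        for s_idx, sentence in enumerate(sentences):
            # Naive tokenization: runs of word characters.
            for word in re.findall(r"\w+", sentence):
                records.append((chapter.get("n"), s_idx, word))
    return records


records = chapters_to_words(SAMPLE)
for record in records:
    print(record)
```

Each record keeps its chapter and sentence position, so a sentence-level translation or a word-level morphological tag could later be attached to the matching entry.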

Created:

2024-01-31
