ERME – Erzya and Moksha Extended Corpora, Korp Version

Mendeley Data. Updated 2024-01-31; indexed 2024-06-29.
Download link:
https://etsin.fairdata.fi/dataset/918eedfd-c3fe-4501-aa99-f4c232795d88
Description:
This resource is available in the Korp service of the Language Bank of Finland; see Access location. It contains the sentences of the original full texts of the ERME corpus in scrambled order. ERME contains predominantly Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk in 1997-2004 and has been mapped in Helsinki since 2004. The most basic format used is XML, with a granularity extending to chapter level. The goal is to create corpora with a granularity extending to word level. Plans for the next version: at sentence level, contextual translation (English or Finnish) will be used, while at word level there will be morphological encoding corresponding to each context. Preliminary morphological analysis will be carried out using HFST-based transducers developed in the Giellatekno infrastructure of the University of Tromsø. The grammatical analysis and labeling follow the practices developed in the same infrastructure, which are applied in the documentation of several Uralic languages. More than a million words have been processed so far, and this amount is to be increased. ERME is available at http://korp.csc.fi.

Created:
2024-01-31