Parallel corpus of sentences in Komi-Permyak (/-Zyrian), Polish and English with annotations of personal information

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14845328


Resource description:

This is a parallel corpus of sentences in three languages: Komi (Komi-Permyak and Komi-Zyrian), Polish, and English. The sentences were extracted from the Universal Dependencies treebanks for Komi. Of the 366 extracted sentences, 170 were translated into Polish and English.

The sentences were first translated from Komi-Permyak and Komi-Zyrian into English by one of the authors of this corpus (a native Komi-Permyak speaker and a proficient English speaker) with the help of multiple translation tools such as Neurotõlge, Google Translate, and Majbyr Translate. The Polish translations were created by one of the authors of this corpus (a native Polish speaker and a proficient English speaker) on the basis of the English translations, with the help of Google Translate in some cases. The original names of people and places were preserved during translation into English, and a human always reviewed the final form of each translated sentence.

Words in the sentences are accompanied by semantic tags used in the GiellaLT infrastructure. These tags classify names and nouns into categories that can be used to identify nouns that are possible instances of personal information.

The primary purpose of this corpus is to study how personal information is expressed in different languages, specifically in languages with varied linguistic resource availability. The resource can also be used to evaluate language models on tasks involving the detection and identification of personal information.

The dataset is presented, used, and described in more detail in the following upcoming publication: Nikolai Ilinykh, Maria Irena Szawerna. 2025. "I Need More Context and an English Translation": Analysing How LLMs Identify Personal Information in Komi, Polish, and English. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains, Tallinn, Estonia.

Created:

2025-02-10