Udmurt dialectal dataset: postpositions / relational nouns
Source: DataCite Commons. Updated 2025-03-25; indexed 2025-04-16.
Download link:
https://www.fdr.uni-hamburg.de/record/17135
Description:
This dataset contains sentences in various dialects of Udmurt / Beserman (Permic < Uralic; ISO 639-3 code udm). It mainly contains questionnaire responses collected for research on the morphosyntax of Udmurt relational nouns and postpositions, annotated for several parameters.
The data were collected in 2019-2024 by Timofey Arkhangelskiy. Part of the data was collected in Udmurtia, Tatarstan and Bashkortostan (Russia); another part was collected in the Estonian Udmurt community (Tallinn and Tartu). Data collection, annotation and publishing were supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project no. 428175960.
Before proceeding to the dataset, please keep in mind that:
Although I provide the questionnaire stimuli, the responses do not always contain their exact translations. Sometimes consultants forgot what exactly they were supposed to translate, added something to their translation, or translated only a part of the stimulus. It was not my goal for the translations to be close to the original. Therefore the stimulus and the response should not be treated as translation pairs.
The speakers were instructed to make translations in their own dialect rather than in the standard language. My transcriptions of their oral responses reflect all dialectal features and deviate from the standard language (sometimes significantly).
As a consequence:
If you want to use this dataset for its original purpose, you can just take the annotation and not look at the actual examples. If you need it for anything beyond that purpose, you will only be able to do so if you have a good command of Udmurt. There are English translations of the stimuli, but they won't help you much.
DO NOT USE THIS DATASET FOR TRAINING MACHINE TRANSLATION OR UDMURT LANGUAGE MODELS!
The dataset has a TSV format (tab-delimited values). Please refer to readme.txt for further information.
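For readers who only need the annotation, a tab-delimited file like this can be read with Python's standard csv module. The following is a minimal sketch, not the author's own tooling: the column names (id, dialect, marking) and the inline sample are hypothetical placeholders, and the real column names are the ones documented in readme.txt.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Yield one dict per data row of a tab-delimited file with a header row."""
    # QUOTE_NONE keeps any quotation marks in the transcriptions as literal
    # characters instead of treating them as CSV field quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def count_values(rows, column):
    """Count how often each value occurs in one annotation column."""
    return Counter(row[column] for row in rows)


# Hypothetical miniature sample; these column names are NOT taken from the
# dataset. Consult readme.txt for the actual columns.
SAMPLE = "id\tdialect\tmarking\n1\tA\tx\n2\tB\ty\n3\tA\tx\n"

if __name__ == "__main__":
    rows = list(read_tsv(io.StringIO(SAMPLE)))
    print(count_values(rows, "dialect"))
```

With the real file, the StringIO object would be replaced by something like open(path, encoding="utf-8", newline=""), where path points to the downloaded TSV file; UTF-8 encoding is an assumption here, so check readme.txt if decoding fails.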
If you have any questions or require help with processing the data, please feel free to contact Timofey Arkhangelskiy: timarkh@gmail.com.
Provider:
Universität Hamburg
Created:
2025-03-24



