INEL Dolgan Corpus

Name: INEL Dolgan Corpus
Creator: Universität Hamburg
Published: 2025-09-12 12:10:07
License: No description available

Source: DataCite Commons. Updated 2025-09-12; indexed 2025-04-16.

Download link:

https://www.fdr.uni-hamburg.de/record/9746

Description:

Corpus Citation

Däbritz, Chris Lasse; Kudryakova, Nina; Stapert, Eugénie. 2022. INEL Dolgan Corpus. Version 2.0. Publication date 2022-11-30. https://hdl.handle.net/11022/0000-0007-F9A7-4. Archived at Universität Hamburg. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1.

Corpus Description

The INEL Dolgan corpus was created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016–2033. The corpus makes possible typologically aware, corpus-based grammatical research on the Dolgan language and expands the documentation of the lesser-described indigenous languages of Northern Eurasia.

The INEL Dolgan corpus is composed of texts from different sources:

1. Published folklore texts from an edited volume ("Fol'klor Dolgan", P.E. Efremov 2000).
2. Transcripts of recordings obtained from the Taymyr House of Folk Art (TDNT) in Dudinka (1970s–2000s).
3. Transcripts from the collection of Dr. Eugénie Stapert, recorded on several fieldwork trips in 2007–2010.
4. Transcripts of recordings made on a fieldwork trip in 2017.

The first group and parts of the third group had already been transcribed and translated; the rest of the recordings were transcribed and translated within the INEL project. Each text in the corpus is provided with morphological glossing, translations into English, Russian and German, and annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles and information structure/information status.

New in release 2.0

1. 20 glossed transcripts (2864 utterances, 19989 tokens) with 03:33:14 hours of corresponding sound.
2. 37 audio files with 10:00:36 hours of sound without glossed transcripts.
3. Corrections of grammatical analyses and glossing according to the findings in Däbritz's (2022) grammar, as well as cross-corpora harmonizations.
4. Additional corpus-wide annotation of Mongolic borrowings.
5. Additional corpus-wide annotation of existential, locative and possessive predication.
6. Corrections in further annotations, translations and metadata.

Funding

The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies' Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies' Programme is coordinated by the Union of the German Academies of Sciences and Humanities.

Provider:

Universität Hamburg

Created:

2021-12-14
