
Pennsylvania German word list (lemmatized and POS-annotated)

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/6613478
Description:
The file lists the words used in the Pennsylvania German part of the ENDE corpus (www.deitsch.eu). It contains every lemma together with its associated word forms documented in the corpus: 1761 lemmata and 2704 word forms in total. The ENDE corpus (“English-Deitsch translation corpus”) is the first POS-annotated and searchable text corpus in Pennsylvania German (= Deitsch; ISO language code: pdc), aligned to the English source texts.

Although many digital texts in Deitsch are available on the internet, there have so far been no digital corpora for this language. This is due mainly to the lack of a generally recognized standard variety that could serve as a reference point for the linguistic analysis needed for lemmatization and annotation. Lemmatization was done with the help of various lexicographic resources (https://www.deitsch.eu/news/view/9), most of which follow other spelling conventions. A fair number of word forms, especially English loanwords of various kinds, cannot be found in the dictionaries. Moreover, the variety used here is highly variable, not only in spelling but also in other aspects of the language.

Part-of-speech tags were assigned manually (see tagsets A and B below). These tagsets for part-of-speech annotation of Deitsch texts are based on the 2017 version of the STTS system created and widely used for German (https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/6063/file/Westpfahl_Schmidt_Jonietz_Borlinghaus_STTS_2_0_2017.pdf), which has been slightly modified and adapted to the corpus texts written in the Plain Deitsch variety. Tagset A gives a broader view and refers to the lemma level; tagset B is more fine-grained and suited to the individual word forms documented in the corpus. Only tags actually used in the annotation of the corpus texts are listed. Foreign items not integrated into the Deitsch text flow (e.g. English quotations) have been omitted.

For more details about the corpus and the project, please refer to the above-mentioned website.
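
The two-level structure described above (one tagset A tag per lemma, one tagset B tag per word form) could be represented roughly as follows. This is only a sketch: the actual file layout is not stated in this description, so the four-column row shape, the names `LemmaEntry` and `build_index`, and all sample values (`lemma1`, `A1`, `form1a`, `B1`, ...) are hypothetical placeholders, not entries or tags from the word list.

```python
from dataclasses import dataclass, field


@dataclass
class LemmaEntry:
    """One lemma with its coarse tag and its attested word forms."""
    lemma: str
    tag_a: str                                  # lemma-level tag (tagset A)
    forms: list = field(default_factory=list)   # (word form, tagset B tag) pairs


def build_index(rows):
    """Group (lemma, tag_a, form, tag_b) rows into one entry per lemma.

    The row shape is an assumption made for illustration only.
    """
    index = {}
    for lemma, tag_a, form, tag_b in rows:
        entry = index.setdefault(lemma, LemmaEntry(lemma, tag_a))
        entry.forms.append((form, tag_b))
    return index


# Placeholder rows, not real data from the corpus.
rows = [
    ("lemma1", "A1", "form1a", "B1"),
    ("lemma1", "A1", "form1b", "B2"),
    ("lemma2", "A2", "form2a", "B3"),
]

index = build_index(rows)
n_lemmata = len(index)
n_forms = sum(len(entry.forms) for entry in index.values())
print(n_lemmata, "lemmata,", n_forms, "word forms")
```

Run on the real file, the two counts would be expected to match the figures given in the description (1761 lemmata and 2704 word forms), provided each word form appears once per lemma.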
Created:
2024-07-16