German Summary Corpus (GerSumCo) v1.0.0
Source: hdl.handle.net (indexed 2025-01-09)
Download link:
http://hdl.handle.net/20.500.12124/81
Description:
GerSumCo (German Summary Corpus) is a learner corpus of syntheses written by L2 German writers (CEFR B2/C1) and L1 German writers. It was created to support comparative analysis of the academic writing of L1 and L2 German students.
The two subcorpora (L1 and L2) contain a total of 286 texts (178 L1 and 108 L2), written by 286 students at 14 universities and language schools in Germany (Bamberg, Bochum, Dresden, Hamburg, Hildesheim, Kiel, Leipzig, Magdeburg, Osnabrück, Potsdam, Trier, Wuppertal), Poland (Gdansk) and China (Hangzhou). The texts were collected between 2022 and 2024 as part of a PhD research project that uses GerSumCo and Beldeko in a contrastive interlanguage analysis to identify L1-dependent cohesion features in L2/L1 German.
The metadata files (Meta_GerSumCo_L1 & Meta_GerSumCo_L2) contain the following information:
- Up to three L1s of the writers
- Up to three L2s of the writers
- Collection date
- Topic
- Whether the text was written as homework or in class
- Group of students the texts belonged to
The file names contain the following information:
- Whether the text is part of the L1 or L2 subcorpus
- Topic
The summaries average 230 words. The texts were produced either in class on computers or as homework, within a 60-minute time frame. Students were permitted to use online dictionaries, but no AI-based tools. They were required to summarise two texts on one of four topics related to language variation in German: Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch and Varianten-Wörterbuch des Deutschen.
This version contains the TXT files of the texts and the CSV files with the manual annotations of the texts. Each annotation row records the token ID, sentence ID, source text form, target form, automatically annotated lemma, STTS part-of-speech tag and simple UPOS part-of-speech tag.
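As a rough illustration of how the annotation CSV files might be processed, the sketch below counts tokens per sentence and per UPOS tag. The column names (`sentence_id`, `upos` and so on), the comma delimiter and the sample rows are all assumptions made up for this example; they are not taken from the corpus, so check the header row of the actual files before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real CSV headers may differ.
SENTENCE_COL = "sentence_id"
UPOS_COL = "upos"


def summarise_annotations(handle, delimiter=","):
    """Count tokens per sentence and per UPOS tag in one annotation file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    tokens_per_sentence = Counter()
    upos_counts = Counter()
    for row in reader:
        tokens_per_sentence[row[SENTENCE_COL]] += 1
        upos_counts[row[UPOS_COL]] += 1
    return tokens_per_sentence, upos_counts


# Made-up rows for illustration only; not taken from GerSumCo.
SAMPLE = """token_id,sentence_id,source_form,target_form,lemma,stts,upos
1,1,Der,Der,der,ART,DET
2,1,Text,Text,Text,NN,NOUN
3,1,endet,endet,enden,VVFIN,VERB
4,1,.,.,.,$.,PUNCT
5,2,Sie,Sie,sie,PPER,PRON
"""

sentences, upos = summarise_annotations(io.StringIO(SAMPLE))
print(dict(sentences))
print(dict(upos))
```

For a real file, the same function could be given a handle from `open(path, encoding="utf-8", newline="")`, with `delimiter` adjusted if the files turn out to use semicolons or tabs.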
Provider:
hdl.handle.net



