theme-d-Prose 1848-1920
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12666499
Description:
The literary text corpus "theme-d-Prose 1848-1920" is a specialized collection of 1,227 German-language literary prose texts (shortest: 2,048 words; longest: 100,909 words).
It is an extended subcorpus of "d-Prose 1870-1920" (Gius/Guhr/Adelmann 2021), which is the source of 804 of its text files. Those files were originally taken from the corpus KOLIMO (Herrmann and Lauer 2017), which in turn is based on the repositories Gutenberg-DE (Projekt Gutenberg-DE, Hille & Partner, and Reuters 2017), Deutsches Textarchiv (Geyken et al. 2011), and TextGrid (TextGrid Repository 2020).
An additional 423 texts were added from Gutenberg-DE (Projekt Gutenberg-DE, Hille & Partner, and Reuters 2017), the Deutsches Textarchiv (Geyken et al. 2011), and the Wikisource Collections (Wikimedia Stiftung 2023).
In terms of social and literary history, the selected publication period (1848-1920) covers a time of many political changes in the German-speaking area with a diverse literary production that can retrospectively be grouped into the following major literary movements and epochs: the end of pre-March and the Biedermeier, Realism, Naturalism, Modernism, and Expressionism.
In addition to the nationality of the authors and the years of publication, the texts for "theme-d-Prose 1848-1920" were selected primarily on the basis of the 'non-fictional elements' present in their fiction, such as geographical, political, or historical entities. These elements were found to share a common spatial and temporal setting: the fictional worlds represent the German-speaking area of the 19th and early 20th centuries. This makes "theme-d-Prose 1848-1920" an extended and thematically specialized subcorpus of "d-Prose 1870-1920".
Each text was manually tagged with an XML markup element from the TEI standards (TEI Consortium 2022, Burnard, Schöch, and Odebrecht 2021). In addition, indications of volume (e.g. "Buch 1" or "Band 2") and chapter headings or chapter numbers were manually tagged with XML markup elements for volumes and for chapters.
The text corpus is available as a zipped folder of XML files and has been enriched with metadata in an additional CSV file.
The CSV file contains basic metadata about:
- the authors: name, pseudonyms, year of birth and death, gender, nationality, authority file identifier (GND ID, Deutsche Nationalbibliothek 2024)
- the corpus texts: title, number of words, date of first publication, genre (as far as this information is directly documented in the consulted repositories, i.e. in the metadata of d-Prose, or in Wikipedia and the consulted literary encyclopedias and literary histories, such as Arnold (2020), Killy and Kühlmann (2008), Stein and Stein (2008), or Brenner (2011))
- the fiction of the texts: information about the time and place of the fiction, including keywords and quotes taken from the texts
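
As a rough illustration of how the zipped XML files could be processed, the following minimal sketch iterates over the XML members of a zip archive and counts whitespace-separated tokens per file. The function names, the demo file names, and the demo markup are hypothetical; the actual archive layout, the TEI element names, and the word-counting method behind the published figures are not documented here, so the results need not match the metadata.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def count_words_in_zip(zip_source):
    """Return {member name: whitespace-token count} for every .xml member."""
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip non-XML members such as a CSV or README
            root = ET.fromstring(archive.read(name))
            # itertext() yields every text node regardless of element names,
            # so no assumption about the tag set is made. Header text, if
            # present in a file, is counted as well.
            text = " ".join(root.itertext())
            counts[name] = len(text.split())
    return counts


def build_demo_zip():
    """Build a tiny in-memory archive (hypothetical content) for a dry run."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "demo_text.xml",
            "<doc><part>Ein kurzer Satz.</part><part>Noch ein Satz.</part></doc>",
        )
        archive.writestr("notes.txt", "ignored")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(count_words_in_zip(build_demo_zip()))
```

To apply the sketch to the downloaded corpus, pass the path of the zip file instead of the in-memory demo archive.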

| number of texts | sum in words | average | median | standard deviation |
|---|---|---|---|---|
| 1,227 | 34,534,384 | 28,145.38 | 16,971 | 27,036.46 |
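
The summary figures above can be recomputed from a list of per-text word counts, for example from the word-count column of the metadata CSV. The sketch below is a minimal example using Python's standard library; the function name `summarize` and the three demo values are illustrative, and it is an assumption that the published standard deviation is the sample (not the population) standard deviation.

```python
import statistics


def summarize(word_counts):
    """Compute the five summary figures for a list of per-text word counts."""
    return {
        "number of texts": len(word_counts),
        "sum in words": sum(word_counts),
        "average": round(statistics.mean(word_counts), 2),
        "median": statistics.median(word_counts),
        # Assumption: sample standard deviation; use statistics.pstdev
        # for the population variant instead.
        "standard deviation": round(statistics.stdev(word_counts), 2),
    }


if __name__ == "__main__":
    # Three illustrative values only, not the real corpus distribution.
    print(summarize([2048, 16971, 100909]))

    # Consistency check on the published totals: sum / number of texts.
    print(round(34_534_384 / 1_227, 2))
```

The final line reproduces the published average from the published sum and text count.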

| authors | number | percentage |
|---|---|---|
| total number of authors | 346 | 100% |
| number of female authors | 89 | 25.7% |
| number of male authors | 257 | 74.3% |

| texts | number of texts | percentage of the entire corpus |
|---|---|---|
| by female authors | 315 | 25.7% |
| by male authors | 912 | 74.3% |

| decade | number of texts | number of words |
|---|---|---|
| 1848-1860 | 137 | 2,382,896 |
| 1861-1870 | 135 | 2,353,938 |
| 1871-1880 | 121 | 3,584,370 |
| 1881-1890 | 162 | 4,684,519 |
| 1891-1900 | 256 | 6,695,539 |
| 1901-1910 | 258 | 8,488,707 |
| 1911-1920 | 158 | 6,344,415 |

Clusters of Subcorpora:
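
| cluster name | subcorpus | number of texts | number of words | shortest text (words) | longest text (words) |
|---|---|---|---|---|---|
| author gender | female authors | 316 | 9,938,452 | 2,500 | 100,909 |
| author gender | male authors | 912 | 24,595,932 | 2,048 | 100,180 |
| publication date | 1848-1870 | 272 | 4,736,834 | 2,201 | 99,954 |
| publication date | 1871-1900 | 540 | 14,964,428 | 2,316 | 100,780 |
| publication date | 1901-1920 | 416 | 14,833,122 | 2,048 | 100,909 |
| text length | very short texts | 425 | 2,327,416 | 2,048 | 9,959 |
| text length | short texts | 461 | 9,250,959 | 10,029 | 39,878 |
| text length | medium texts | 196 | 10,741,578 | 40,001 | 69,387 |
| text length | long texts | 145 | 12,214,431 | 70,208 | 100,909 |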
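
The text-length subcorpora can be approximated by binning texts on their word count. The following minimal sketch assumes class boundaries at 10,000, 40,000, and 70,000 words; these thresholds are inferred from the shortest and longest texts reported for each length class and are not stated in the corpus description. The function name `length_class` and the treatment of values exactly on a boundary are likewise assumptions.

```python
def length_class(word_count, boundaries=(10_000, 40_000, 70_000)):
    """Assign a text to a length class by its word count.

    Assumption: the boundaries are inferred from the reported per-class
    minima and maxima; a count equal to a boundary falls into the upper class.
    """
    labels = ("very short", "short", "medium", "long")
    for label, upper in zip(labels, boundaries):
        if word_count < upper:
            return label
    return labels[-1]


if __name__ == "__main__":
    # Shortest and longest word counts reported for each length class.
    for count in (2_048, 9_959, 10_029, 39_878, 40_001, 69_387, 70_208, 100_909):
        print(count, length_class(count))
```

With these assumed boundaries, every reported class minimum and maximum falls into its own class.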

Created: 2024-07-08



