theme-d-Prose 1848-1920

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/12666499

Resource description:

The literary text corpus "theme-d-Prose 1848-1920" is a specialized collection of 1,227 German-language literary prose texts (shortest: 2,048 words; longest: 100,909 words). It is an extended subcorpus of "d-Prose 1870-1920" (Gius/Guhr/Adelmann 2021), which is the source of 804 of its text files. Those files were originally taken from the corpus KOLIMO (Herrmann and Lauer 2017), which in turn is based on the repositories Gutenberg-DE (Projekt Gutenberg-DE, Hille & Partner, and Reuters 2017), Deutsches Textarchiv (Geyken et al. 2011), and TextGrid (TextGrid Repository 2020). An additional 423 texts were added from Gutenberg-DE (Projekt Gutenberg-DE, Hille & Partner, and Reuters 2017), the Deutsches Textarchiv (Geyken et al. 2011), and the Wikisource Collections (Wikimedia Stiftung 2023).

In terms of social and literary history, the selected publication period (1848-1920) covers a time of many political changes in the German-speaking area, with a diverse literary production that can retrospectively be grouped into the following major literary movements and epochs: the end of pre-March and the Biedermeier, Realism, Naturalism, Modernism, and Expressionism.

In addition to the nationality of the authors and the years of publication, the texts for "theme-d-Prose 1848-1920" were selected primarily on the basis of the 'non-fictional elements' present in their fiction, such as geographical, political, or historical entities. These elements were found to share a common spatial and temporal setting: the fictional worlds represent the German-speaking area of the 19th and early 20th centuries. This makes "theme-d-Prose 1848-1920" an extended and thematically specialized "d-Prose 1870-1920" subcorpus.

Each text was manually tagged with the markup XML element from the TEI standards (TEI Consortium 2022, Burnard, Schöch, and Odebrecht 2021). In addition, indications of volume (e.g. "Buch 1" or "Band 2") and chapter headings or chapter numbers were manually tagged with markup XML elements for volumes and for chapters.

The text corpus is available as a zipped folder of XML files and has been enriched with metadata in an additional CSV file. The CSV file contains basic metadata about:

- the authors: name, pseudonyms, year of birth and death, gender, nationality, authority file identifier (GND ID, Deutsche Nationalbibliothek 2024)
- the corpus texts: title, number of words, date of first publication, genre (as far as this information is directly documented in the consulted repositories, i.e. in the metadata of d-Prose, or Wikipedia and the consulted literary encyclopedias and literary histories, such as Arnold (2020), Killy and Kühlmann (2008), Stein and Stein (2008), or Brenner (2011))
- the fiction of the texts: information about the time and place of the fiction, including keywords and quotes taken from the texts

number of texts   sum in words   average     median   standard deviation
1,227             34,534,384     28,145.38   16,971   27,036.46

total number of authors    346   100%
number of female authors   89    25.7%
number of male authors     257   74.3%

                    number of texts   percentage of the entire corpus
by female authors   315               25.7%
by male authors     912               74.3%

decade      number of texts   number of words
1848-1860   137               2,382,896
1861-1870   135               2,353,938
1871-1880   121               3,584,370
1881-1890   162               4,684,519
1891-1900   256               6,695,539
1901-1910   258               8,488,707
1911-1920   158               6,344,415

Clusters of Subcorpora:

cluster name       subcorpora         number of texts   number of words   shortest text   longest text
author gender      female authors     316               9,938,452         2,500           100,909
                   male authors       912               24,595,932        2,048           100,180
publication date   1848-1870          272               4,736,834         2,201           99,954
                   1871-1900          540               14,964,428        2,316           100,780
                   1901-1920          416               14,833,122        2,048           100,909
text length        very short texts   425               2,327,416         2,048           9,959
                   short texts        461               9,250,959         10,029          39,878
                   medium texts       196               10,741,578        40,001          69,387
                   long texts         145               12,214,431        70,208          100,909
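The per-decade figures in the description can be cross-checked against the corpus-wide summary row. The following is a minimal sketch in Python, not part of the dataset: the numbers are copied from the description, while the names `DECADES`, `totals`, `mean_length`, and `share_of_texts` are hypothetical helpers introduced only for this illustration.

```python
# Per-decade figures copied from the description: (number of texts, number of words).
# The dictionary and function names are illustrative, not part of the dataset.
DECADES = {
    "1848-1860": (137, 2_382_896),
    "1861-1870": (135, 2_353_938),
    "1871-1880": (121, 3_584_370),
    "1881-1890": (162, 4_684_519),
    "1891-1900": (256, 6_695_539),
    "1901-1910": (258, 8_488_707),
    "1911-1920": (158, 6_344_415),
}


def totals(decades):
    """Return (total number of texts, total number of words)."""
    texts = sum(t for t, _ in decades.values())
    words = sum(w for _, w in decades.values())
    return texts, words


def mean_length(decades):
    """Return the average text length in words, rounded to two decimals."""
    texts, words = totals(decades)
    return round(words / texts, 2)


def share_of_texts(decades):
    """Return each decade's share of all texts, in percent (one decimal)."""
    texts, _ = totals(decades)
    return {d: round(100 * t / texts, 1) for d, (t, _) in decades.items()}


if __name__ == "__main__":
    n_texts, n_words = totals(DECADES)
    print(f"texts: {n_texts}, words: {n_words}, mean: {mean_length(DECADES)}")
    for decade, share in share_of_texts(DECADES).items():
        print(f"{decade}: {share}% of texts")
```

The decade rows add up to the 1,227 texts and 34,534,384 words of the summary row, and dividing the two reproduces the stated average of 28,145.38 words per text. The median and standard deviation cannot be checked this way, since they depend on the individual word counts in the CSV file.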

Created:

2024-07-08
