
MiMe-MeMo/Corpus-v1.1

Source: Hugging Face. Updated 2024-02-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/MiMe-MeMo/Corpus-v1.1
Resource description:
---
license: cc-by-4.0
language:
- da
---

# MeMo corpus v1.1

Jens Bjerring-Hansen, Philip Diderichsen, Dorte Haltrup Hansen, June 2023

This is data release version 1.1 of the MeMo corpus comprising almost all Danish novels from the period 1870-1899, known as the Modern Breakthrough. The current version of the corpus is publicly viewable and searchable at <https://alf.hum.ku.dk/korp/?mode=memo_all>.

The corpus has been enhanced since version 1.0 with the following 19 titles that have been reprocessed or added to the corpus.

1. Vilhelm Bergsøe: Bruden fra Rørvig (1872)
2. Johanne Schjørring: Rige Dage (1877)
3. Anonymous: Tante Jacobine (1878)
4. Jonas Lie: Rutland (1880)
5. Vilhelm Malling: Fra Kjøbstadlivet i gamle Dage (1882)
6. Adda Ravnkilde: To Fortællinger (1884)
7. Henrik Pontoppidan: Ung Elskov (1885)
8. Therese Brummer: Som man gifter sig (1888)
9. Henrik Pontoppidan: Natur (1890)
10. R.H.: En Kjøbenhavners Livshistorie eller Lykkens Omskiftelser (1891)
11. Henrik Pontoppidan: Minder (1893)
12. Johannes Jørgensen: Hjemvee (1894)
13. Henrik Pontoppidan: Nattevagt (1894)
14. Jonas Lie: Naar Sol gaar ned (1895)
15. Gustav Wied: Ungdomshistorier (1895)
16. Herman Bang: Ludvigsbakke (1896)
17. Cornelia Levetzow: Havemanden (1896)
18. Karl Larsen: Kresjan Vesterbro (1897)
19. Christian Christensen: Kærlighedens Mysterier (1899)

The release contains the following files:

| File | Contents |
| :--- | :--- |
| texts | Text files of the now 558 novels in the corpus. The text has a newline at line breaks in the book, and two newlines at page breaks. Some of the texts (the ones originally set in Fraktur) have been post-OCR-corrected using a procedure described in Bjerring-Hansen et al. (2022). The rest have also been post-OCR-corrected: error types were identified manually and the corrections implemented with look-up in the dictionary (Sprogteknologisk Ordbase, STO) to avoid the creation of new errors. This cautious method has the consequence that not all errors were corrected. |
| normalised | Orthographically normalized versions of the 558 texts. Same format as the files in "texts", normalized to Danish standard spelling. Nouns were lower-cased, "aa" changed to "å", and frequent character patterns changed to obey the Danish orthography norm from 1948. As in the error-corrected version of the corpus, character patterns were identified manually and mainly implemented with look-up in the dictionary (Sprogteknologisk Ordbase, STO) to avoid overgeneration. The method has the consequence that not all words were normalized. |
| memo_all.vrt | VRT file (vertical format) of MeMo corpus v1.1 for indexing in Corpus Workbench (CWB). Format: One token per line delimited by \<corpus>, \<text>, and \<sentence> XML elements. The XML elements contain attributes with metadata. The tokens are annotated with various categories separated by tabs. For more information about the metadata, see the metadata excel file. For more information about the token annotations, see below. |
| MeMo-corpus-metadata-v1.1-2023-06-20.xlsx | Excel file with metadata about the novels in the corpus. See the "info" tab for information about the metadata categories. |

**Token annotations and metadata in VRT file**

There are nine columns of tokens and annotations in the corpus VRT file:

| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Column 8 | Column 9 |
| :------- | :--------- | :--------- | :------------- | :------------------- | :--------------- | :--------------- | :--------------- | :------- |
| Token | Normalized | Lemma form | Part of speech | Word no. in sentence | Word no. in line | Word no. in book | Line no. on page | Page no. |

For information about the metadata also contained in the VRT file, see the file MeMo-corpus-metadata-v1.1-2023-06-20.xlsx.

**References**

Bjerring-Hansen, Jens, et al. "Mending Fractured Texts. A heuristic procedure for correcting OCR data." (2022). <https://ceur-ws.org/Vol-3232/paper14.pdf>

**Data Statement**

## 1. Header

1. Dataset Title: MeMo Corpus
2. Dataset Curator(s) [name, affiliation]: Jens Bjerring-Hansen, University of Copenhagen; Philip Diderichsen, University of Copenhagen; Dorte Haltrup Hansen, University of Copenhagen
3. Dataset Version [version, date]: Version 1.1, August 15, 2023
4. Dataset Citation and, if available, DOI: ####
5. Data Statement Author(s) [name, affiliation]: Jens Bjerring-Hansen, University of Copenhagen; Philip Diderichsen, University of Copenhagen
6. Data Statement Version [version, date]: Version 1, September 25, 2023
7. Data Statement Citation: ####

## 2. Executive summary

The MeMo corpus is established to investigate literary and cultural change in a seminal epoch of Scandinavian cultural and social history (known as 'the modern breakthrough') using natural language processing and other computational methods. The corpus consists of original novels by Norwegian and Danish authors printed in Denmark in the period 1870-99. It includes 858 volumes, totaling 4.5 million sentences and 65 million words.

## 3. Text characteristics

The corpus consists of novels, i.e.
long works of narrative fiction, usually written in prose and published as a book. The novels contain both dialogue and description. As instances of imaginative literature they are infused with ambiguity, interpretational confounding, rhetorical sophistication, and narrative layerings between author, narrator, and characters.

The cultural diversity of the texts in the corpus is pronounced. From a genre perspective, we have contemporary novels as well as historical novels and other forms of genre fiction such as romance, crime, and war stories (cf. Bjerring-Hansen and Rasmussen, 2023). And from an aesthetic perspective we have both avant-garde forms of realism, including instances of naturalism and impressionism, and more traditional prose with a preference for abstract or generalized over concrete specification (cf. Bjerring-Hansen and Wilkens, 2023).

Bjerring-Hansen, Jens, and Sebastian Ørntoft Rasmussen. 2023. “Litteratursociologi og kvantitative litteraturstudier: Den historiske roman i det moderne gennembrud som case”. In Passage 89: 171–189.

Bjerring-Hansen, Jens, and Matt Wilkens. 2023. “Deep distant reading: The rise of realism in Scandinavian literature as a case study”. Orbis Litterarum. [doi:10.1111/oli.12396](https://doi.org/10.1111/oli.12396)

## 4. Curation Rationale

The MeMo Corpus was created as the basis for a research project, _MeMo – Measuring Modernity: Literary and Social Change in Scandinavia 1870-1900_, investigating how processes of social change in late nineteenth century Scandinavia were reflected and discussed in the novels from the period (project page: [https://nors.ku.dk/english/research/projects/measuring-modernity/](https://nors.ku.dk/english/research/projects/measuring-modernity/)).
As opposed to traditional historiography on the period, which has focused on selected texts by a few prominent, male authors, our digital corpus, with rich metadata on texts and authors, allows for the capturing of robust literary and sociological trends and for new insights into the processes of modernization in this formative period in the literary and social history of Scandinavia. To this corpus we thus ask questions such as: How did this breakthrough of new ways of thinking and writing actually unfold? Who were the actors? And to what extent did newness relate to literature at large?

Also, the corpus acts as the empirical foundation of an interrelated methodological project, _Mining the Meaning_, which aims to develop state-of-the-art computational semantic methods and to train large language models for written late 19th-century Danish and Norwegian (project page: [https://mime-memo.github.io/](https://mime-memo.github.io/)).

Included in the corpus are all original (i.e. newly written) novels by Danish and Norwegian authors published in Denmark 1870-99. The list of texts was compiled on the basis of _Dansk Bogfortegnelse_ (a continuous list of books published in Denmark since 1841; from 1861 published annually) supplemented with literary handbooks and special bibliographies.

Not included in the corpus (mainly for pragmatic reasons and for the sake of coherence) are:

* reprints
* translations
* serializations (i.e. serialized novels from newspapers and magazines)
* diasporic literature (i.e. novels by Danish emigrant authors in the U.S.)

Around 20% of the novels were written by female authors. Highlighting and exploring the often overlooked female literary production of the period is thus a distinctive ambition of the corpus and the explorations based on it.

## 5. Language Varieties

The language of the novels in the corpus is late nineteenth century Danish (BCP-47: da).
On the whole, we are dealing with a more or less linguistically coherent body of texts. However, the following circumstances must be acknowledged:

* The texts contain a pronounced spelling variation, partly on an individual level, partly explained by an ongoing orthographic standardization, which is most clearly expressed in the Spelling Reform of 1892. Here, forms such as 'Kjøbenhavn' and 'Familje' became 'København' and 'Familie'.
* Some books are written in dialect (e.g. Jutlandic or West Norwegian) or contain dialectal features to create psychological individualism in the dialogue.
* Approximately 16% of the books are written by Norwegian authors. In this regard it should be noted that, until 1907, written Norwegian was practically identical to written Danish. ‘Norvagisms’ (i.e. distinct Norwegian words, not used by Danes) do appear.

## 6. Preprocessing and data formatting

**OCR scans**: The book volumes were scanned with optical character recognition (OCR) by the Royal Danish Library’s Digitization on Demand (DoD) team. The data were delivered as full volume PDF files with the OCR’ed text as an invisible searchable, copyable text layer, as full volume text files, and as single page text files (one text file per page for each volume).

**OCR correction**: The text files were automatically post-corrected for OCR errors. This involved two different processes, one for texts originally typeset in Antikva (Roman) typefaces, one for texts typeset in Fraktur (Gothic) typefaces. The Antikva files were corrected using a set of hand-crafted substitution patterns, with look-up in the dictionary Sprogteknologisk Ordbase, STO (Eng. ‘Word database for language technology’).
The Fraktur files were corrected using a procedure that combined spelling correction, hand-crafted pattern substitution, and improved OCR using the pretrained “Fraktur” Tesseract data plus an alternative OCR layer from the pretrained “dan” Tesseract data, which was used as a corrective to problems with the Danish characters “æ” and “ø” in particular. This procedure reduced the word error rate of the Fraktur data from 10.46% to 2.84% (cf. Bjerring-Hansen et al. 2022).

Bjerring-Hansen, Jens, Philip Diderichsen, Dorte Haltrup Hansen, and Ross D. Kristensen-McLachlan. 2022. “Mending fractured texts. A heuristic procedure for correcting OCR.” Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference, Uppsala, Sweden, March 15-18, 2022 (DHNB 2022): 177–186.

**Token-level annotation**: The corrected data were annotated with grammatical information using the pipeline orchestration tool Text Tonsorium available at [https://cst.dk/texton/](https://cst.dk/texton/), provided by the Danish CLARIN node. The particular pipeline used included the LaPos part of speech tagger, the CSTLemma lemmatizer, and an implementation of the Brill tagger. Grammatical information included lemma and part of speech, plus sentence and paragraph segmentation (which are of course not strictly speaking token-level annotations). In addition to the grammatical annotations, convenience annotations with various counters were also added: word number in sentence, word number on line, word number in book volume, line number on page, page number in book volume.

**Text normalization**: After OCR correction, all texts were normalized to modern Danish spelling using hand-crafted substitution patterns and lookup in STO (see above). Nouns were lower-cased, “aa” changed to “å”, and frequent character patterns changed to obey modern Danish orthography.
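The dictionary-guarded substitution described for OCR correction and normalization can be illustrated with a minimal sketch. It is not the project's actual code: the tiny `LEXICON` set stands in for STO, and the single `aa` to `å` pattern is only one of the rules mentioned above. A candidate form is accepted only if the dictionary confirms it, which is why some words stay unnormalized.

```python
import re

# Hypothetical stand-in for the STO dictionary (a handful of modern forms).
LEXICON = {"går", "år", "måne", "hus", "små"}

# Illustrative hand-crafted patterns; assumed for this sketch only.
PATTERNS = [
    (re.compile(r"aa"), "å"),
]


def normalise_token(token, lexicon=LEXICON, patterns=PATTERNS):
    """Return the normalised form if the dictionary confirms it, else the token unchanged."""
    candidate = token.lower()  # nouns were lower-cased
    for pattern, replacement in patterns:
        candidate = pattern.sub(replacement, candidate)
    # Cautious step: only accept forms found in the dictionary.
    return candidate if candidate in lexicon else token


if __name__ == "__main__":
    for word in ["Maane", "gaar", "Kjøbenhavn"]:
        print(word, "->", normalise_token(word))
```

Because "Kjøbenhavn" has no confirmed candidate in the toy lexicon, it is returned as-is, mirroring the statement that not all words were normalized.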
**VRT transformation**: After annotation with token-level categories and metadata, the data were transformed to a VRT file (vertical format) for indexing in Corpus Workbench (CWB). Format: One token per line delimited by &lt;corpus>, &lt;text>, and &lt;sentence> XML elements. The XML elements contain attributes with metadata. The tokens are annotated with the above-mentioned token-level annotations, separated by tabs. For more information about the metadata, see below.

The data are available as:

* OCR-corrected full volume text files
* Normalized full volume versions of these text files
* A single VRT file containing the whole corpus.

## 7. Limitations

A standard limitation of data preprocessed and annotated using automatic natural language processing tools and procedures is that the results are not perfect. Thus, basically all the layers of the data can be assumed to be flawed:

* Text data: The raw texts come from OCR scans of the physical book volumes. This process is not perfect, and although we have taken steps to mitigate errors, the basic text layer of the data can still be expected to have OCR errors (or wrong corrections) in 2-3% of tokens.
* Normalized data: The normalization to modern Danish spelling as such should not be expected to be perfect either. We currently do not have estimates of the error rate in the normalized data.
* Grammatical annotations: These are also added using automatic tools which cannot be expected to yield perfect results. We currently do not have estimates of error rates in the grammatical annotations.
* Metadata: The metadata are hand-curated by literary scholars and should be close to perfect. However, the occasional human error can of course not be ruled out.

## 8. Metadata

The metadata were curated with the help of students (Lasse Stein Holst, Lene Thanning Andersen, and Kirstine Nielsen Degn) on the basis of _Dansk Bogfortegnelse_ (1861-), [https://www.litteraturpriser.dk/](https://www.litteraturpriser.dk/), Ehrencron-Müller: _Anonym- og Pseudonym-Lexikon_ (1940) as well as additional literary and bibliographical handbooks.

Among the metadata categories are the following:

* file_id
* filename
* [author] firstname
* [author] surname
* [author] pseudonym
* [author] gender [m/f/unknown]
* [author] nationality [da/no/unknown]
* title
* subtitle
* volume
* year [of publication]
* pages [in total]
* illustrations [y/n]
* typeface [gothic/roman]
* publisher
* price

## 9. Disclosure and Ethical Review

Funding for the creation and curation is supplied by The Carlsberg Foundation through a Young Researcher Fellowship awarded to Jens Bjerring-Hansen, University of Copenhagen.

In terms of data management, the project data (novels from 1870-1900) consist of imaginative texts by non-living authors. The texts are out-of-copyright. From a GDPR perspective, the biographical, bibliographical and demographic data are historical as well as non-sensitive.
Provider:
MiMe-MeMo
Summary of the original information

MeMo Corpus v1.1

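Overview

MeMo Corpus v1.1 comprises almost all Danish novels from the period 1870-1899, known as the Modern Breakthrough. The dataset was released in June 2023 by Jens Bjerring-Hansen, Philip Diderichsen, and Dorte Haltrup Hansen.

Dataset version

  • Version: 1.1
  • Release date: June 2023

Data files

The dataset contains the following files:

| File | Contents |
| :--- | :--- |
| texts | Text files of the 558 novels. The text has one newline at each line break in the book and two newlines at each page break. Some texts (those originally set in Fraktur) were post-OCR-corrected with a dedicated procedure; the rest were also post-OCR-corrected. Error types were identified manually and applied with look-up in the dictionary (Sprogteknologisk Ordbase, STO) to avoid creating new errors. Because the method is cautious, not all errors were corrected. |
| normalised | Orthographically normalized versions of the 558 novels. Same format as the files in "texts", normalized to Danish standard spelling. Nouns were lower-cased, "aa" changed to "å", and frequent character patterns changed to follow the Danish orthography norm of 1948. As in the error-corrected version, character patterns were identified manually and applied with look-up in the dictionary (Sprogteknologisk Ordbase, STO) to avoid overgeneration. As a result, not all words were normalized. |
| memo_all.vrt | VRT file (vertical format) of MeMo Corpus v1.1 for indexing in Corpus Workbench (CWB). Format: one token per line, delimited by \<corpus>, \<text>, and \<sentence> XML elements. The XML elements contain attributes with metadata. The tokens carry several annotation categories separated by tabs. For more information about the metadata, see the metadata Excel file. For more information about the token annotations, see below. |
| MeMo-corpus-metadata-v1.1-2023-06-20.xlsx | Excel file with metadata about the novels in the dataset. See the "info" tab for information about the metadata categories. |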
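The layout convention of the files in "texts" (one newline per printed line, two newlines per page break) can be turned back into pages and lines with a small helper. This is a minimal sketch that assumes the convention holds exactly as stated; the function name and the sample string are made up for illustration.

```python
def split_pages(text):
    """Split a volume into pages (separated by a blank line), each a list of printed lines."""
    pages = []
    for chunk in text.split("\n\n"):
        if chunk.strip():  # skip empty chunks, e.g. from trailing newlines
            pages.append(chunk.strip("\n").split("\n"))
    return pages


if __name__ == "__main__":
    sample = "Første Linje\nanden Linje\n\nNy Side\n"  # made-up text, two pages
    for number, lines in enumerate(split_pages(sample), start=1):
        print(number, lines)
```

Counting the lines per page this way should line up with the "line no. on page" and "page no." counters in the VRT file, although that correspondence is an assumption not verified here.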

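Token annotations and metadata

There are nine columns of tokens and annotations in the VRT file:

| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Column 8 | Column 9 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Token | Normalized | Lemma form | Part of speech | Word no. in sentence | Word no. in line | Word no. in book | Line no. on page | Page no. |

For more information about the metadata contained in the VRT file, see the file MeMo-corpus-metadata-v1.1-2023-06-20.xlsx.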
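The nine-column vertical format can be read with a few lines of standard-library code. The following is a sketch under assumptions: the column keys are names chosen here for the nine documented columns, and the attribute names and part-of-speech values in the sample (`title`, `year`, `PRON`, `V`) are invented, since the actual attribute names in memo_all.vrt are documented in the metadata file rather than on this page.

```python
import re

# Keys chosen for this sketch, following the documented column order.
COLUMNS = [
    "token", "normalized", "lemma", "pos",
    "word_in_sentence", "word_in_line", "word_in_book",
    "line_on_page", "page",
]
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_sentences(lines):
    """Yield (text_attributes, tokens) for each <sentence> element in VRT lines."""
    text_attrs, sentence = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<sentence"):
            sentence = []
        elif line.startswith("</sentence"):
            if sentence is not None:
                yield text_attrs, sentence
            sentence = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structural tags, e.g. <corpus> or </text>
        elif line and sentence is not None:
            sentence.append(dict(zip(COLUMNS, line.split("\t"))))


# Made-up sample in the documented layout (values are illustrative only).
SAMPLE = [
    '<corpus id="memo_all">',
    '<text title="Eksempel" year="1885">',
    '<sentence id="1">',
    "\t".join(["Han", "han", "han", "PRON", "1", "1", "1", "1", "1"]),
    "\t".join(["gaar", "går", "gå", "V", "2", "2", "2", "1", "1"]),
    "</sentence>",
    "</text>",
    "</corpus>",
]

if __name__ == "__main__":
    for attrs, tokens in read_sentences(SAMPLE):
        print(attrs["title"], [t["normalized"] for t in tokens])
```

For the real file, passing an open file handle (for example `open("memo_all.vrt", encoding="utf-8")`) instead of `SAMPLE` streams the corpus line by line without loading it into memory.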

References

Bjerring-Hansen, Jens, et al. "Mending Fractured Texts. A heuristic procedure for correcting OCR data." (2022). https://ceur-ws.org/Vol-3232/paper14.pdf

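Data statement

  • Dataset title: MeMo Corpus
  • Dataset curators: Jens Bjerring-Hansen, University of Copenhagen; Philip Diderichsen, University of Copenhagen; Dorte Haltrup Hansen, University of Copenhagen
  • Dataset version: 1.1, August 15, 2023
  • Data statement authors: Jens Bjerring-Hansen, University of Copenhagen; Philip Diderichsen, University of Copenhagen
  • Data statement version: 1, September 25, 2023

Executive summary

The MeMo Corpus was established to investigate literary and cultural change in Scandinavian cultural and social history using natural language processing and other computational methods. The corpus consists of original novels by Norwegian and Danish authors printed in Denmark in the period 1870-99: 858 volumes, totaling 4.5 million sentences and 65 million words.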

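Text characteristics

The corpus consists of novels, i.e. long works of narrative fiction, usually written in prose and published as books. The novels contain both dialogue and description. As instances of imaginative literature they are infused with ambiguity, interpretational confounding, rhetorical sophistication, and narrative layerings between author, narrator, and characters.

From a genre perspective, the corpus includes contemporary novels, historical novels, and other forms of genre fiction such as romance, crime, and war stories. From an aesthetic perspective, it includes both avant-garde forms of realism, including instances of naturalism and impressionism, and more traditional prose that prefers abstract or generalized over concrete specification.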

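Curation rationale

The MeMo Corpus was created for the research project "Measuring Modernity: Literary and Social Change in Scandinavia 1870-1900", which investigates how the social changes of the period were reflected and discussed in its novels. Unlike traditional historiography, which has focused on selected texts by a few prominent male authors, the digital corpus, with rich metadata on texts and authors, makes it possible to capture literary and sociological trends and offers new insights into the processes of modernization in this formative period of Scandinavian literary and social history.

The corpus includes all original (newly written) novels by Danish and Norwegian authors published in Denmark 1870-99. The list of texts was compiled on the basis of Dansk Bogfortegnelse (a continuous list of books published in Denmark since 1841; published annually from 1861), supplemented with literary handbooks and special bibliographies.

Not included in the corpus, mainly for pragmatic reasons and for the sake of coherence:

  • Reprints
  • Translations
  • Serializations (i.e. serialized novels from newspapers and magazines)
  • Diasporic literature (i.e. novels by Danish emigrant authors in the U.S.)

Around 20% of the novels were written by female authors. Highlighting and exploring the often overlooked female literary production of the period is therefore a distinctive ambition of the corpus and of the explorations based on it.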

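Language varieties

The language of the novels in the corpus is late nineteenth century Danish (BCP-47: da). On the whole, this is a more or less linguistically coherent body of texts. However, the following circumstances must be noted:

  • The texts contain pronounced spelling variation, partly on an individual level and partly explained by an ongoing orthographic standardization, most clearly expressed in the Spelling Reform of 1892.
  • Some books are written in dialect (e.g. Jutlandic or West Norwegian) or contain dialectal features to create psychological individualism in the dialogue.
  • Approximately 16% of the books were written by Norwegian authors. Note that, until 1907, written Norwegian was practically identical to written Danish. "Norvagisms" (i.e. distinct Norwegian words not used by Danes) do appear.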

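Preprocessing and data formatting

OCR scans: The book volumes were scanned by the Royal Danish Library's Digitization on Demand team. The data were delivered as full volume PDF files with the OCR text as an invisible, searchable, copyable text layer, and as full volume and single page text files.

OCR correction: The text files were automatically post-corrected for OCR errors. This involved two different processes, one for texts originally set in Antikva (Roman) typefaces and one for texts set in Fraktur (Gothic) typefaces. The Antikva files were corrected using a set of hand-crafted substitution patterns with look-up in the dictionary Sprogteknologisk Ordbase, STO. The Fraktur files were corrected using a combination of spelling correction, hand-crafted pattern substitution, and improved OCR using the pretrained "Fraktur" Tesseract data plus an alternative OCR layer from the pretrained "dan" Tesseract data, which addressed problems with the Danish characters "æ" and "ø" in particular. This procedure reduced the word error rate of the Fraktur data from 10.46% to 2.84%.

Token-level annotation: The corrected data were annotated with grammatical information using the Text Tonsorium tool provided by the Danish CLARIN node. The pipeline used included the LaPos part of speech tagger, the CSTLemma lemmatizer, and an implementation of the Brill tagger. Grammatical information included lemma and part of speech, plus sentence and paragraph segmentation. In addition to the grammatical annotations, convenience annotations with various counters were added: word number in sentence, word number in line, word number in book, line number on page, page number.

Text normalization: After OCR correction, all texts were normalized to modern Danish spelling using hand-crafted substitution patterns and look-up in STO. Nouns were lower-cased, "aa" changed to "å", and frequent character patterns changed to follow modern Danish orthography.

VRT transformation: After annotation with token-level categories and metadata, the data were transformed to a VRT file (vertical format) for indexing in Corpus Workbench (CWB). Format: one token per line, delimited by <corpus>, <text>, and <sentence> XML elements. The XML elements contain attributes with metadata. The tokens carry the above-mentioned token-level annotations, separated by tabs.

The data are available as:

  • OCR-corrected full volume text files
  • Normalized full volume versions of these text files
  • A single VRT file containing the whole corpus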

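Limitations

A standard limitation of data preprocessed and annotated with automatic natural language processing tools and procedures is that the results are not perfect. Basically all layers of the data can therefore be assumed to be flawed:

  • Text data: The raw texts come from OCR scans of the physical book volumes. This process is not perfect, and although steps were taken to mitigate errors, the basic text layer can still be expected to have OCR errors (or wrong corrections) in 2-3% of tokens.
  • Normalized data: The normalization to modern Danish spelling should not be expected to be perfect either. There are currently no estimates of the error rate in the normalized data.
  • Grammatical annotations: These were also added with automatic tools that cannot be expected to yield perfect results. There are currently no estimates of error rates in the grammatical annotations.
  • Metadata: The metadata were hand-curated by literary scholars and should be close to perfect. However, the occasional human error cannot be ruled out.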

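Metadata

The metadata were curated with the help of students (Lasse Stein Holst, Lene Thanning Andersen, and Kirstine Nielsen Degn) on the basis of Dansk Bogfortegnelse (1861-), https://www.litteraturpriser.dk/, Ehrencron-Müller: Anonym- og Pseudonym-Lexikon (1940), and additional literary and bibliographical handbooks.

The metadata categories include: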

  • file_id
  • filename
  • [author] firstname
  • [author] surname
  • [author] pseudonym
  • [author] gender [m/f/unknown]
  • [author] nationality [da/no/unknown]
  • title
  • subtitle
  • volume
  • year [of publication]
  • pages [in total]
  • illustrations [y/n]
  • typeface [gothic/roman]
  • publisher
  • price
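Once the metadata rows are available as dictionaries, shares such as the proportion of novels by female authors can be computed directly. This is a sketch with made-up rows: the key name `gender` and the five toy records are assumptions, and reading the actual Excel file would require a spreadsheet library that is not shown here.

```python
from collections import Counter


def category_shares(rows, key="gender"):
    """Return the share of rows per value of `key`, treating missing values as 'unknown'."""
    counts = Counter((row.get(key) or "unknown") for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()} if total else {}


if __name__ == "__main__":
    # Made-up rows for illustration; not taken from the corpus metadata.
    toy_rows = [
        {"file_id": "a", "gender": "m"},
        {"file_id": "b", "gender": "f"},
        {"file_id": "c", "gender": "m"},
        {"file_id": "d", "gender": "m"},
        {"file_id": "e", "gender": None},
    ]
    print(category_shares(toy_rows))
```

The same helper works for other categorical fields in the list above, for example `key="nationality"` or `key="typeface"`, assuming the rows use those key names.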

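Disclosure and ethical review

Funding for the creation and curation is supplied by The Carlsberg Foundation through a Young Researcher Fellowship awarded to Jens Bjerring-Hansen, University of Copenhagen.

In terms of data management, the project data (novels from 1870-1900) consist of imaginative texts by non-living authors. The texts are out of copyright. From a GDPR perspective, the biographical, bibliographical, and demographic data are historical as well as non-sensitive.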

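Collected summary
Background and challenges
Background overview
MeMo Corpus v1.1 is a Danish literary dataset containing 558 novels from the "Modern Breakthrough" period of 1870-1899, intended for research on literary and social change in Scandinavia. The dataset provides the original texts, orthographically normalized versions, and annotations. The language is Danish, and the data have undergone OCR correction and metadata enrichment.
The above content was collected and summarized by 遇见数据集.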