jvdgoltz/dbnl.org-dutch-public-domain

Name: jvdgoltz/dbnl.org-dutch-public-domain
Creator: jvdgoltz
Published: 2024-02-09 12:59:59
License: no description available

Hugging Face, updated 2024-02-09, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/jvdgoltz/dbnl.org-dutch-public-domain

Resource description:

---
dataset_info:
  features:
  - name: meta
    struct:
    - name: 'Unnamed: 28'
      dtype: string
    - name: _jaar
      dtype: int64
    - name: achternaam
      dtype: string
    - name: bibliotheek
      dtype: string
    - name: categorie
      dtype: int64
    - name: chapter
      dtype: int64
    - name: druk
      dtype: string
    - name: edition
      dtype: string
    - name: geb_datum
      dtype: string
    - name: geb_land_code
      dtype: string
    - name: geb_plaats
      dtype: string
    - name: geb_plaats_code
      dtype: string
    - name: genre
      dtype: string
    - name: jaar
      dtype: string
    - name: jaar_geboren
      dtype: string
    - name: jaar_overlijden
      dtype: string
    - name: language
      dtype: string
    - name: maand
      dtype: string
    - name: overl_datum
      dtype: string
    - name: overl_land_code
      dtype: string
    - name: overl_plaats
      dtype: string
    - name: overl_plaats_code
      dtype: string
    - name: pers_id
      dtype: string
    - name: ppn_o
      dtype: string
    - name: revision_date
      dtype: string
    - name: section
      dtype: int64
    - name: text_url
      dtype: string
    - name: ti_id
      dtype: string
    - name: titel
      dtype: string
    - name: url
      dtype: string
    - name: vols
      dtype: string
    - name: voornaam
      dtype: string
    - name: voorvoegsel
      dtype: string
    - name: vrouw
      dtype: int64
  - name: text
    dtype: string
  - name: id
    dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: validation.parquet
task_categories:
- text-generation
- fill-mask
language:
- nl
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
license:
- cc0-1.0
---

# Dataset Card for "dbnl.org-dutch-public-domain"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [DBNL Public Domain Collection](https://www.dbnl.org/letterkunde/pd/index.php)
- **Point of Contact:** julian at vdgoltz.net

### Dataset Summary

This dataset comprises a collection of texts from the Dutch Literature in the public domain, specifically from the DBNL (Digitale Bibliotheek voor de Nederlandse Letteren) public domain collection. The collection includes books, poems, songs, and other documentation, letters, etc., that are at least 140 years old and thus free of copyright restrictions. Each entry in the dataset corresponds to one section of a chapter of a text, ensuring a granular level of detail for text analysis.

### Supported Tasks and Leaderboards

- Language Modeling
- Text Generation
- Other tasks that can benefit from historical Dutch texts

### Languages

The dataset is primarily in Dutch (nl).

## Dataset Structure

### Data Instances

A data instance corresponds to a section of a chapter of a document, including metadata such as title, author, publication year, and the text content itself.

### Data Fields

- `ti_id`: Unique text identifier
- `titel`: Title of the text
- `jaar`: Publication year
- `druk`: Edition
- `bibliotheek`: Library code
- `categorie`: Category ID
- `pers_id`: Person ID
- `voornaam`: Author's first name
- `achternaam`: Author's last name
- `url`: URL to the text
- `text_url`: URL to the text in .txt format
- `revision_date`: Date of the revision
- `edition`: Edition details
- `language`: Language of the text
- `chapter`: Chapter number
- `section`: Section number

### Data Splits

The dataset is split into training and validation sets at text level (90:10), ensuring that sections or chapters from the same document do not leak from one split to another.

## Dataset Creation

### Curation Rationale

The dataset was curated to make historical Dutch texts available for computational analysis, preserving cultural heritage and supporting research in the humanities and linguistic studies.

### Source Data

#### Initial Data Collection and Normalization

Data was collected from the DBNL's public domain collection, normalized, and structured to facilitate computational use.

#### Who are the source language producers?

The source language producers are authors of Dutch literature whose works have entered the public domain, implying their passing at least 70 years ago.

### Annotations

The dataset does not contain annotations.

### Personal and Sensitive Information

Given the historical nature of the texts, they are free from personal and sensitive information concerns in the contemporary sense. However, they reflect the social norms, biases, and cultural contexts of their time.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset serves as a valuable resource for understanding Dutch literary heritage, cultural history, and language evolution over time. It can support diverse research agendas in computational linguistics, cultural studies, and history.

### Discussion of Biases

The texts contain biases prevalent at their time of publication, including colonialism, racism, sexism, and other societal norms of their era. Users are urged to consider these contexts critically and use the data responsibly.

### Other Known Limitations

The dataset's historical nature means it may not be suitable for applications requiring contemporary language use or norms.

## Additional Information

### Dataset Curators

This dataset was curated by https://huggingface.co/jvdgoltz, who is not affiliated with DBNL.org and did not act on their behalf. The data is sourced from the DBNL public domain collection.

### Licensing Information

The texts in this dataset are in the public domain. According to Chat-GPT 4, the best fitting license would be: Creative Commons Zero v1.0 Universal, making them legally available for use, redistribution, and adaptation by anyone for any purpose.

### Citation Information

Not applicable.

Provider:

jvdgoltz

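Summary of original information

Dataset description

Dataset summary

This dataset is a collection of Dutch literary texts in the public domain, taken from the DBNL (Digital Library for Dutch Literature) public domain collection. The collection includes books, poems, songs, and other documents, letters, etc. that are at least 140 years old and therefore free of copyright restrictions. Each entry in the dataset corresponds to one section of a chapter of a text, which gives fine-grained units for text analysis.

Supported tasks and leaderboards

Language modeling
Text generation
Other tasks that can benefit from historical Dutch texts

Languages

The dataset is primarily in Dutch (nl).

Dataset structure

Data instances

A data instance corresponds to one section of a chapter of a document, including metadata such as title, author, and publication year, together with the text content itself.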
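Because each instance is only one section of a chapter, analyses at document level need the sections stitched back together. The following is a minimal sketch of that step, not the curator's own tooling. It assumes each record is a dict with a nested `meta` dict (as the card's feature list appears to declare) carrying `ti_id`, `chapter`, and `section`, plus a top-level `text`, and that sorting by (`chapter`, `section`) gives the reading order. The function name `reassemble_texts` and the sample records are hypothetical.

```python
from collections import defaultdict


def reassemble_texts(records):
    """Group section-level records by text and join them in reading order.

    Assumption: (chapter, section) reflects the reading order within a text.
    """
    by_text = defaultdict(list)
    for rec in records:
        meta = rec["meta"]
        by_text[meta["ti_id"]].append(
            (meta["chapter"], meta["section"], rec["text"])
        )
    # Sort each text's sections by (chapter, section), then join the bodies.
    return {
        ti_id: "\n\n".join(
            text for _, _, text in sorted(parts, key=lambda p: (p[0], p[1]))
        )
        for ti_id, parts in by_text.items()
    }


# Hypothetical records, invented for illustration only.
sample = [
    {"meta": {"ti_id": "demo01", "chapter": 2, "section": 1}, "text": "C"},
    {"meta": {"ti_id": "demo01", "chapter": 1, "section": 2}, "text": "B"},
    {"meta": {"ti_id": "demo01", "chapter": 1, "section": 1}, "text": "A"},
    {"meta": {"ti_id": "demo02", "chapter": 1, "section": 1}, "text": "X"},
]
documents = reassemble_texts(sample)
```

Here `documents` maps each `ti_id` to its sections joined by blank lines. Whether a blank line is the right separator depends on the downstream task.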

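Data fields

ti_id: Unique text identifier
titel: Title of the text
jaar: Publication year
druk: Edition
bibliotheek: Library code
categorie: Category ID
pers_id: Person ID
voornaam: Author's first name
achternaam: Author's last name
url: URL of the text
text_url: URL of the text in .txt format
revision_date: Revision date
edition: Edition details
language: Language of the text
chapter: Chapter number
section: Section number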
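The card's feature list appears to place these fields inside a `meta` struct, next to top-level `text` and `id` columns. Under that assumption, a small helper can flatten a record so the fields are directly addressable. This is a sketch only: `flatten_record`, `author_name`, and the sample record are hypothetical and not part of the dataset or any library.

```python
def flatten_record(record):
    """Return a flat dict: the `meta` fields plus top-level `id` and `text`.

    Assumption: metadata is nested under a `meta` key, as the card's
    feature list suggests.
    """
    flat = dict(record.get("meta") or {})
    flat["id"] = record.get("id")
    flat["text"] = record.get("text")
    return flat


def author_name(flat):
    """Join first and last name, skipping parts that are missing or empty."""
    parts = [flat.get("voornaam"), flat.get("achternaam")]
    return " ".join(p for p in parts if p)


# Hypothetical record, invented for illustration only.
record = {
    "id": "demo01_0001",
    "text": "Voorbeeldtekst.",
    "meta": {
        "ti_id": "demo01",
        "titel": "Voorbeeld",
        "jaar": "1850",  # declared as a string in the card's feature list
        "voornaam": "Jan",
        "achternaam": "Jansen",
        "chapter": 1,
        "section": 1,
    },
}
flat = flatten_record(record)
name = author_name(flat)
```

Note that `jaar` is declared as a string, so numeric filtering by year would need an explicit, failure-tolerant conversion.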

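Data splits

The dataset is split into training and validation sets at text level (90:10), so that chapters or sections of the same document do not leak from one split into the other.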
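A text-level split can be sketched by assigning whole `ti_id` groups, rather than individual sections, to one side or the other. The code below illustrates that idea; it is not the procedure the curator actually used, which the card does not describe. The function name `split_by_text`, the seed, and the synthetic records are assumptions.

```python
import random


def split_by_text(records, validation_fraction=0.1, seed=0):
    """Split records so that every section of a given text lands in one split."""
    # Split the unique text identifiers, not the individual sections.
    ti_ids = sorted({rec["meta"]["ti_id"] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(ti_ids)
    n_val = max(1, round(len(ti_ids) * validation_fraction))
    val_ids = set(ti_ids[:n_val])
    train = [r for r in records if r["meta"]["ti_id"] not in val_ids]
    validation = [r for r in records if r["meta"]["ti_id"] in val_ids]
    return train, validation


# Hypothetical data: 20 texts with 3 sections each.
records = [
    {"meta": {"ti_id": f"demo{t:02d}", "section": s}, "text": f"text {t}, section {s}"}
    for t in range(20)
    for s in range(1, 4)
]
train, validation = split_by_text(records)

# No text identifier should appear on both sides.
train_ids = {r["meta"]["ti_id"] for r in train}
val_ids = {r["meta"]["ti_id"] for r in validation}
```

Because the ratio is applied to texts and not to sections, the share of sections in each split only approximates 90:10 when texts differ in length.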

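Dataset creation

Curation rationale

The dataset was curated to make historical Dutch texts available for computational analysis, preserving cultural heritage and supporting research in the humanities and linguistics.

Source data

Initial data collection and normalization

The data was collected from the DBNL public domain collection, then normalized and structured to facilitate computational use.

Who are the source language producers?

The source language producers are authors of Dutch literature whose works have entered the public domain, which implies that they died at least 70 years ago.

Annotations

The dataset does not contain annotations.

Personal and sensitive information

Given the historical nature of the texts, they raise no personal or sensitive information concerns in the contemporary sense. They do, however, reflect the social norms, biases, and cultural contexts of their time.

Considerations for using the data

Social impact of the dataset

The dataset is a valuable resource for understanding Dutch literary heritage, cultural history, and the evolution of the language over time. It can support diverse research agendas in computational linguistics, cultural studies, and history.

Discussion of biases

The texts contain biases that were prevalent at their time of publication, including colonialism, racism, sexism, and other societal norms of their era. Users are urged to consider these contexts critically and to use the data responsibly.

Other known limitations

The dataset's historical nature means it may not be suitable for applications that require contemporary language use or norms.

Additional information

Dataset curators

This dataset was curated by https://huggingface.co/jvdgoltz, who is not affiliated with DBNL.org and did not act on their behalf. The data is sourced from the DBNL public domain collection.

Licensing information

The texts in this dataset are in the public domain. According to Chat-GPT 4, the best fitting license would be Creative Commons Zero v1.0 Universal, making them legally available for use, redistribution, and adaptation by anyone for any purpose.

Citation information

Not applicable.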
