---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: issn
    dtype: string
  - name: title
    dtype: string
  - name: fpage
    dtype: string
  - name: lpage
    dtype: string
  - name: year
    dtype: int64
  - name: volume
    dtype: int64
  - name: journal
    dtype: string
  - name: author
    dtype: string
  - name: type
    dtype: string
  - name: corpusBuild
    dtype: string
  - name: doiLink
    dtype: string
  - name: language
    dtype: string
  - name: jrnl
    dtype: string
  - name: decade
    dtype: int64
  - name: period
    dtype: int64
  - name: century
    dtype: int64
  - name: pages
    dtype: int64
  - name: sentences
    dtype: int64
  - name: tokens
    dtype: int64
  - name: visualizationLink
    dtype: string
  - name: doi
    dtype: string
  - name: jstorLink
    dtype: string
  - name: hasAbstract
    dtype: float64
  - name: isAbstractOf
    dtype: float64
  - name: primaryTopic
    dtype: string
  - name: primaryTopicPercentage
    dtype: float64
  - name: secondaryTopic
    dtype: string
  - name: secondaryTopicPercentage
    dtype: float64
  - name: category
    dtype: string
  - name: tsne_embedding
    sequence: float32
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 412915149
    num_examples: 17520
  download_size: 211087434
  dataset_size: 412915149
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-sa-4.0
language:
- en
tags:
- science
- royal_society
size_categories:
- 10K<n<100K
---
### Data Card for the Royal Society Corpus (RSC) Version 6.0 Open
#### General Information
- **Dataset Name**: Royal Society Corpus (RSC) 6.0 Open
- **Repository URL**: [Royal Society Corpus Access](https://fedora.clarin-d.uni-saarland.de/rsc_v6/)
- **Creator(s)**: Various authors contributing to the Philosophical Transactions of the Royal Society of London
- **Maintained by**: Saarland University
- **Dataset Version**: 6.0 Open
- **License**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
#### Dataset Description
##### Abstract
The RSC 6.0 Open covers more than 250 years of scientific publications from the *Philosophical Transactions of the Royal Society*, from the journal's inception in 1665 to 1920. It includes all types of publications, predominantly in English, and captures the evolution of scientific discourse over time.
##### Content Description
- **Content Type**: Text (Journal articles)
- **Volume**: Approximately 78.6 million tokens in 17,520 texts
- **Languages**: Primarily English
- **Temporal Coverage**: 1665 - 1920
- **Fields**: Titles, Authors, Publication Dates, Text Bodies, Text Types (e.g., article, abstract)
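
The metadata columns declared in the card header (`year`, `decade`, `tokens`, `type`) make it straightforward to summarize the corpus over time. The following is a minimal sketch, not an official utility: the three records are invented stand-ins for real corpus rows, the `tokens_per_decade` helper is hypothetical, and it assumes `decade` holds the first year of the decade (for example `1660`).

```python
from collections import defaultdict

# Invented sample rows; only the column names come from the dataset schema.
# Assumption: `decade` stores the first year of the decade as an integer.
records = [
    {"id": "sample-1", "year": 1665, "decade": 1660, "tokens": 1200, "type": "article"},
    {"id": "sample-2", "year": 1668, "decade": 1660, "tokens": 800, "type": "article"},
    {"id": "sample-3", "year": 1851, "decade": 1850, "tokens": 5000, "type": "abstract"},
]


def tokens_per_decade(rows, text_type=None):
    """Sum the `tokens` column per `decade`, optionally keeping one `type`."""
    totals = defaultdict(int)
    for row in rows:
        if text_type is not None and row["type"] != text_type:
            continue  # skip rows of other text types
        totals[row["decade"]] += row["tokens"]
    return dict(sorted(totals.items()))


print(tokens_per_decade(records))
print(tokens_per_decade(records, text_type="article"))
```

Because the helper only needs dict-like rows, the same function should also work when iterating over the real dataset rows.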
#### Data Quality
- **Data Source**: Digitized texts from the *Philosophical Transactions* of the Royal Society of London and other journals, provided by JSTOR in XML format
- **Integrity and Processing**: The texts were produced by OCR, subsequently corrected, and then enriched with linguistic annotation
#### Data Structure and Accessibility
- **Access**: The dataset is accessible for online search and can be downloaded in various formats including plain text and XML.
- **Query Tool**: The data can be queried through the CQPweb server hosted by Saarland University after free registration.
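
The card header also declares a `default` configuration with a single `train` split, so this copy can presumably be read with the Hugging Face `datasets` library. The sketch below is written under that assumption: `load_rsc` and `texts_between` are hypothetical helper names, and `repo_id` is a placeholder because this card does not state the Hub identifier. The filtering helper is demonstrated on invented rows so it can be tried without a download.

```python
def load_rsc(repo_id, split="train"):
    """Load the corpus from the Hugging Face Hub.

    `repo_id` is a placeholder: this card does not state the Hub identifier.
    """
    from datasets import load_dataset  # requires `pip install datasets`

    return load_dataset(repo_id, split=split)


def texts_between(rows, first_year, last_year):
    """Yield (id, year, text) for rows whose `year` lies in the closed range."""
    for row in rows:
        if first_year <= row["year"] <= last_year:
            yield row["id"], row["year"], row["text"]


# Invented rows; only the column names come from the dataset schema.
sample = [
    {"id": "sample-1", "year": 1700, "text": "An account of ..."},
    {"id": "sample-2", "year": 1900, "text": "On the theory of ..."},
]
print(list(texts_between(sample, 1665, 1750)))
```

With a real identifier, `texts_between(load_rsc(repo_id), 1665, 1750)` would iterate over the downloaded rows in the same way.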
#### Utilization and Citation
- **Use Cases**: Suitable for historical linguistics, for diachronic studies of scientific writing, and as training data for natural language processing applications focused on historical text.
- **Citation**: For publications using the dataset, please cite these papers:
```
@inproceedings{fischer2020royal,
title={The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
author={Fischer, Stefan and Knappen, J{\"o}rg and Menzel, Katrin and Teich, Elke},
booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
pages={794--802},
year={2020},
organization={European Language Resources Association},
url={https://www.aclweb.org/anthology/2020.lrec-1.99}
}
```
```
@inproceedings{kermes2016royal,
title={The Royal Society Corpus: From Uncharted Data to Corpus},
author={Kermes, Hannah and Degaetano-Ortlieb, Stefania and Khamis, Ashraf and Knappen, J{\"o}rg and Teich, Elke},
booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation},
pages={1928--1931},
year={2016},
organization={European Language Resources Association},
url={https://www.aclweb.org/anthology/L16-1305}
}
```
#### Additional Information
- **Support and Funding**: The development of the RSC was supported by the German Research Foundation (DFG), the Federal Ministry of Education and Research (BMBF), and the CLARIN-D infrastructure.
#### Summary
The Royal Society Corpus (RSC) 6.0 Open contains scientific publications from 1665 to 1920, primarily from the *Philosophical Transactions of the Royal Society of London*. The texts are predominantly in English and cover all publication types, which makes the corpus a record of how scientific discourse evolved over time. Each record provides the text body together with metadata such as title, author, publication date, and text type. The source material was digitized, provided by JSTOR in XML format, processed with OCR, corrected, and enriched with linguistic annotation. The corpus can be searched online and downloaded in several formats, including plain text and XML. It is suitable for historical linguistics, for diachronic studies of scientific writing, and as training data for natural language processing applications focused on historical text.