
badrex/royal_society_corpus_metadata

Hugging Face | Updated 2024-04-30 | Indexed 2024-07-22
Download link:
https://hf-mirror.com/datasets/badrex/royal_society_corpus_metadata
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: issn
    dtype: string
  - name: title
    dtype: string
  - name: fpage
    dtype: string
  - name: lpage
    dtype: string
  - name: year
    dtype: int64
  - name: volume
    dtype: int64
  - name: journal
    dtype: string
  - name: author
    dtype: string
  - name: type
    dtype: string
  - name: corpusBuild
    dtype: string
  - name: doiLink
    dtype: string
  - name: language
    dtype: string
  - name: jrnl
    dtype: string
  - name: decade
    dtype: int64
  - name: period
    dtype: int64
  - name: century
    dtype: int64
  - name: pages
    dtype: int64
  - name: sentences
    dtype: int64
  - name: tokens
    dtype: int64
  - name: visualizationLink
    dtype: string
  - name: doi
    dtype: string
  - name: jstorLink
    dtype: string
  - name: hasAbstract
    dtype: float64
  - name: isAbstractOf
    dtype: float64
  - name: primaryTopic
    dtype: string
  - name: primaryTopicPercentage
    dtype: float64
  - name: secondaryTopic
    dtype: string
  - name: secondaryTopicPercentage
    dtype: float64
  - name: category
    dtype: string
  - name: tsne_embedding
    sequence: float32
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 412915149
    num_examples: 17520
  download_size: 211087434
  dataset_size: 412915149
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc
language:
- en
tags:
- science
- royal_society
size_categories:
- 10K<n<100K
---

### Data Card for the Royal Society Corpus (RSC) Version 6.0 Open

#### General Information

- **Dataset Name**: Royal Society Corpus (RSC) 6.0 Open
- **Repository URL**: [Royal Society Corpus Access](https://fedora.clarin-d.uni-saarland.de/rsc_v6/)
- **Creator(s)**: Various authors contributing to the Philosophical Transactions of the Royal Society of London
- **Maintained by**: Saarland University
- **Dataset Version**: 6.0 Open
- **License**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

#### Dataset Description

##### Abstract

The RSC 6.0 encompasses over three centuries of scientific publications from the *Philosophical Transactions of the Royal Society*, ranging from its inception in 1665 to 1920. It includes all types of publications, predominantly in English, capturing the evolution of scientific discourse over time.

##### Content Description

- **Content Type**: Text (Journal articles)
- **Volume**: Approximately 78.6 million tokens
- **Languages**: Primarily English
- **Temporal Coverage**: 1665 - 1920
- **Fields**: Titles, Authors, Publication Dates, Text Bodies, Text Types (e.g., article, abstract)

#### Data Quality

- **Data Source**: Digitized texts from the Royal Society of London and other journals, provided by JSTOR in XML format
- **Integrity and Processing**: Texts have undergone OCR processing with subsequent corrections; further enriched through linguistic annotation

#### Data Structure and Accessibility

- **Access**: The dataset is accessible for online search and can be downloaded in various formats including plain text and XML.
- **Query Tool**: The data can be queried through the CQPweb server hosted by Saarland University after free registration.

#### Utilization and Citation

- **Use Cases**: Suitable for historical linguistics, diachronic studies of scientific writing, and training data for natural language processing applications focused on historical text.
- **Citation**: For publications using the dataset, please cite these papers:

```
@inproceedings{fischer2020royal,
  title={The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
  author={Fischer, Stefan and Knappen, J{\"o}rg and Menzel, Katrin and Teich, Elke},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={794--802},
  year={2020},
  organization={European Language Resources Association},
  url={https://www.aclweb.org/anthology/2020.lrec-1.99}
}
```

```
@inproceedings{kermes2016royal,
  title={The Royal Society Corpus: From Uncharted Data to Corpus},
  author={Kermes, Hannah and Degaetano-Ortlieb, Stefania and Khamis, Ashraf and Knappen, J{\"o}rg and Teich, Elke},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation},
  pages={1928--1931},
  year={2016},
  organization={European Language Resources Association},
  url={https://www.aclweb.org/anthology/L16-1305}
}
```

#### Additional Information

- **Support and Funding**: The development of the RSC was supported by the German Research Foundation (DFG), the Federal Ministry of Education and Research (BMBF), and the CLARIN-D infrastructure.

The Royal Society Corpus (RSC) 6.0 Open dataset encompasses scientific publications from 1665 to 1920, primarily from the Philosophical Transactions of the Royal Society of London. It includes various types of publications, predominantly in English, capturing the evolution of scientific discourse over time. The content consists of text (journal articles), titles, authors, publication dates, text bodies, and text types. The data source is digitized texts from the Royal Society of London and other journals, provided by JSTOR in XML format. The texts have undergone OCR processing with subsequent corrections and are further enriched through linguistic annotation. The dataset is accessible for online search and can be downloaded in various formats including plain text and XML. It is suitable for historical linguistics, diachronic studies of scientific writing, and training data for natural language processing applications focused on historical text.
Provided by:
badrex
Summary of original information

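Dataset overview

Dataset name

  • Name: Royal Society Corpus (RSC) 6.0 Open

Dataset description

  • Content type: Text (journal articles)
  • Languages: Primarily English
  • Temporal coverage: 1665 - 1920
  • Fields: Titles, authors, publication dates, text bodies, text types (e.g., article, abstract)
  • Volume: Approximately 78.6 million tokens

Data quality

  • Data source: Digitized texts from the Royal Society of London and other journals, provided by JSTOR in XML format
  • Processing: Texts have undergone OCR processing with subsequent corrections and are further enriched through linguistic annotation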

Data structure

  • Features:
    • id: string
    • issn: string
    • title: string
    • fpage: string
    • lpage: string
    • year: integer (int64)
    • volume: integer (int64)
    • journal: string
    • author: string
    • type: string
    • corpusBuild: string
    • doiLink: string
    • language: string
    • jrnl: string
    • decade: integer (int64)
    • period: integer (int64)
    • century: integer (int64)
    • pages: integer (int64)
    • sentences: integer (int64)
    • tokens: integer (int64)
    • visualizationLink: string
    • doi: string
    • jstorLink: string
    • hasAbstract: float (float64)
    • isAbstractOf: float (float64)
    • primaryTopic: string
    • primaryTopicPercentage: float (float64)
    • secondaryTopic: string
    • secondaryTopicPercentage: float (float64)
    • category: string
    • tsne_embedding: sequence of floats (float32)
    • text: string
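
The feature list above can be put to work roughly as follows. This is a minimal sketch, not an official loader: it assumes the third-party `datasets` package for the optional download step, and the helper names (`load_rsc_train`, `filter_rows`, `tokens_per_decade`) as well as the rows in `SAMPLE_ROWS` are hypothetical placeholders invented for illustration. In particular, the assumption that `decade` stores the first year of the decade (for example 1660 for 1665) is not confirmed by the card.

```python
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def load_rsc_train():
    """Load the train split from the Hugging Face Hub.

    Needs network access and the third-party `datasets` package,
    which is imported lazily so the rest of this sketch runs without it.
    """
    from datasets import load_dataset

    return load_dataset("badrex/royal_society_corpus_metadata", split="train")


def filter_rows(
    rows: Iterable[Row],
    first_year: int,
    last_year: int,
    doc_type: Optional[str] = None,
) -> List[Row]:
    """Keep rows with first_year <= year <= last_year and, optionally, a matching type."""
    selected = []
    for row in rows:
        if not first_year <= row["year"] <= last_year:
            continue
        if doc_type is not None and row["type"] != doc_type:
            continue
        selected.append(row)
    return selected


def tokens_per_decade(rows: Iterable[Row]) -> Dict[int, int]:
    """Sum the `tokens` column for each value of the `decade` column."""
    totals: Dict[int, int] = {}
    for row in rows:
        totals[row["decade"]] = totals.get(row["decade"], 0) + row["tokens"]
    return totals


# Hypothetical placeholder rows (NOT real corpus records). They only mimic
# a few of the columns listed above so the helpers can be tried offline.
SAMPLE_ROWS: List[Row] = [
    {"id": "demo-1", "year": 1665, "decade": 1660, "type": "article", "tokens": 1200},
    {"id": "demo-2", "year": 1668, "decade": 1660, "type": "abstract", "tokens": 300},
    {"id": "demo-3", "year": 1850, "decade": 1850, "type": "article", "tokens": 5000},
]

if __name__ == "__main__":
    early_articles = filter_rows(SAMPLE_ROWS, 1665, 1700, doc_type="article")
    print([row["id"] for row in early_articles])
    print(tokens_per_decade(SAMPLE_ROWS))
```

Both helpers accept any iterable of dict-like rows, so the same calls should also work on the object returned by `load_rsc_train()` once the real data is available.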

Data splits

  • Train split:
    • Number of examples: 17520
    • Size in bytes: 412915149

Dataset size

  • Download size: 211087434 bytes
  • Total size: 412915149 bytes
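
The raw byte counts are easier to judge once converted. The sketch below uses only the three numbers stated in the card and derives sizes in MiB, the ratio of download size to total size, and the average bytes per example. The constant and function names are illustrative, and the MiB unit (1024 × 1024 bytes) is a choice made here, not something the card specifies.

```python
# Figures copied from the dataset card's split and size metadata.
NUM_EXAMPLES = 17_520
DATASET_SIZE_BYTES = 412_915_149
DOWNLOAD_SIZE_BYTES = 211_087_434

BYTES_PER_MIB = 1024 * 1024


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / BYTES_PER_MIB


def summarize() -> dict:
    """Derive a few human-readable figures from the card's byte counts."""
    return {
        "dataset_mib": to_mib(DATASET_SIZE_BYTES),
        "download_mib": to_mib(DOWNLOAD_SIZE_BYTES),
        # Fraction of the total size that has to be downloaded.
        "download_ratio": DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES,
        # Mean size of one example, across all columns.
        "avg_bytes_per_example": DATASET_SIZE_BYTES / NUM_EXAMPLES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

The download is roughly half the total size, and an average example occupies a little over 23 KB, which is consistent with rows that carry a full article text alongside the metadata columns.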

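License

  • License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Use cases

  • Suitable for: historical linguistics, diachronic studies of scientific writing, and training data for natural language processing applications focused on historical text

Citation

  • Citation format: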

    @inproceedings{fischer2020royal,
      title={The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
      author={Fischer, Stefan and Knappen, J{\"o}rg and Menzel, Katrin and Teich, Elke},
      booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
      pages={794--802},
      year={2020},
      organization={European Language Resources Association},
      url={https://www.aclweb.org/anthology/2020.lrec-1.99}
    }

    @inproceedings{kermes2016royal,
      title={The Royal Society Corpus: From Uncharted Data to Corpus},
      author={Kermes, Hannah and Degaetano-Ortlieb, Stefania and Khamis, Ashraf and Knappen, J{\"o}rg and Teich, Elke},
      booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation},
      pages={1928--1931},
      year={2016},
      organization={European Language Resources Association},
      url={https://www.aclweb.org/anthology/L16-1305}
    }
