
NicholasSynovic/Victorian-Era-Authorship-Attribution

Source: Hugging Face. Updated: 2023-04-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/NicholasSynovic/Victorian-Era-Authorship-Attribution
Resource description:
---
language:
- en
pretty_name: Victorian Era Authorship Attribution Data Set
task_categories:
- text-classification
size_categories:
- 10K<n<100K
---

# Victorian Era Authorship Attribution Data Set

> GUNGOR, ABDULMECIT, Benchmarking Authorship Attribution Techniques Using Over A Thousand Books by Fifty Victorian Era Novelists, Purdue Master of Thesis, 2018-04

## NOTICE

This dataset was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) at [this link](https://archive.ics.uci.edu/ml/datasets/Victorian+Era+Authorship+Attribution).

The [description](#description) of this dataset was copied from the source's dataset card. However, I have applied Markdown styling to prettify it and make it easier to navigate.

## Description

> **Abstract**: To create the largest authorship attribution dataset, we extracted works of 50 well-known authors. To make the learning task non-exhaustive, the training set contains 45 authors whereas the testing set contains 50.

### Source

The texts were extracted from the GDELT database. The GDELT Project is an open platform for research and analysis of global society, and thus all datasets released by the GDELT Project are available for unlimited and unrestricted use for any academic, commercial, or governmental use of any kind without fee.

### Data Set Information

To decrease bias and create a reliable authorship attribution dataset, the following criteria were chosen to filter authors in the GDELT database:

* English-language authors
* authors with enough books available (at least 5)
* 19th century authors

With these criteria, 50 authors were selected and their books were queried through the BigQuery GDELT database.

The next task was cleaning the dataset, because the original raw form contained OCR reading problems. To achieve that, all books were first scanned to get the overall number of unique words and each word's frequency. While scanning the texts, the first 500 words and the last 500 words were removed to take out specific features such as the name of the author, the name of the book, and other word-specific features that could make the classification task easier. After this step, we chose the top 10,000 words that occurred in the text corpus of all 50 authors. Words not in the top 10,000 were removed while keeping the rest of the sentence structure intact.

Each book is split into text fragments of 1000 words. We separately maintained the author and book identification numbers for each fragment in different arrays. Text segments with fewer than 1000 words were filled with zeros to keep them in the dataset as well. 1000 words make approximately 2 pages of writing, which is long enough to extract a variety of features from the document.

Each instance in the training set consists of a text piece of 1000 words with an author id attached. In the testing set, there is only the text piece of 1000 words on which to do authorship attribution. The training data covers 45 authors and the testing data covers 50. 34% of the testing data comes from authors who do not appear in the training set.

### Attribute Information

Each instance consists of a 1000-word sequence taken from one of an author's books. In the training set, the author id is also provided.

### Relevant Papers

* E. Stamatatos, A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 2009.

## Citation Request

* `GUNGOR, ABDULMECIT, Benchmarking Authorship Attribution Techniques Using Over A Thousand Books by Fifty Victorian Era Novelists, Purdue Master of Thesis, 2018-04`
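
The cleaning steps described under Data Set Information (trim 500 words from each end, keep only the 10,000 most frequent words, split into 1000-word fragments, zero-fill short fragments) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the choice to count word frequencies after trimming, and the use of the integer `0` as the fill value are all assumptions made for illustration; each book is assumed to be already tokenized into a list of words.

```python
from collections import Counter


def trim(words, margin=500):
    """Drop the first and last `margin` words of a tokenized book."""
    if len(words) <= 2 * margin:
        return []
    return words[margin:len(words) - margin]


def build_vocabulary(books, vocab_size=10_000):
    """Return the `vocab_size` most frequent words across all books."""
    counts = Counter()
    for words in books:
        counts.update(words)
    return {word for word, _ in counts.most_common(vocab_size)}


def fragment(words, vocabulary, size=1000, pad=0):
    """Remove out-of-vocabulary words, then split into fixed-size fragments.

    The last fragment is filled with `pad` (assumed here to be 0) so that
    every fragment has exactly `size` entries.
    """
    kept = [w for w in words if w in vocabulary]
    fragments = []
    for start in range(0, len(kept), size):
        chunk = kept[start:start + size]
        chunk = chunk + [pad] * (size - len(chunk))
        fragments.append(chunk)
    return fragments


def preprocess(books, margin=500, vocab_size=10_000, size=1000):
    """Return (book_index, fragment) pairs for a list of tokenized books."""
    trimmed = [trim(words, margin) for words in books]
    vocabulary = build_vocabulary(trimmed, vocab_size)
    pairs = []
    for book_index, words in enumerate(trimmed):
        for piece in fragment(words, vocabulary, size):
            pairs.append((book_index, piece))
    return pairs
```

Keeping the book index next to each fragment mirrors the description's separate author and book identification arrays; an author id could be attached the same way from a hypothetical book-to-author mapping.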
Provided by:
NicholasSynovic
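Summary of the original information

Dataset overview

Dataset name

  • Name: Victorian Era Authorship Attribution Data Set
  • Language: English
  • Task category: text classification
  • Size category: 10K<n<100K

Dataset description

  • Abstract: This dataset aims to be the largest authorship attribution dataset and contains works by 50 well-known authors. The training set contains works by 45 authors, while the testing set contains works by 50 authors.

Dataset source

  • The data comes from the GDELT database, an open platform for research and analysis of global society. All of its datasets may be used without restriction for any academic, commercial, or governmental purpose.

Dataset information

  • To reduce bias and create a reliable authorship attribution dataset, the selection criteria were: authors writing in English, authors with at least 5 books, and 19th century authors.
  • The dataset was cleaned to remove OCR reading problems present in the original raw form.
  • Each instance contains a 1000-word text fragment extracted from an author's book; instances in the training set also include the author ID.

Attribute information

  • Each instance contains a 1000-word sequence extracted from an author's book. In the training set, each instance also includes the author ID.

Relevant papers

  • Stamatatos, E. (2009). A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology.

Citation request

  • GUNGOR, ABDULMECIT. Benchmarking Authorship Attribution Techniques Using Over A Thousand Books by Fifty Victorian Era Novelists, Purdue Master of Thesis, 2018-04.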