Mosab-Rezaei/19th-century-novelists
Source: Hugging Face. Updated 2025-11-17. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Mosab-Rezaei/19th-century-novelists
Resource description:
---
license: cc
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- Stylometry
- Writing Style
- Authorship Attribution
- Text Generation
- Style Evaluation
- Style Classification
- Explainable AI (XAI)
- Prompt-based Generation
- Large Language Models
- LLMs
pretty_name: 19th-century novelists' sentences
size_categories:
- 100K<n<1M
---
<h1 align="center">19th-century novelists' sentences</h1>
<p align="center">
<img src="Authors.jpg" alt="Dataset overview">
</p>
We built this five-author dataset from Project Gutenberg texts by five prominent 19th-century novelists: Charles Dickens, Mark Twain, Herman Melville, Jane Austen, and Louisa May Alcott. The selection covers both male and female authors and both British and American literary traditions, offering a diverse testbed for stylistic analysis. Sentences were segmented with the NLTK library; tokenization and word counts were obtained with Stanford CoreNLP (v4.5.7). The final dataset contains 115,471 sentences. Alongside the raw sentences, the release includes annotations extracted automatically with Stanford CoreNLP, including dependency relations, parse trees, and a range of low-level and high-level syntactic features. These additional layers support stylometric analysis as well as broader studies of syntactic and semantic phenomena.
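As a rough illustration of how the sentences might be used for a first stylometric look, the sketch below computes the mean sentence length per author. It rests on several assumptions not stated in this card: the column names `author` and `sentence` are hypothetical, the split name `train` is a guess, and whitespace splitting is only a crude stand-in for the Stanford CoreNLP tokenization used in the release. The toy rows are well-known opening lines used for demonstration, not samples drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean


def load_rows(repo_id="Mosab-Rezaei/19th-century-novelists", split="train"):
    """Fetch rows from the Hugging Face Hub.

    Needs network access and `pip install datasets`. The split name
    "train" is an assumption; check the dataset page for the real one.
    """
    from datasets import load_dataset  # lazy import so the rest runs offline

    return load_dataset(repo_id, split=split)


def mean_words_per_sentence(rows, author_key="author", text_key="sentence"):
    """Return the average whitespace-token count per author.

    `author_key` and `text_key` are hypothetical column names; adjust
    them to match the actual schema of the release.
    """
    lengths = defaultdict(list)
    for row in rows:
        lengths[row[author_key]].append(len(row[text_key].split()))
    return {author: mean(counts) for author, counts in lengths.items()}


# Illustrative rows only (not sampled from the dataset).
toy_rows = [
    {"author": "Herman Melville", "sentence": "Call me Ishmael."},
    {
        "author": "Jane Austen",
        "sentence": (
            "It is a truth universally acknowledged, that a single man in "
            "possession of a good fortune, must be in want of a wife."
        ),
    },
]
summary = mean_words_per_sentence(toy_rows)
print(summary)
```

Once the real column names are confirmed, the same helper could be applied to the full data with `mean_words_per_sentence(load_rows())`, passing the correct `author_key` and `text_key`.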
**If you use this dataset in your research, please cite the paper below in which it was introduced:**
**Paper:** "Generation, Evaluation, and Explanation of Novelists’ Styles with Single-Token Prompts"<br>
**GitHub:** https://github.com/mosabrezaei/Text-Generation-XAI<br>
**Cite:** <br>
@inproceedings{rezaei2025stylometry,<br>
title={Generation, Evaluation, and Explanation of Novelists’ Styles with Single-Token Prompts},<br>
author={Rezaei, Mosab and Rajaei Moghadam, Mina and Shaikh, Abdul Rahman and Alhoori, Hamed and Freedman, Reva},<br>
booktitle={ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)},<br>
doi={},<br>
pages={},<br>
month={},<br>
year={2025}}
Provider:
Mosab-Rezaei



