fiction-1b
Source: ModelScope community. Updated 2025-10-09; indexed 2025-09-20.
Download link: https://modelscope.cn/datasets/SaladTechnologies/fiction-1b
# Fiction 1B
More than 1B words of narrative fiction sourced from [Project Gutenberg](https://www.gutenberg.org/), [AO3](https://archiveofourown.org/), and [Internet Archive](https://archive.org/).
## Dataset Details
### Dataset Description
This dataset contains the text of roughly 20,000 works of narrative fiction from the above sources.
From the original full texts, a [genre classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier) was applied at the paragraph level to remove license text, metadata, and other content suspected not to be narrative prose.
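The paragraph-level filtering described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the classifier is passed in as a plain callable so the example runs without downloading the model, and `split_paragraphs`, `keep_narrative`, and the choice of `"Prose/Lyrical"` as the label to keep are all assumptions.

```python
import re
from typing import Callable, Iterable, List, Set

# Assumption: paragraphs with this predicted label count as narrative prose.
# The card does not say which labels were actually kept.
NARRATIVE_LABELS: Set[str] = {"Prose/Lyrical"}


def split_paragraphs(text: str) -> List[str]:
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def keep_narrative(
    paragraphs: Iterable[str],
    classify: Callable[[str], str],
    keep_labels: Set[str] = NARRATIVE_LABELS,
) -> List[str]:
    """Keep only paragraphs whose predicted label is in keep_labels.

    `classify` is a hypothetical stand-in for a call to the genre classifier;
    it takes one paragraph and returns one label string.
    """
    return [p for p in paragraphs if classify(p) in keep_labels]
```

In practice `classify` would wrap the linked model; any function that maps a paragraph to a label string fits this shape.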
#### Misc
- **Curated by:** Shawn Rushefsky - [🤗](https://huggingface.co/shawnrushefsky) | [github](https://github.com/shawnrushefsky)
- **Funded by:** [Salad Technologies](https://salad.com)
- **Language(s) (NLP):** English
- **License:** MIT
### Dataset Sources
More information about specific source documents can be found in `doc_index.csv`
- Project Gutenberg: 76.4%
- Archive of Our Own (AO3): 22.2%
- Internet Archive: 1.4%
## Uses
The dataset is intended to be used for training language models on the syntactic patterns of narrative fiction.
### Direct Use
- Fill-Mask training
- Text Generation training
- Research
### Out-of-Scope Use
- Applications outside of fiction
## Dataset Structure
`data.zip` contains a CSV file where each row contains the source, a document ID, paragraph index, approximately 500 words of text, and a word count for that section.
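As a rough sketch of how the archive might be read without unpacking it to disk, the following uses only the standard library. It assumes the CSV is UTF-8 and has a header row, neither of which is stated here; `iter_rows` is a hypothetical helper name, and the actual column names should be checked against the file.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator


def iter_rows(zip_path) -> Iterator[Dict[str, str]]:
    """Yield each row of the first CSV member in the archive as a dict.

    `zip_path` may be a filesystem path or a binary file-like object.
    Assumes a UTF-8 encoded CSV with a header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Pick the first member whose name ends in .csv.
        csv_name = next(n for n in archive.namelist() if n.lower().endswith(".csv"))
        with archive.open(csv_name) as raw:
            wrapped = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(wrapped)
```

Iterating lazily keeps memory use flat, which matters for a file holding more than a billion words; inspecting the keys of the first row shows the real column names.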
## Dataset Creation
### Curation Rationale
While much of this content is already present in extremely large web-scraped datasets, there is a scarcity of more approachable medium-sized datasets that focus specifically on narrative fiction.
Datasets such as [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) with trillions of tokens are not practical for the average developer to work with.
### Source Data
#### Data Collection and Processing
**Project Gutenberg**
Project Gutenberg hosts a [catalog CSV](https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv) that includes metadata such as title, author, and subjects.
I filtered based on the presence of fiction-related keywords in the Subjects column, and used a Python script to bulk download texts.
```python
# Keywords matched against the catalog's Subjects column
fiction_keywords = [
    'fiction', 'novel', 'stories', 'tale', 'adventure',
    'mystery', 'romance', 'fantasy', 'horror', 'detective',
    'science fiction', 'historical fiction', 'western',
    'thriller', 'suspense'
]
```
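One way the keyword filter could be applied to catalog rows is sketched below. The matching rule (case-insensitive substring match) and the helper names `is_fiction` and `fiction_rows` are assumptions; the original download script is not shown here.

```python
from typing import Dict, Iterable, Iterator, Optional, Sequence

FICTION_KEYWORDS = [
    'fiction', 'novel', 'stories', 'tale', 'adventure',
    'mystery', 'romance', 'fantasy', 'horror', 'detective',
    'science fiction', 'historical fiction', 'western',
    'thriller', 'suspense',
]


def is_fiction(subjects: Optional[str], keywords: Sequence[str] = FICTION_KEYWORDS) -> bool:
    """Return True if any keyword occurs in the subjects string (case-insensitive).

    Assumption: plain substring matching, which can overmatch
    (for example, 'tale' also matches 'stalemate').
    """
    lowered = (subjects or "").lower()
    return any(keyword in lowered for keyword in keywords)


def fiction_rows(
    rows: Iterable[Dict[str, Optional[str]]],
    subjects_column: str = "Subjects",
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield only the catalog rows whose subjects look like fiction."""
    for row in rows:
        if is_fiction(row.get(subjects_column)):
            yield row
```

The rows can come from `csv.DictReader` over the downloaded catalog file; whole-word matching would be stricter if the substring rule proves too loose.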
**AO3**
For AO3, I used the [ao3-api](https://pypi.org/project/ao3-api/) Python package to gradually paginate through the archive, filtering to English-language works with at least 15,000 words but fewer than 500,000, sorted by “Kudos”, a measure of user favor.
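The selection criteria can be expressed as a small filter. This sketch does not call ao3-api: `WorkMeta` is a hypothetical record type, and the filtering and sorting are done client-side purely for illustration, whereas the search itself may have applied them on the server.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_WORDS = 15_000   # inclusive lower bound ("at least 15,000 words")
MAX_WORDS = 500_000  # exclusive upper bound ("fewer than 500,000")


@dataclass
class WorkMeta:
    """Hypothetical metadata record; not a type provided by ao3-api."""
    work_id: int
    language: str
    words: int
    kudos: int


def select_works(works: Iterable[WorkMeta]) -> List[WorkMeta]:
    """Keep English works within the word bounds, most kudos first."""
    eligible = [
        w for w in works
        if w.language == "English" and MIN_WORDS <= w.words < MAX_WORDS
    ]
    return sorted(eligible, key=lambda w: w.kudos, reverse=True)
```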
**Internet Archive**
For Internet Archive, I used their search endpoint, and a significant amount of keyword filtering.
Ultimately I did not get much content from this source due to licensing restrictions.
#### Who are the source data producers?
Professional and amateur writers of long-form narrative fiction in the English language over the last few hundred years.
#### Personal and Sensitive Information
This dataset contains only works of fiction.
## Bias, Risks, and Limitations
The source text comes from a diverse set of English-language narrative fiction spanning hundreds of years of authorship, and may include subject matter and phrasing that some readers find offensive.
Much of the Project Gutenberg material is old enough that white men writing before the civil rights movement are vastly overrepresented among its authors.
Additionally, contemporary commercial fiction is all but excluded due to licensing restrictions.
### Recommendations
Use at your own risk.
Provider: maas
Created: 2025-09-16



