---
license: apache-2.0
language:
- zh
tags:
- novel
- training
task_categories:
- text-classification
- text-generation
pretty_name: CNNovel125K
size_categories:
- 100K<n<1M
duplicated_from: RyokoAI/CNNovel125K
---
# Dataset Card for CNNovel125K
*The BigKnow2022 dataset and its subsets are not yet complete. Some information here may be inaccurate or inaccessible.*
## Dataset Description
- **Homepage:** (TODO)
- **Repository:** <https://github.com/RyokoAI/BigKnow2022>
- **Paper:** N/A
- **Leaderboard:** N/A
- **Point of Contact:** Ronsor/undeleted <ronsor@ronsor.com>
### Dataset Summary
CNNovel125K is a dataset composed of approximately 125,000 novels downloaded from the Chinese novel hosting site <http://ibiquw.com>.
### Supported Tasks and Leaderboards
This dataset is primarily intended for unsupervised training of text generation models; however, it may be useful for other purposes.
* text-classification
* text-generation
### Languages
* Simplified Chinese
## Dataset Structure
### Data Instances
```json
{
  "text": "\n------------\n\n全部章节\n\n\n------------\n\n第一章 她肯定做梦呢!\n\n HT国际大酒店总统套房。\n\n 清晨的第一缕阳光照射进圣地亚哥地板上,洒落在凌乱的床单上,突然地,床上睡的正熟的人睁开眼睛,猛然惊醒!\n\n ...",
  "meta": {
    "subset": "cnnovel.ibiquw",
    "id": "100067",
    "q": 0.9,
    "lang": "zh_cn",
    "title": "为爱入局:嫁给秦先生",
    "author": "奥德萨"
  }
}
{
  "text": "\n------------\n\n全部章节\n\n\n------------\n\n第1章:出狱就大婚\n\n 凉城第一监狱,大门缓缓打开,秦峰仰起头,贪婪的呼吸了一口空气。\n\n 三年了,终于又闻到了自由的味道。\n\n 他回过头,看着目送他出来的那群人道:...",
  "meta": {
    "subset": "cnnovel.ibiquw",
    "id": "100059",
    "q": 0.9,
    "lang": "zh_cn",
    "title": "绝世弃婿",
    "author": "绷带怪"
  }
}
```
### Data Fields
* `text`: the full novel text, all chapters
* `meta`: entry metadata
  * `subset`: dataset tag, always `cnnovel.ibiquw`
  * `id`: novel ID
  * `q`: quality score, fixed at 0.9
  * `lang`: always `zh_cn` (Simplified Chinese)
  * `title`: novel title
  * `author`: novel author
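
As a rough illustration of the fields listed above, the following sketch parses one record and checks that it has the documented shape. It assumes the records are stored one JSON object per line, which this card does not confirm; `SAMPLE_LINE` and `parse_record` are hypothetical names introduced here, and the sample text is shortened.

```python
import json

# Hypothetical sample line mirroring the structure documented above.
# The novel text is shortened; storage as JSON Lines is an assumption.
SAMPLE_LINE = json.dumps(
    {
        "text": "\n------------\n\n全部章节\n\n\n------------\n\n...",
        "meta": {
            "subset": "cnnovel.ibiquw",
            "id": "100067",
            "q": 0.9,
            "lang": "zh_cn",
            "title": "为爱入局:嫁给秦先生",
            "author": "奥德萨",
        },
    },
    ensure_ascii=False,
)

# Metadata keys listed in the Data Fields section.
META_KEYS = {"subset", "id", "q", "lang", "title", "author"}


def parse_record(line: str) -> dict:
    """Parse one JSON line and verify it carries the documented fields."""
    record = json.loads(line)
    if not isinstance(record.get("text"), str):
        raise ValueError("record is missing a string 'text' field")
    meta = record.get("meta")
    if not isinstance(meta, dict):
        raise ValueError("record is missing a 'meta' object")
    missing = META_KEYS - meta.keys()
    if missing:
        raise ValueError(f"meta is missing keys: {sorted(missing)}")
    return record


record = parse_record(SAMPLE_LINE)
print(record["meta"]["title"], record["meta"]["author"], len(record["text"]))
```

Note that `id` is a string in the samples shown, so compare it as text rather than as an integer.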
### Data Splits
No splitting of the data was performed.
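
Since the dataset ships without splits, users who need a held-out set have to create one themselves. One possible approach, sketched below, assigns each novel to a split by hashing its `id`, so the assignment stays stable across runs and machines. The function name `assign_split` and the 5% validation fraction are illustrative choices, not part of the dataset.

```python
import hashlib


def assign_split(novel_id: str, validation_fraction: float = 0.05) -> str:
    """Deterministically map a novel ID to 'train' or 'validation'."""
    if not 0.0 <= validation_fraction <= 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    # Hash the ID and scale the first 8 bytes to a float in [0, 1).
    digest = hashlib.sha256(novel_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


# IDs taken from the sample records in this card.
for novel_id in ["100067", "100059"]:
    print(novel_id, assign_split(novel_id))
```

Splitting by novel rather than by text chunk keeps every chapter of a given novel in the same split, which avoids leaking near-identical passages into the validation set.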
## Dataset Creation
### Curation Rationale
TODO
### Source Data
#### Initial Data Collection and Normalization
TODO
#### Who are the source language producers?
The authors of each novel.
### Annotations
#### Annotation process
Titles were collected alongside the novel text and IDs.
#### Who are the annotators?
There were no human annotators.
### Personal and Sensitive Information
The dataset contains only works of fiction, and we do not believe it contains any PII.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset is intended for anyone who wishes to train a model to generate "more entertaining" content in Chinese.
Depending on the language model, it may also be useful for other languages.
### Discussion of Biases
This dataset is composed of fictional works by various authors. Because of this fact, the contents of this dataset will reflect
the biases of those authors. Beware of stereotypes.
### Other Known Limitations
N/A
## Additional Information
### Dataset Curators
Ronsor Labs
### Licensing Information
Apache 2.0, for all parts of which Ronsor Labs or the Ryoko AI Production Committee may be considered authors. All other material is
distributed under fair use principles.
### Citation Information
```
@misc{ryokoai2023-bigknow2022,
title = {BigKnow2022: Bringing Language Models Up to Speed},
author = {Ronsor},
year = {2023},
howpublished = {\url{https://github.com/RyokoAI/BigKnow2022}},
}
```
### Contributions
Thanks to @ronsor (GH) for gathering this dataset.