recursal/Recursalberg
Hugging Face · Updated 2024-06-13 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/recursal/Recursalberg
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license:
- cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
configs:
- config_name: default
  data_files:
  - split: final
    path: jsonl/*
pretty_name: Recursalberg
---
# Dataset Card for Recursalberg

*Waifu to catch your attention.*
## Dataset Details
### Dataset Description
*Recursalberg* is a cleaned dataset of Project Gutenberg books. We downloaded all Gutenberg books that were publicly available at the time and processed them.
After filtering, the dataset totals **~5.32B** tokens (llama-2-7b-chat-tokenizer) / **~4.83B** tokens (RWKV Tokenizer) of primarily English text.
- **Curated by:** KaraKaraWitch
- **Funded by:** Recursal.ai (I work there lol)
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** Primarily English
- **License:** cc-by-sa-4.0
### Dataset Sources
- **Source Data:** [gutenberg.org (see mirroring)](https://gutenberg.org/help/mirroring.html) (rclone download)
### Processing
We performed the following downloading and processing steps to prepare Recursalberg.
1. Use Gutenberg's mirror (via rclone) to download the books to a folder.
2. Index Gutenberg with `gutenberg_index.py` (gather all items to find HTML documents).
3. Process every indexed book that has an HTML version.
   - Remove the Gutenberg pre HTML block, page numbers, HTML comments, and the table of contents.
   - Convert sections to markdown.
   - Clean newlines and standardize punctuation.
4. Save each HTML file into one JSONL file.
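
The cleaning in steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the function names `html_to_markdown` and `to_jsonl_line` are hypothetical, "pre HTML block" is assumed to mean `<pre>` elements, and page numbers are assumed to be marked up as `<span class="pagenum">`. Table-of-contents removal is left out, and a real pipeline would more likely use an HTML parser than regular expressions.

```python
import html
import json
import re

# All patterns below are assumptions about the markup, not taken from gutenberg_index.py.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
PRE_RE = re.compile(r"<pre\b.*?</pre>", re.DOTALL | re.IGNORECASE)
PAGENUM_RE = re.compile(
    r'<span[^>]*class="pagenum"[^>]*>.*?</span>', re.DOTALL | re.IGNORECASE
)
HEADING_RE = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.DOTALL | re.IGNORECASE)
PARA_RE = re.compile(r"</p\s*>|<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
# Map curly quotes to their ASCII equivalents.
PUNCT = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def html_to_markdown(page: str) -> str:
    """Strip comments, <pre> blocks and page numbers, then emit markdown-like text."""
    page = COMMENT_RE.sub("", page)
    page = PRE_RE.sub("", page)
    page = PAGENUM_RE.sub("", page)

    def heading(match: re.Match) -> str:
        # <h2>Title</h2> becomes "## Title" on its own paragraph.
        title = TAG_RE.sub("", match.group(2)).strip()
        return f"\n\n{'#' * int(match.group(1))} {title}\n\n"

    page = HEADING_RE.sub(heading, page)
    page = PARA_RE.sub("\n\n", page)           # paragraph ends become blank lines
    text = html.unescape(TAG_RE.sub("", page))  # drop remaining tags, decode entities
    text = text.translate(PUNCT)                # standardize punctuation
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)        # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)      # at most one blank line in a row
    return text.strip()


def to_jsonl_line(text: str) -> str:
    """Serialize one book as a single JSONL record with a `text` key."""
    return json.dumps({"text": text}, ensure_ascii=False)
```

Calling `to_jsonl_line(html_to_markdown(page))` on the contents of one HTML file would give the single line to write to that book's JSONL file.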
### Data Keys
```
text (str): the book's text. converted to markdown.
```
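
As a rough sketch of how records with this single key could be read from a local copy of the `jsonl/` folder, using only the standard library. The helper name `iter_texts` and the `.jsonl` file extension are assumptions; the card itself only specifies the `jsonl/*` path pattern.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(jsonl_dir: str) -> Iterator[str]:
    """Yield the `text` field of every record in every *.jsonl file in jsonl_dir."""
    # Assumption: files carry a .jsonl extension; adjust the glob if they do not.
    for path in sorted(Path(jsonl_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines between records
                    yield json.loads(line)["text"]
```

With the Hugging Face `datasets` library installed, `load_dataset("recursal/Recursalberg", split="final")` should expose the same `text` field, since the card's config defines a single split named `final`.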
## Recursal's Vision
> To make AI accessible to everyone, regardless of language or economic status

This is the collective goal of the `RWKV Open Source foundation` and `Recursal AI`, the commercial entity that backs it.
We believe that AI should not be controlled by a select few organizations, and that it should be accessible whether you are rich or poor, and whether or not you are a native speaker of English.
### About RWKV
RWKV is an open source, non-profit group under the Linux Foundation, focused on developing the RWKV AI architecture in accordance with our vision.
The RWKV architecture scales efficiently and economically. As an RNN and Transformer hybrid, it provides performance similar to leading transformer models while retaining the compute and energy efficiency of an RNN-based architecture.
You can find out more about the project and the latest models at the following links:
- [https://blog.rwkv.com](https://blog.rwkv.com)
- [https://wiki.rwkv.com](https://wiki.rwkv.com)
### About Recursal AI
Recursal AI is the commercial entity built to support RWKV model development and its users, while providing commercial services via its public cloud or private-cloud / on-premise offerings.
As part of our vision, we are committed to ensuring open source development of, and access to, the best foundational AI models and datasets.
The datasets and models provided here are part of that commitment.
You can find out more about Recursal AI here:
- [https://recursal.ai](https://recursal.ai)
- [https://blog.recursal.ai](https://blog.recursal.ai)
### Dataset Curators
KaraKaraWitch. (I typically hang out in the PygmalionAI Discord, sometimes EleutherAI. If something is wrong, ping `@karakarawitch` on Discord.)
I'd be happy if you could spread the word and recommend this dataset for your use cases `:)`
### Licensing Information
Complicated. Refer to Gutenberg's license [here](https://www.gutenberg.org/policy/license.html).
We haven't seen any in-copyright books in our dataset so far, so we assume it is safe to use unless shown otherwise.
The post-processed data produced by Recursal is licensed as CC-BY-SA.
Recursal Waifus (the banner image) are licensed under CC-BY-SA.
They do not represent the related websites in any official capacity unless otherwise announced by the website.
You may use them as a banner image. However, you must always link back to the dataset.
### Citation Information
```
@ONLINE{recursalberg,
title = {Recursalberg},
author = {KaraKaraWitch, recursal.ai},
year = {2024},
howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}
```
Provider:
recursal