recursal/Recursalberg

Name: recursal/Recursalberg
Creator: recursal
Published: 2024-06-13 01:28:17
License: No description available

Hugging Face · Updated 2024-06-13 · Indexed 2024-06-29

Download link:

https://hf-mirror.com/datasets/recursal/Recursalberg

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license:
- cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
configs:
- config_name: default
  data_files:
  - split: final
    path: jsonl/*
pretty_name: Recursalberg
---

# Dataset Card for Recursalberg

![](Recursalberg.png "Clara is a dedicated volunteer and digital archivist. Inspired by her ancestor, she is committed to making literature accessible to everyone. With a background in library science and a deep appreciation for cultural works, Clara spends her days digitizing rare books and proofreading texts. Her character design reflects her role as a guardian of knowledge, bridging the gap between the historical significance of printed books and the accessibility of digital formats. Clara's serene and knowledgeable presence makes her a relatable and inspiring figure for readers and volunteers alike.")

*Waifu to catch your attention.*

## Dataset Details

### Dataset Description

*Recursalberg* is a cleaned dataset of Project Gutenberg books. We downloaded all the publicly available Gutenberg books at the time and processed them. After filtering, the dataset totals **~5.32B** tokens (llama-2-7b-chat-tokenizer) / **~4.83B** tokens (RWKV Tokenizer), primarily in English.

- **Curated by:** KaraKaraWitch
- **Funded by:** Recursal.ai (I work there lol)
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** Primarily English
- **License:** cc-by-sa-4.0

### Dataset Sources

- **Source Data:** [gutenberg.org (see mirroring)](https://gutenberg.org/help/mirroring.html) (rclone download)

### Processing

We performed the following downloading and processing steps to prepare Recursalberg.

1. Use Gutenberg's rclone mirror to download the collection to a folder.
2. Index Gutenberg with `gutenberg_index.py` (gather all items to find HTML documents).
3. Process every indexed book that has an HTML version:
   - Remove the Gutenberg pre HTML block, page numbers, HTML comments, and table of contents.
   - Convert sections to markdown.
   - Clean up newlines and standardize punctuation.
4. Save each HTML file as one JSONL file.

### Data Keys

```
text (str): the book's text, converted to markdown.
```

## Recursal's Vision

> To make AI accessible to everyone, regardless of language or economic status

This is the collective goal of the `RWKV Open Source foundation` and `Recursal AI`, the commercial entity that backs it.

We believe that AI should not be controlled by a select few organizations, and that it should be accessible whether you are rich or poor, and whether or not you are a native speaker of English.

### About RWKV

RWKV is an open source, non-profit group under the Linux Foundation, focused on developing the RWKV AI architecture in accordance with our vision.

The RWKV architecture scales efficiently and economically. As an RNN & Transformer hybrid, it provides performance similar to leading transformer models while keeping the compute and energy efficiency of an RNN-based architecture.

You can find out more about the project and the latest models at the following:

- [https://blog.rwkv.com](https://blog.rwkv.com)
- [https://wiki.rwkv.com](https://wiki.rwkv.com)

### About Recursal AI

Recursal AI is the commercial entity built to support RWKV model development and users, while providing commercial services via its public cloud or private-cloud / on-premise offerings.

As part of our vision, our commitment is to ensure open source development of, and access to, the best foundational AI models and datasets.

The datasets and models provided here are part of that commitment.

You can find out more about Recursal AI here:

- [https://recursal.ai](https://recursal.ai)
- [https://blog.recursal.ai](https://blog.recursal.ai)

### Dataset Curators

KaraKaraWitch. (I typically hang out in the PygmalionAI discord, sometimes EleutherAI. If something is wrong, `@karakarawitch` on discord.)

I'd be happy if you could spread the word and recommend this dataset for your use cases `:)`

### Licensing Information

Complicated. Refer to Gutenberg's license [here](https://www.gutenberg.org/policy/license.html). We haven't seen any in-copyright books in our dataset so far, so we assume it's safe to use unless stated otherwise.

The post-processed data produced by Recursal is licensed as CC-BY-SA.

Recursal Waifus (the banner image) are licensed under CC-BY-SA. They do not represent the related websites in any official capacity unless otherwise announced by the website. You may use them as a banner image. However, you must always link back to the dataset.

### Citation Information

```
@ONLINE{recursalberg,
  title         = {Recursalberg},
  author        = {KaraKaraWitch, recursal.ai},
  year          = {2024},
  howpublished  = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}
```

Provider:

recursal

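Summary of original information

Dataset overview

Dataset description

Recursalberg is a cleaned dataset of Project Gutenberg books. It covers all Gutenberg books that were publicly available at the time of download, after processing. After filtering, the total token count is **~5.32B** (llama-2-7b-chat-tokenizer) or **~4.83B** (RWKV Tokenizer), primarily in English.

Curated by: KaraKaraWitch
Funded by: Recursal.ai
Shared by: KaraKaraWitch
Language: Primarily English
License: cc-by-sa-4.0

Dataset sources

Source data: gutenberg.org (see mirroring) (downloaded with rclone)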

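Processing steps

1. Use Gutenberg's rclone mirror to download the collection to a folder.
2. Index Gutenberg with gutenberg_index.py (gather all items to find HTML documents).
3. Process every indexed book that has an HTML version:
- Remove the Gutenberg pre HTML block, page numbers, HTML comments, and table of contents
- Convert sections to markdown
- Clean up newlines and standardize punctuation
4. Save each HTML file as one JSONL file.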
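
The cleaning step can be illustrated with a minimal Python sketch. This is not the curator's actual script, which is not shown on this page. The function name `clean_html_to_markdown`, the `pagenum` class used to spot page numbers, and the choice of which punctuation to normalize are all assumptions made for illustration. Removal of the Gutenberg pre HTML block and the table of contents is left out because their markup is not described here.

```python
import html
import re

# Assumption: "standardize punctuation" is illustrated here as mapping curly
# quotes to straight quotes; the real rules are not documented on this page.
PUNCT_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean_html_to_markdown(html_text: str) -> str:
    """Hypothetical helper: turn a simple HTML book fragment into markdown."""
    # Drop HTML comments.
    text = re.sub(r"<!--.*?-->", "", html_text, flags=re.DOTALL)

    # Drop page-number markers. The "pagenum" class is an assumed convention.
    text = re.sub(
        r'<span[^>]*class="pagenum"[^>]*>.*?</span>',
        "",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )

    # Convert <h1>..<h6> section headings to markdown headings.
    def heading(match: re.Match) -> str:
        level = int(match.group(1))
        title = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        return "\n\n" + "#" * level + " " + title + "\n\n"

    text = re.sub(
        r"<h([1-6])[^>]*>(.*?)</h\1>",
        heading,
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )

    # Paragraph ends become blank lines, then remaining tags are stripped.
    text = re.sub(r"</p\s*>", "\n\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)

    # Standardize punctuation and clean up newlines.
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Regular expressions keep the sketch dependency-free, but they do not handle nested or malformed markup; a real pipeline would more likely use an HTML parser.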

Data keys

text (str): the book's text, converted to markdown.
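
Since each record carries a single `text` key, reading a local copy of the files needs only the standard library. The sketch below assumes the `jsonl/*` files have been downloaded into a local directory and end in `.jsonl`; the function name `iter_book_texts` and the directory argument are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_book_texts(jsonl_dir: str) -> Iterator[str]:
    """Yield the `text` field of every record in jsonl_dir/*.jsonl."""
    for path in sorted(Path(jsonl_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                yield record["text"]
```

Yielding one book at a time keeps memory use flat, which matters for a corpus of this size.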

Licensing information

Source data license: see Gutenberg's license (https://www.gutenberg.org/policy/license.html)
Processed data license: CC-BY-SA

Citation information

@ONLINE{recursalberg,
  title         = {Recursalberg},
  author        = {KaraKaraWitch, recursal.ai},
  year          = {2024},
  howpublished  = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}

The content above was collected and summarized by 遇见数据集.
