recursal/LectureGratuits
Source: Hugging Face. Updated 2024-06-13; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/recursal/LectureGratuits
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
pretty_name: Lecture Gratuits
configs:
- config_name: default
  data_files:
  - split: final
    path: data/*
---
# Dataset Card for LectureGratuits

*Waifu to catch your attention.*
## Dataset Details
### Dataset Description
*LectureGratuits* is a cleaned dataset of [*Ebooks Gratuits*](https://www.ebooksgratuits.com/) books. We downloaded all the ebooks that were publicly available at the time and processed them.
After filtering, the dataset totals **~265.46M** tokens (llama-2-7b-chat-tokenizer) / **~253.51M** tokens (RWKV Tokenizer) of primarily English-language text.
- **Curated by:** Darok
- **Funded by:** Recursal.ai
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** English
- **License:** Public domain
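
A minimal loading sketch, assuming the Hugging Face `datasets` package is installed and the repository is reachable. The repository ID and the `final` split come from this card's metadata; the helper names `load_lecture_gratuits` and `preview` are hypothetical and exist only for illustration.

```python
def load_lecture_gratuits(streaming: bool = True):
    """Load the `final` split declared in the card's `configs` metadata."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole dataset up front.
    return load_dataset("recursal/LectureGratuits", split="final", streaming=streaming)


def preview(records, n: int = 3):
    """Return (title, first 80 characters of text) for the first n records."""
    out = []
    for i, rec in enumerate(records):
        if i >= n:
            break
        # `text` and `meta` are the keys listed under "Data Keys".
        out.append((rec["meta"].get("title"), rec["text"][:80]))
    return out
```

Calling `preview(load_lecture_gratuits())` should then give a quick look at the first few books; this needs network access, so it is left out of the block itself.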
### Dataset Sources
- **Source Data:** [ebooksgratuits.com](https://www.ebooksgratuits.com)
### Processing
KaraKaraWitch doesn't have specifics on how the data was processed. We have postulated the following workflow:
0. Get the highest ID
1. Enumerate and download all the epub files: `https://www.ebooksgratuits.com/newsendbook.php?id=<ID>&format=epub`
2. Put them in a folder called `books`
3. Extract the content of each book to a JSON file in the `output` folder. (See filtering steps in `extract-text.py`)
4. Combine into a single file.
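
The postulated workflow could be sketched roughly as follows. This is not the original tooling: the URL pattern and the `books` / `output` folder names come from the steps above, while the function names, the `<ID>.epub` file naming, the assumption that IDs start at 1, the skip-on-error behaviour, and the JSON Lines output format are all assumptions made for illustration. Step 3 (`extract-text.py`) is not reproduced here because its filtering logic is not described.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "https://www.ebooksgratuits.com/newsendbook.php"


def epub_url(book_id: int) -> str:
    """Build the step 1 download URL for a numeric book ID."""
    return f"{BASE_URL}?id={book_id}&format=epub"


def download_books(highest_id: int, books_dir: str = "books") -> None:
    """Steps 0-2: try every ID up to the highest one and save the EPUB files."""
    out = Path(books_dir)
    out.mkdir(parents=True, exist_ok=True)
    for book_id in range(1, highest_id + 1):  # starting at 1 is an assumption
        target = out / f"{book_id}.epub"  # file naming is an assumption
        try:
            with urllib.request.urlopen(epub_url(book_id), timeout=30) as resp:
                target.write_bytes(resp.read())
        except OSError:
            # Missing IDs and network errors are skipped (assumed behaviour).
            continue


def combine_json(output_dir: str = "output", combined: str = "combined.jsonl") -> int:
    """Step 4: merge the per-book JSON files into one JSON Lines file."""
    count = 0
    with open(combined, "w", encoding="utf-8") as sink:
        for path in sorted(Path(output_dir).glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            # ensure_ascii=False keeps accented characters readable.
            sink.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A polite crawler would also add a delay between requests; that is omitted here to keep the sketch short.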
### Data Keys
```
text (str): The book's text. Converted to markdown.
meta (dict): A dictionary of metadata with the following keys:
- title
- author
- publisher
```
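
As an illustration of that layout, the sketch below checks a record against the keys listed above. The `check_record` helper and the placeholder values in `example` are hypothetical; real records may carry additional or empty metadata fields.

```python
from typing import List

REQUIRED_META_KEYS = ("title", "author", "publisher")


def check_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not isinstance(record.get("text"), str):
        problems.append("text must be a str")
    meta = record.get("meta")
    if not isinstance(meta, dict):
        problems.append("meta must be a dict")
    else:
        for key in REQUIRED_META_KEYS:
            if key not in meta:
                problems.append(f"meta is missing '{key}'")
    return problems


# Placeholder values only; not taken from the dataset.
example = {
    "text": "# Example Title\n\nBody of the book in markdown.",
    "meta": {
        "title": "Example Title",
        "author": "Example Author",
        "publisher": "Example Publisher",
    },
}
```

Running `check_record(example)` returns an empty list, while a record without `meta` yields a single problem entry.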
### Dataset Curators
This dataset was mainly Darok's work. I (KaraKaraWitch) only assisted them with questions and the writing of the dataset card.
### Licensing Information
The books themselves are in the public domain. The post-processed data produced by Recursal is licensed under CC-BY-SA.
Recursal Waifus (The banner image) are licensed under CC-BY-SA.
They do not represent the related websites in any official capacity unless otherwise announced by the website.
You may use them as a banner image. However, you must always link back to the dataset.
### Citation Information
```
@ONLINE{lecturegratuits,
title = {LectureGratuits},
author = {Darok, KaraKaraWitch, recursal.ai},
year = {2024},
howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}
```
Provider:
recursal



