recursal/LectureGratuits

Name: recursal/LectureGratuits
Creator: recursal
Published: 2024-06-13 01:26:42
License: No description available

Hugging Face. Updated 2024-06-13. Indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/recursal/LectureGratuits

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
pretty_name: Lecture Gratuits
configs:
- config_name: default
  data_files:
  - split: final
    path: data/*
---

# Dataset Card for LectureGratuits

![](Recursalberg.png "Clara makes her return. See recursal/Recursalberg for her description!")

*Waifu to catch your attention.*

## Dataset Details

### Dataset Description

*LectureGratuits* is a cleaned dataset of [*Ebooks Gratuits*](https://www.ebooksgratuits.com/) books. We downloaded all the ebooks that were publicly available at the time and processed them. After filtering, the dataset totals **~265.46M** tokens (llama-2-7b-chat-tokenizer) / **~253.51M** tokens (RWKV Tokenizer), primarily in English.

- **Curated by:** Darok
- **Funded by:** Recursal.ai
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** English
- **License:** Public domain

### Dataset Sources

- **Source Data:** [ebooksgratuits.com](https://www.ebooksgratuits.com)

### Processing

KaraKaraWitch doesn't have specifics on how the data was processed. We have postulated the following workflow:

0. Get the highest ID.
1. Enumerate and download all the epub files: `https://www.ebooksgratuits.com/newsendbook.php?id=<ID>&format=epub`
2. Put them in a folder called `books`.
3. Extract the content to JSON files in the `output` folder. (See the filtering steps in `extract-text.py`.)
4. Combine into a single file.

### Data Keys

```
text (str): The book's text. Converted to markdown.
meta (dict): A dictionary of metadata with the following keys:
- title
- author
- publisher
```

### Dataset Curators

This dataset was mainly Darok's work. I (KaraKaraWitch) only assisted them with questions and the writing of the dataset card.

### Licensing Information

The books themselves are in the public domain. The post-processed data produced by Recursal is licensed as CC-BY-SA.

Recursal Waifus (the banner image) are licensed under CC-BY-SA. They do not represent the related websites in any official capacity unless otherwise announced by the website. You may use them as a banner image. However, you must always link back to the dataset.

### Citation Information

```
@ONLINE{lecturegratuits,
  title = {LectureGratuits},
  author = {Darok, KaraKaraWitch, recursal.ai},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}
```
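The card's front matter declares a single `default` config with one split named `final`. A minimal sketch of reading it, assuming the third-party `datasets` package is installed and the Hub is reachable; the helper names `stream_lecture_gratuits` and `preview` are hypothetical and not part of the dataset:

```python
def stream_lecture_gratuits():
    """Return a streaming iterator over the `final` split.

    Requires the third-party `datasets` package and network access.
    """
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("recursal/LectureGratuits", split="final", streaming=True)


def preview(records, limit=3):
    """Collect (title, first 80 characters of text) for the first `limit` records."""
    rows = []
    for i, record in enumerate(records):
        if i >= limit:
            break
        rows.append((record["meta"].get("title"), record["text"][:80]))
    return rows


# Usage (needs network): preview(stream_lecture_gratuits())
```

Streaming avoids downloading the whole corpus before looking at the first few books; dropping `streaming=True` would download and cache the full split instead.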

Provider:

recursal

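Summary of original information

LectureGratuits dataset overview

Dataset description

Name: LectureGratuits
Source: ebooksgratuits.com
Language: English
Total tokens: approx. 265.46M (llama-2-7b-chat-tokenizer) / approx. 253.51M (RWKV Tokenizer)
License: Public domain
License for processed data: CC-BY-SA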

Dataset sources

Source data: ebooksgratuits.com

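Data processing

Get the highest ID
Enumerate and download all epub files: https://www.ebooksgratuits.com/newsendbook.php?id=<ID>&format=epub
Put the files in a folder named books
Extract the content to JSON files in the output folder (see the filtering steps in extract-text.py)
Combine into a single file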
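The enumeration and combine steps above can be sketched as follows. This is an illustration under stated assumptions, not the curators' actual script: the card itself says the exact processing is not known. The starting ID of 1, the JSON Lines output format, and the function names `epub_urls` and `combine_json_files` are all assumptions. Only the URL pattern comes from the card.

```python
import json
from pathlib import Path

# URL pattern quoted in the processing notes; the ID is the numeric book ID.
EPUB_URL = "https://www.ebooksgratuits.com/newsendbook.php?id={book_id}&format=epub"


def epub_urls(highest_id, start_id=1):
    """Yield (book_id, url) for every ID from start_id to highest_id inclusive.

    start_id=1 is an assumption; the card does not state the lowest ID.
    """
    for book_id in range(start_id, highest_id + 1):
        yield book_id, EPUB_URL.format(book_id=book_id)


def combine_json_files(output_dir, combined_path):
    """Merge every *.json file in output_dir into one JSON Lines file.

    Returns the number of records written. JSON Lines is an assumption;
    the card only says the files are combined "into a single file".
    """
    count = 0
    with open(combined_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(output_dir).glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing the combined file outside the `output` folder, or with a `.jsonl` suffix, keeps it from being picked up by the `*.json` glob on a later run.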

Data keys

text (str): The book's text, converted to Markdown
meta (dict): A metadata dictionary with the following keys:
- title
- author
- publisher
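A record with the keys listed above might be checked like this. It is a minimal sketch: the helper name `missing_fields` is hypothetical, the example record is made up for illustration, and the card does not say whether every book actually carries all three metadata keys.

```python
REQUIRED_META_KEYS = ("title", "author", "publisher")


def missing_fields(record):
    """Return a list of problems found in one record; empty if it matches the layout."""
    problems = []
    if not isinstance(record.get("text"), str):
        problems.append("text is missing or not a str")
    meta = record.get("meta")
    if not isinstance(meta, dict):
        problems.append("meta is missing or not a dict")
    else:
        for key in REQUIRED_META_KEYS:
            if key not in meta:
                problems.append(f"meta.{key} is missing")
    return problems


# Made-up record for illustration only; not taken from the dataset.
example = {
    "text": "# Chapter 1\n\nSome text.",
    "meta": {
        "title": "Example Title",
        "author": "Example Author",
        "publisher": "Example Publisher",
    },
}
print(missing_fields(example))  # []
```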

Dataset contributors

Main contributor: Darok
Assistant: KaraKaraWitch

Licensing information

Books: Public domain
Processed data: CC-BY-SA

Citation information

@ONLINE{lecturegratuits, title = {LectureGratuits}, author = {Darok, KaraKaraWitch, recursal.ai}, year = {2024}, howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}}, }
