
recursal/LectureGratuits

Hugging Face · Updated 2024-06-13 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/recursal/LectureGratuits
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
source_datasets:
- original
language:
- en
pretty_name: Lecture Gratuits
configs:
- config_name: default
  data_files:
  - split: final
    path: data/*
---

# Dataset Card for LectureGratuits

![](Recursalberg.png "Clara makes her return. See recursal/Recursalberg for her description!")

*Waifu to catch your attention.*

## Dataset Details

### Dataset Description

*LectureGratuits* is a cleaned dataset of [*Ebooks Gratuits*](https://www.ebooksgratuits.com/) books. We downloaded all the ebooks that were publicly available at the time and processed them. After filtering, the dataset totals **~265.46M** tokens (llama-2-7b-chat-tokenizer) / **~253.51M** tokens (RWKV Tokenizer), primarily in English.

- **Curated by:** Darok
- **Funded by:** Recursal.ai
- **Shared by:** KaraKaraWitch
- **Language(s) (NLP):** English
- **License:** Public domain

### Dataset Sources

- **Source Data:** [ebooksgratuits.com](https://www.ebooksgratuits.com)

### Processing

KaraKaraWitch doesn't have specifics on how the data was processed. We have postulated the following workflow:

0. Get the highest ID.
1. Enumerate and download all the epub files: `https://www.ebooksgratuits.com/newsendbook.php?id=<ID>&format=epub`
2. Put them in a folder called `books`.
3. Extract the content of each book to its own JSON file in the `output` folder. (See the filtering steps in `extract-text.py`.)
4. Combine the JSON files into a single file.

### Data Keys

```
text (str): The book's text. Converted to markdown.
meta (dict): A dictionary of metadata with the following keys:
  - title
  - author
  - publisher
```

### Dataset Curators

This dataset was mainly Darok's work. I (KaraKaraWitch) only assisted them with questions and the writing of the dataset card.

### Licensing Information

The books themselves are in the public domain. The post-processed data produced by Recursal is licensed under CC-BY-SA.

Recursal Waifus (the banner image) are licensed under CC-BY-SA. They do not represent the related websites in any official capacity unless otherwise announced by the website. You may use them as a banner image. However, you must always link back to the dataset.

### Citation Information

```
@ONLINE{lecturegratuits,
  title = {LectureGratuits},
  author = {Darok, KaraKaraWitch, recursal.ai},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}},
}
```
Provided by:
recursal
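Summary of original information

LectureGratuits dataset overview

Dataset description

  • Name: LectureGratuits
  • Source: ebooksgratuits.com
  • Language: English
  • Total tokens: about 265.46M (llama-2-7b-chat-tokenizer) / about 253.51M (RWKV Tokenizer)
  • License: Public domain
  • License for the processed data: CC-BY-SA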

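Dataset sources

  • Source data: ebooksgratuits.com

Data processing

  1. Get the highest ID
  2. Enumerate and download all the epub files: https://www.ebooksgratuits.com/newsendbook.php?id=<ID>&format=epub
  3. Put the files in a folder called books
  4. Extract the content of each book to its own json file in the output folder (see the filtering steps in extract-text.py)
  5. Combine them into a single file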
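
The workflow above can be sketched in Python. This is only an illustration under stated assumptions, not the curators' actual script: the function names are hypothetical, book IDs are assumed to run contiguously from 1 to the highest ID, the combined file is assumed to be JSON Lines, and the filtering done by `extract-text.py` is not reproduced.

```python
import json
import urllib.error
import urllib.request
from pathlib import Path

# URL pattern quoted in the dataset card; <ID> becomes {book_id}.
BASE_URL = "https://www.ebooksgratuits.com/newsendbook.php?id={book_id}&format=epub"


def epub_urls(highest_id):
    """Yield (book_id, url) pairs. Assumes IDs start at 1 and are contiguous."""
    for book_id in range(1, highest_id + 1):
        yield book_id, BASE_URL.format(book_id=book_id)


def download_all(highest_id, books_dir):
    """Download every epub into books_dir, skipping IDs that fail (hypothetical helper)."""
    books_dir.mkdir(parents=True, exist_ok=True)
    for book_id, url in epub_urls(highest_id):
        target = books_dir / f"{book_id}.epub"
        try:
            urllib.request.urlretrieve(url, str(target))
        except urllib.error.URLError:
            continue  # missing or unreachable ID; move on to the next one


def combine_json(output_dir, combined_path):
    """Merge the per-book JSON files into one JSON Lines file; return the record count."""
    count = 0
    with combined_path.open("w", encoding="utf-8") as out:
        for json_file in sorted(output_dir.glob("*.json")):
            record = json.loads(json_file.read_text(encoding="utf-8"))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `combine_json(Path("output"), Path("combined.jsonl"))` would merge whatever the extraction step wrote to `output`. Nothing runs on import, so the download step only touches the network when `download_all` is called explicitly.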

Data keys

  • text (str): The book's text, converted to markdown
  • meta (dict): A dictionary of metadata with the following keys:
    • title
    • author
    • publisher
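
As a rough illustration of the record layout described above, the following sketch checks one record against the documented keys. The function name and the sample values are hypothetical and not taken from the dataset; the card does not state the types of the `meta` values, so only their presence is checked.

```python
from typing import Any

# Metadata keys listed in the dataset card.
META_KEYS = ("title", "author", "publisher")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record matches the layout."""
    problems = []
    if not isinstance(record.get("text"), str):
        problems.append("text must be a str")
    meta = record.get("meta")
    if not isinstance(meta, dict):
        problems.append("meta must be a dict")
    else:
        for key in META_KEYS:
            if key not in meta:
                problems.append(f"meta is missing '{key}'")
    return problems


# Made-up sample record, for illustration only.
sample = {
    "text": "# Chapter 1\n\nExample text.",
    "meta": {
        "title": "Example Title",
        "author": "Example Author",
        "publisher": "Example Publisher",
    },
}
print(validate_record(sample))  # prints []
```

The card's configuration names a single split, `final`, so records loaded from the published dataset would be expected to pass the same check.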

Dataset contributors

  • Main contributor: Darok
  • Assisted by: KaraKaraWitch

Licensing information

  • Books: Public domain
  • Processed data: CC-BY-SA

Citation information

@ONLINE{lecturegratuits, title = {LectureGratuits}, author = {Darok, KaraKaraWitch, recursal.ai}, year = {2024}, howpublished = {\url{https://huggingface.co/datasets/recursal/Recursalberg}}, }
