theKingslee/9ja-bookcorpus

Name: theKingslee/9ja-bookcorpus
Creator: theKingslee
Published: 2026-03-26 13:31:55
License: apache-2.0

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/theKingslee/9ja-bookcorpus

Description:

---
language: en
license: apache-2.0
pretty_name: 160 Books from Nigerian Authors
tags:
- text
- ocr
- language-modeling
- english
- pdf-scrape
- nigerian
- 9ja-context
short_description: A corpus of text data scraped from 160 PDFs from Nigerian Authors, cleaned and refined.
long_description: '
  This dataset contains text extracted from a collection of PDF documents.
  It has undergone several refinement steps including:
  - Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).
  - Language detection to filter for English text.
  - Spell correction using NLTK''s English word corpus.
  This dataset is intended for use in language modeling, text generation, and other NLP tasks.
  '
---

# Dataset Card for 160 Books from Nigerian Authors

This dataset contains text extracted from a collection of 160 Nigerian-authored books.

- **Curated by:** Abdullahi Mujaheed, Ayeni Oluwatosin, Nworie Kingsley
- **Language(s) (NLP):** en
- **License:** apache-2.0

## Dataset Structure

The dataset consists of a single "train" split. Each example in the dataset contains a single feature:

- **`text`**: A `string` value representing a single line of raw text extracted from the PDF documents. Each line is treated as an independent example in the dataset.

There are no predefined validation or test splits; the entire loaded corpus is available in the "train" split.

### Dataset Sources [optional]

- **Repository:** [Build-an-llm](https://github.com/thekingslee/build-an-llm)
- **Paper [optional]:** _coming soon_
- **Demo [optional]:** _coming soon_

### Source Data

_[to be updated soon]_

#### Data Collection and Processing

The text has undergone several refinement steps, including:

- Chunking
- Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).
- Language detection to filter for English text.
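
The basic sanity checks listed above could look roughly like the following. This is only a minimal sketch, not the curators' actual pipeline: the function names (`looks_like_clean_text`, `filter_lines`) and all thresholds are assumptions chosen for illustration, and the chunking and language-detection steps are left out.

```python
def looks_like_clean_text(line, min_alpha_ratio=0.7, max_avg_word_len=12.0, min_words=3):
    """Heuristic sanity check for one extracted line.

    All thresholds are illustrative assumptions, not values taken from the dataset card.
    """
    stripped = line.strip()
    if not stripped:
        return False

    words = stripped.split()
    if len(words) < min_words:
        return False

    # Share of alphabetic characters among all non-whitespace characters.
    non_space = [c for c in stripped if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)

    # Very long average word length usually signals OCR run-ons or noise.
    avg_word_len = sum(len(w) for w in words) / len(words)

    return alpha_ratio >= min_alpha_ratio and avg_word_len <= max_avg_word_len


def filter_lines(lines):
    """Keep lines that pass the check, shaped like the dataset's single `text` feature."""
    return [{"text": line.strip()} for line in lines if looks_like_clean_text(line)]


if __name__ == "__main__":
    sample = [
        "The river ran quietly past the old market square.",
        "3.14 | 0042 -- 77 ## 19 $$",
        "Page 12",
    ]
    for row in filter_lines(sample):
        print(row["text"])
```

Each surviving line becomes one `{"text": ...}` record, which mirrors the one-line-per-example structure described in the card.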

## Dataset Card Authors

Nworie Kingsley

## Dataset Card Contact

**Maintainer(s):**

- **Email:** me@thekingslee.com
- **Hugging Face Profile (optional):** [@theKingslee](https://huggingface.co/theKingslee)
- **GitHub Profile (optional):** [@theKingslee](https://github.com/theKingslee)

PS: This is for educational purposes only.

Provided by:

theKingslee
