theKingslee/9ja-bookcorpus
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/theKingslee/9ja-bookcorpus
Resource description:
---
language: en
license: apache-2.0
pretty_name: 160 Books from Nigerian Authors
tags:
- text
- ocr
- language-modeling
- english
- pdf-scrape
- nigerian
- 9ja-context
short_description: A corpus of text data scraped from 160 PDFs from Nigerian Authors,
  cleaned and refined.
long_description: '
  This dataset contains text extracted from a collection of PDF documents.
  It has undergone several refinement steps including:
  - Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).
  - Language detection to filter for English text.
  - Spell correction using NLTK''s English word corpus.
  This dataset is intended for use in language modeling, text generation, and other
  NLP tasks.
  '
---
# Dataset Card for 160 Books from Nigerian Authors
<!-- Provide a quick summary of the dataset. -->
This dataset contains text extracted from a collection of 160 books by Nigerian authors.
<!-- Provide a longer summary of what this dataset is. -->
- **Curated by:** Abdullahi Mujaheed, Ayeni Oluwatosin, Nworie Kingsley
- **Language(s) (NLP):** en
- **License:** apache-2.0
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
The dataset consists of a single "train" split. Each example in the dataset contains a single feature:
- **`text`**: A `string` value representing a single line of raw text extracted from the PDF documents. Each line is treated as an independent example in the dataset.
There are no predefined validation or test splits; the entire loaded corpus is available in the "train" split.
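The structure described above can be explored with a minimal sketch like the one below. It assumes the Hugging Face `datasets` library is installed and that the repository ID matches the one in the download link; the helper names `load_corpus` and `first_lines` are hypothetical and chosen only for illustration.

```python
# Repository ID taken from the download link on this card.
DATASET_ID = "theKingslee/9ja-bookcorpus"


def load_corpus(split: str = "train"):
    """Load the corpus; the card states that only a "train" split exists."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def first_lines(dataset, n: int = 3) -> list[str]:
    """Return the `text` field of the first n examples (one line of text each)."""
    return [example["text"] for _, example in zip(range(n), dataset)]


# Usage (requires network access):
#   corpus = load_corpus()
#   print(first_lines(corpus))
```

Because there are no predefined validation or test splits, any held-out set has to be carved out of "train" by the user, for example with the `train_test_split` method that `datasets.Dataset` provides.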
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Repository:** [Build-an-llm](https://github.com/thekingslee/build-an-llm)
- **Paper:** _coming soon_
- **Demo:** _coming soon_
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
_To be updated soon._
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
The extracted text has undergone several refinement steps, including:
- Chunking
- Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).
- Language detection to filter for English text.
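The basic sanity checks listed above might look roughly like the following sketch. It is an illustration only, not the curators' actual pipeline: the function names and every threshold (`min_alpha_ratio`, `max_word_len`, `max_avg_len`) are assumptions, and the chunking and language-detection steps are omitted.

```python
def is_mostly_alphabetic(line: str, min_alpha_ratio: float = 0.7) -> bool:
    """True if letters make up at least min_alpha_ratio of the non-space characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    alpha_count = sum(c.isalpha() for c in chars)
    return alpha_count / len(chars) >= min_alpha_ratio


def has_reasonable_word_lengths(
    line: str, max_word_len: int = 25, max_avg_len: float = 12.0
) -> bool:
    """True if no word is implausibly long and the average word length is modest."""
    words = line.split()
    if not words:
        return False
    if max(len(w) for w in words) > max_word_len:
        return False
    return sum(len(w) for w in words) / len(words) <= max_avg_len


def passes_sanity_checks(line: str) -> bool:
    """Combine both checks; thresholds are illustrative assumptions."""
    return is_mostly_alphabetic(line) and has_reasonable_word_lengths(line)
```

A filter of this kind would typically be applied line by line, keeping only the lines for which `passes_sanity_checks` returns `True`, before any language detection is run.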
## Dataset Card Authors
Nworie Kingsley
## Dataset Card Contact
**Maintainer(s):**
- **Email:** me@thekingslee.com
- **Hugging Face Profile:** [@theKingslee](https://huggingface.co/theKingslee)
- **GitHub Profile:** [@theKingslee](https://github.com/theKingslee)
PS: This dataset is for educational purposes only.
Provider:
theKingslee



