theKingslee/9ja-bookcorpus

Hugging Face | Updated 2026-03-26 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/theKingslee/9ja-bookcorpus
Resource description:
---
language: en
license: apache-2.0
pretty_name: 160 Books from Nigerian Authors
tags:
- text
- ocr
- language-modeling
- english
- pdf-scrape
- nigerian
- 9ja-context
short_description: A corpus of text data scraped from 160 PDFs from Nigerian Authors, cleaned and refined.
long_description: '
  This dataset contains text extracted from a collection of PDF documents. It has undergone several refinement steps including:

  - Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).

  - Language detection to filter for English text.

  - Spell correction using NLTK''s English word corpus.

  This dataset is intended for use in language modeling, text generation, and other NLP tasks.
  '
---

# Dataset Card for 160 Books from Nigerian Authors

This dataset contains text extracted from a collection of 160 books by Nigerian authors.

- **Curated by:** Abdullahi Mujaheed, Ayeni Oluwatosin, Nworie Kingsley
- **Language(s) (NLP):** en
- **License:** apache-2.0

## Dataset Structure

The dataset consists of a single "train" split. Each example contains a single feature:

- **`text`**: A `string` value holding a single line of raw text extracted from the PDF documents. Each line is treated as an independent example.

There are no predefined validation or test splits; the entire corpus is available in the "train" split.

### Dataset Sources

- **Repository:** [Build-an-llm](https://github.com/thekingslee/build-an-llm)
- **Paper:** _coming soon_
- **Demo:** _coming soon_

### Source Data

_[to be updated soon]_

#### Data Collection and Processing

The text has undergone several refinement steps, including:

- Chunking
- Basic sanity checks (e.g., mostly alphabetic characters, reasonable word lengths).
- Language detection to filter for English text.

## Dataset Card Authors

Nworie Kingsley

## Dataset Card Contact

**Maintainer(s):**

- **Email:** me@thekingslee.com
- **Hugging Face Profile:** [@theKingslee](https://huggingface.co/theKingslee)
- **GitHub Profile:** [@theKingslee](https://github.com/theKingslee)

PS: This is for educational purposes only.
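The card describes a single "train" split with one `text` feature per line, and names "basic sanity checks" (mostly alphabetic characters, reasonable word lengths) as a refinement step. The sketch below shows one way such a check might look and how the split could be loaded with the `datasets` library. It is not the curators' actual pipeline: the function names `looks_like_clean_text` and `load_corpus`, and all thresholds, are assumptions chosen for illustration.

```python
def looks_like_clean_text(
    line: str,
    min_alpha_ratio: float = 0.8,    # assumed threshold, not from the card
    max_mean_word_len: float = 12.0, # assumed threshold, not from the card
    min_words: int = 3,              # assumed threshold, not from the card
) -> bool:
    """Heuristic sanity check for one extracted line of text."""
    stripped = line.strip()
    words = stripped.split()
    if len(words) < min_words:
        return False

    # Share of non-whitespace characters that are letters.
    non_space = [ch for ch in stripped if not ch.isspace()]
    alpha_ratio = sum(ch.isalpha() for ch in non_space) / len(non_space)

    # Very long average "words" usually indicate OCR or extraction debris.
    mean_word_len = sum(len(w) for w in words) / len(words)

    return alpha_ratio >= min_alpha_ratio and mean_word_len <= max_mean_word_len


def load_corpus():
    """Load the single "train" split (requires `datasets` and network access)."""
    # Imported lazily so the filter above stays usable without the package.
    from datasets import load_dataset

    return load_dataset("theKingslee/9ja-bookcorpus", split="train")
```

With the corpus loaded, the check could be applied as `load_corpus().filter(lambda ex: looks_like_clean_text(ex["text"]))`. Since the published lines have already been filtered, this is mainly useful for an additional, stricter pass or for inspecting what a given threshold would discard.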
Provider:
theKingslee