ghananlpcommunity/pristine-twi
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/pristine-twi
Resource description:
---
language:
- tw
license: cc-by-4.0
task_categories:
- text-generation
- fill-mask
pretty_name: Pristine Twi
tags:
- twi
- akan
- ghana
- nlp
- africa
---
# Pristine Twi Dataset
A large-scale Twi dataset of clean, natural-sounding text in four distinct styles:
monologue, narrative, dialogue, and storytelling. The text was generated using real
Ghanaian news topics as inspiration, which keeps it grounded in Ghanaian named entities and vocabulary.
This dataset was built to support the development of Twi language models, tokenizers,
and other NLP tools for the Akan language family.
## Dataset Details
| Field | Detail |
|------------- |-------------------------------|
| Language | Twi (tw) — Akan, Ghana |
| License | CC BY 4.0 |
| Author | Mich-Seth Owusu |
| Organization | Ghana NLP Community |
| Format | Parquet (columnar) |
## Columns
| Column | Type | Description |
|--------------|--------|------------------------------------------|
| `twi` | string | Twi text snippet |
| `style` | string | Monologue, Narrative, Dialogue, Storyful |
| `char_count` | int | Character count of the Twi text |
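
As a quick illustration of how these three columns fit together, the sketch below checks a few rows against the schema and counts rows per style. The rows are hypothetical placeholders rather than records from the dataset, and it assumes that `char_count` counts Unicode code points the way Python's `len()` does, which this card does not state.

```python
from collections import Counter

# Hypothetical rows that mimic the three documented columns; the Twi strings
# are placeholders, not records taken from the dataset.
rows = [
    {"twi": "Akwaaba", "style": "Monologue", "char_count": 7},
    {"twi": "Me da wo ase", "style": "Dialogue", "char_count": 12},
    {"twi": "Ɛte sɛn?", "style": "Dialogue", "char_count": 8},
]

# The four style labels listed in the Columns table.
STYLES = {"Monologue", "Narrative", "Dialogue", "Storyful"}


def check_row(row):
    """Return a list of problems found in one row (empty if the row looks fine)."""
    problems = []
    if row["style"] not in STYLES:
        problems.append(f"unknown style: {row['style']!r}")
    # Assumption: char_count counts Unicode code points, as len() does.
    if row["char_count"] != len(row["twi"]):
        problems.append("char_count does not match len(twi)")
    return problems


# Number of rows per style, and the rows that fail any check.
style_counts = Counter(row["style"] for row in rows)
bad_rows = [row for row in rows if check_row(row)]
```

The same checks should apply unchanged to rows read from the Parquet files, provided each row is exposed as a mapping with these three keys.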
## Text Styles
The dataset contains Twi text categorized into four distinct styles:
- **Monologue** (*ɔkasafo biako*) — passionate, personal perspective
- **Narrative** (*abakɔsɛm*) — journalistic, structured account
- **Dialogue** (*nkɔmmɔdie*) — conversational exchange
- **Storyful** (*anansesɛm*) — dramatic, metaphorical storytelling
## Generation
Text was generated with the Gemini API, using paragraphs from Ghanaian news as
prompts. The output was split into stylistic chunks and then cleaned: speaker
labels and markdown artifacts were removed, and responses that failed Twi language validation were discarded.
The code for generating the dataset can be found here: https://github.com/GhanaNLP/NLP-scripts/blob/main/text/generate_twi-gemini.py
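
To give a rough idea of what the cleaning step might involve, here is a minimal sketch that strips speaker labels and common markdown artifacts from generated text. It is not taken from the linked script: the regular expressions and the function names `clean_line` and `clean_text` are assumptions made for illustration, and the Twi language validation step is left out because its method is not described here.

```python
import re

# Hypothetical patterns; the linked script may use different rules.
# A speaker label is assumed to be a short name followed by a colon at line start.
SPEAKER_LABEL = re.compile(r"^\s*[^\W\d_][\w .'-]{0,30}:\s+")
# Leading markdown markers: headings, list bullets, block quotes.
MARKDOWN_PREFIX = re.compile(r"^\s*(?:#{1,6}\s+|[-*+]\s+|>\s+)")
# Inline emphasis and code markers.
EMPHASIS = re.compile(r"(\*\*|__|\*|`)")


def clean_line(line):
    """Strip markdown markers first, then a leading speaker label."""
    line = MARKDOWN_PREFIX.sub("", line)
    line = EMPHASIS.sub("", line)
    line = SPEAKER_LABEL.sub("", line)
    return line.strip()


def clean_text(text):
    """Clean every line and drop the lines that end up empty."""
    cleaned = (clean_line(line) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)


raw = "## Nkɔmmɔ\n\n**Kofi:** Ɛte sɛn?\n- Ama: Ɛyɛ."
result = clean_text(raw)
```

The speaker-label pattern is a heuristic: any line that opens with a short phrase followed by a colon is treated as a label, so a real pipeline would probably need a stricter rule to avoid trimming ordinary sentences.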
---
*Made with ❤️ by Mich-Seth Owusu for the Ghana NLP Community*
Provider:
ghananlpcommunity



