fiifinketia/pristine-twi-english
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/fiifinketia/pristine-twi-english
Description:
---
language:
- tw
- en
license: cc-by-4.0
tags:
- twi
- akan
- ghana
- africa
- nlp
- translation
- parallel-corpus
task_categories:
- translation
size_categories:
- 100K<n<1M
pretty_name: Pristine Twi-English Parallel Dataset
---
# Pristine Twi–English Parallel Dataset
A large-scale **Twi ↔ English** parallel dataset derived from the
[Pristine Twi Dataset](https://huggingface.co/datasets/ghananlpcommunity/pristine-twi)
by the Ghana NLP Community.
The Twi source texts span four distinct styles — **Monologue, Narrative, Dialogue,
and Storyful** — grounded in real Ghanaian news topics to capture authentic vocabulary
and named entities. Each Twi passage has been translated into fluent English using
a batch translation pipeline powered by the Gemini API.
---
## Dataset Details
| Field | Detail |
|---|---|
| Languages | Twi (`tw`) → English (`en`) |
| License | CC BY 4.0 |
| Author | Mich-Seth Owusu |
| Organisation | Ghana NLP Community |
| Source dataset | [ghananlpcommunity/pristine-twi](https://huggingface.co/datasets/ghananlpcommunity/pristine-twi) |
| Translation model | Gemini API (batch + one-by-one fallback) |
| Format | Parquet |
---
## Columns
| Column | Type | Description |
|---|---|---|
| `twi` | string | Original Twi source text |
| `style` | string | Text style: Monologue, Narrative, Dialogue, Storyful |
| `english_translation` | string | Fluent English translation of the Twi text |
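As an illustration of the column layout above, the following sketch maps rows to a nested `{"translation": {...}}` record and counts rows per style. The row values are placeholders rather than real samples, and `to_translation_pair` is a hypothetical helper that is not part of the dataset or its tooling.

```python
from collections import Counter

# Placeholder rows using the three documented columns. The text values are
# stand-ins, not real samples from the dataset.
rows = [
    {"twi": "<twi text 1>", "style": "Monologue", "english_translation": "<english text 1>"},
    {"twi": "<twi text 2>", "style": "Dialogue", "english_translation": "<english text 2>"},
    {"twi": "<twi text 3>", "style": "Dialogue", "english_translation": "<english text 3>"},
]


def to_translation_pair(row):
    """Map one row to a nested record keyed by language code (hypothetical helper)."""
    return {
        "translation": {"tw": row["twi"], "en": row["english_translation"]},
        "style": row["style"],
    }


pairs = [to_translation_pair(r) for r in rows]
style_counts = Counter(r["style"] for r in rows)

print(pairs[0]["translation"]["en"])
print(style_counts.most_common())
```

The same function could be applied to rows loaded from the Parquet files, since each loaded row behaves like a dictionary with these three keys.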
---
## Text Styles
The dataset preserves the four styles from the original Pristine Twi corpus:
- **Monologue** *(ɔkasafoɔ biako)* — passionate first-person perspective
- **Narrative** *(abakɔsɛm)* — journalistic, structured account
- **Dialogue** *(nkɔmmɔdie)* — conversational exchange between speakers
- **Storyful** *(anansesɛm)* — dramatic, metaphorical storytelling
---
## How the Translations Were Generated
Translations were produced using the **Gemini API** with a two-stage pipeline:
1. **Batch translation** — texts were grouped into batches of 30 and sent to the
model in a single prompt, separated by a `---` delimiter.
2. **One-by-one fallback** — if the model's response could not be parsed into the
expected number of translations, each text in the batch was retried individually.
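The two stages above can be sketched roughly as follows. This is not the authors' code (that lives in the linked repository); `call_model` is a hypothetical stand-in for a Gemini API request, `split_response` and `translate_all` are illustrative names, and `fake_model` only simulates a model so the control flow can be run offline.

```python
from typing import Callable, List

DELIMITER = "---"  # separator named in the description above
BATCH_SIZE = 30    # batch size named in the description above


def split_response(response: str, expected: int) -> List[str]:
    """Split a batch response on delimiter lines; raise if the count is wrong."""
    parts, current = [], []
    for line in response.splitlines():
        if line.strip() == DELIMITER:
            parts.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    parts.append("\n".join(current).strip())
    parts = [p for p in parts if p]
    if len(parts) != expected:
        raise ValueError(f"expected {expected} translations, got {len(parts)}")
    return parts


def translate_all(texts: List[str], call_model: Callable[[str], str],
                  batch_size: int = BATCH_SIZE) -> List[str]:
    """Translate in batches, retrying a batch text by text if parsing fails."""
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = f"\n{DELIMITER}\n".join(batch)
        try:
            results.extend(split_response(call_model(prompt), len(batch)))
        except ValueError:
            # One-by-one fallback: the batch response was not parseable.
            results.extend(call_model(text).strip() for text in batch)
    return results


def fake_model(prompt: str) -> str:
    """Simulated model: upper-cases text and drops the delimiters whenever a
    batch contains the word "bad", which forces the fallback path."""
    out = prompt.upper()
    if "bad" in prompt and DELIMITER in prompt:
        out = out.replace(DELIMITER, " ")
    return out


translated = translate_all(["a", "b", "bad", "c"], fake_model, batch_size=2)
print(translated)
```

In this sketch the count check is the only validation, so a response with the right number of parts but misaligned content would pass unnoticed.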
Resume support was built in: already-translated rows were detected from the output
file and skipped, allowing the pipeline to be safely interrupted and restarted.
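A minimal sketch of this kind of resume logic is shown below. The output format (one JSON object per line) and the names `load_done` and `run` are assumptions made for illustration; the actual script may store and detect completed rows differently.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of source texts already present in the output file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["twi"])
    return done


def run(texts, translate, path):
    """Translate only the texts missing from `path`, appending each result."""
    done = load_done(path)
    new_count = 0
    with open(path, "a", encoding="utf-8") as f:
        for text in texts:
            if text in done:
                continue  # already translated in an earlier run
            record = {"twi": text, "english_translation": translate(text)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # keep completed rows on disk if the run is interrupted
            done.add(text)
            new_count += 1
    return new_count


# Demo with a stand-in translator (str.upper) and a temporary output file.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "translations.jsonl")
    first = run(["a", "b"], str.upper, out_path)
    second = run(["a", "b", "c"], str.upper, out_path)
    print(first, second)
```

The second call only translates the one new text. A production version would also need to handle a truncated final line left behind by an interruption mid-write.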
The translation code is available at:
[GhanaNLP/NLP-scripts](https://github.com/GhanaNLP/NLP-scripts)
---
## Usage
```python
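from datasets import load_dataset
# Repository ID from the dataset card; the copy listed on this page is fiifinketia/pristine-twi-english.
ds = load_dataset("ghananlpcommunity/pristine-twi-english")
# Rows have the columns `twi`, `style`, and `english_translation`.
print(ds["train"][0])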
```
---
## Intended Uses
- Twi ↔ English machine translation training and evaluation
- Low-resource NLP benchmarking for Akan/Twi
- Cross-lingual transfer learning
- Building Twi tokenisers and language models with aligned English supervision
---
## Limitations & Biases
- Translations were generated by a large language model and have **not been
manually verified** by native Twi speakers. Some translations may contain
errors, especially for idiomatic or culturally specific phrases.
- Source texts were synthetically generated from Ghanaian news topics, so the
domain coverage, while broad, may not represent all registers of spoken Twi.
- The dataset skews toward **written, formal Twi** and may under-represent
colloquial or dialectal variation across Akan sub-groups (Asante Twi, Akuapem
Twi, Fante, etc.).
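Because the translations are unverified, users may want a cheap sanity pass before training. The sketch below flags rows that are empty, copied through untranslated, or have an unusual character-length ratio. It is a generic heuristic rather than something the dataset authors describe; `flag_suspect` is a hypothetical helper and the thresholds are arbitrary assumptions. It cannot detect fluent but incorrect translations.

```python
def flag_suspect(row, min_ratio=0.3, max_ratio=3.0):
    """Return a reason string for a suspicious row, or None if it looks plausible.

    The ratio thresholds are illustrative defaults, not tuned for Twi-English.
    """
    src = row["twi"].strip()
    tgt = row["english_translation"].strip()
    if not src or not tgt:
        return "empty"
    if src == tgt:
        return "untranslated"
    ratio = len(tgt) / len(src)
    if ratio < min_ratio or ratio > max_ratio:
        return "length_ratio"
    return None


# Placeholder rows; the strings are synthetic and only exercise each branch.
samples = [
    {"twi": "x" * 10, "english_translation": "y" * 12},
    {"twi": "abc", "english_translation": ""},
    {"twi": "abc def", "english_translation": "abc def"},
    {"twi": "x" * 100, "english_translation": "y" * 10},
]
reasons = [flag_suspect(r) for r in samples]
print(reasons)
```

Flagged rows are better reviewed by a Twi speaker than dropped automatically, since short idioms can legitimately expand or contract in translation.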
---
## Citation
If you use this dataset, please cite the original Pristine Twi dataset and
acknowledge the Ghana NLP Community:
```bibtex
@dataset{owusu2024pristinetwieng,
author = {Owusu, Mich-Seth and {Ghana NLP Community}},
title = {Pristine Twi–English Parallel Dataset},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ghananlpcommunity/pristine-twi-english}
}
```
---
Made with ❤️ by [Mich-Seth Owusu](https://huggingface.co/ghananlpcommunity) for the
**Ghana NLP Community**.
Provided by:
fiifinketia



