
fiifinketia/pristine-twi-english

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/fiifinketia/pristine-twi-english
Resource description:

---
language:
- tw
- en
license: cc-by-4.0
tags:
- twi
- akan
- ghana
- africa
- nlp
- translation
- parallel-corpus
task_categories:
- translation
size_categories:
- 100K<n<1M
pretty_name: Pristine Twi-English Parallel Dataset
---

# Pristine Twi–English Parallel Dataset

A large-scale **Twi ↔ English** parallel dataset derived from the [Pristine Twi Dataset](https://huggingface.co/datasets/ghananlpcommunity/pristine-twi) by the Ghana NLP Community.

The Twi source texts span four distinct styles, **Monologue, Narrative, Dialogue, and Storyful**, grounded in real Ghanaian news topics to capture authentic vocabulary and named entities. Each Twi passage has been translated into fluent English using a batch translation pipeline powered by the Gemini API.

---

## Dataset Details

| Field | Detail |
|---|---|
| Languages | Twi (`tw`) → English (`en`) |
| License | CC BY 4.0 |
| Author | Mich-Seth Owusu |
| Organisation | Ghana NLP Community |
| Source dataset | [ghananlpcommunity/pristine-twi](https://huggingface.co/datasets/ghananlpcommunity/pristine-twi) |
| Translation model | Gemini API (batch + one-by-one fallback) |
| Format | Parquet |

---

## Columns

| Column | Type | Description |
|---|---|---|
| `twi` | string | Original Twi source text |
| `style` | string | Text style: Monologue, Narrative, Dialogue, Storyful |
| `english_translation` | string | Fluent English translation of the Twi text |

---

## Text Styles

The dataset preserves the four styles from the original Pristine Twi corpus:

- **Monologue** *(ɔkasafoɔ biako)*: passionate first-person perspective
- **Narrative** *(abakɔsɛm)*: journalistic, structured account
- **Dialogue** *(nkɔmmɔdie)*: conversational exchange between speakers
- **Storyful** *(anansesɛm)*: dramatic, metaphorical storytelling

---

## How the Translations Were Generated

Translations were produced using the **Gemini API** with a two-stage pipeline:

1. **Batch translation**: texts were grouped into batches of 30 and sent to the model in a single prompt, separated by a `---` delimiter.
2. **One-by-one fallback**: if the model's response could not be parsed into the expected number of translations, each text in the batch was retried individually.

Resume support was built in: already-translated rows were detected from the output file and skipped, allowing the pipeline to be safely interrupted and restarted.

The translation code is available at: [GhanaNLP/NLP-scripts](https://github.com/GhanaNLP/NLP-scripts)

---

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ghananlpcommunity/pristine-twi-english")
print(ds["train"][0])
```

---

## Intended Uses

- Twi ↔ English machine translation training and evaluation
- Low-resource NLP benchmarking for Akan/Twi
- Cross-lingual transfer learning
- Building Twi tokenisers and language models with aligned English supervision

---

## Limitations & Biases

- Translations were generated by a large language model and have **not been manually verified** by native Twi speakers. Some translations may contain errors, especially for idiomatic or culturally specific phrases.
- Source texts were synthetically generated from Ghanaian news topics, so the domain coverage, while broad, may not represent all registers of spoken Twi.
- The dataset skews toward **written, formal Twi** and may under-represent colloquial or dialectal variation across Akan sub-groups (Asante Twi, Akuapem Twi, Fante, etc.).
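
As a rough illustration of the two-stage pipeline described under "How the Translations Were Generated", the sketch below batches texts in groups of 30, joins them with the `---` delimiter, checks that the response splits into the expected number of translations, and otherwise retries each text individually. It is not the project's actual code (that lives in GhanaNLP/NLP-scripts). Every name here (`make_batches`, `build_prompt`, `parse_response`, `translate_all`, `call_model`) is hypothetical, the prompt wording is an assumption, and `call_model` stands in for whatever function sends a prompt to the Gemini API and returns its text. Resume support is approximated with an in-memory dictionary of finished rows rather than by reading an output file.

```python
from typing import Callable, Dict, Iterator, List, Optional

DELIMITER = "---"   # separator named in the dataset card
BATCH_SIZE = 30     # batch size named in the dataset card


def make_batches(texts: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def build_prompt(batch: List[str]) -> str:
    """Join the Twi passages into one prompt (wording is an assumption)."""
    header = (
        "Translate each Twi passage into fluent English. "
        f"Separate the translations with a line containing only {DELIMITER}.\n\n"
    )
    return header + f"\n{DELIMITER}\n".join(batch)


def parse_response(response: str, expected: int) -> Optional[List[str]]:
    """Split on delimiter-only lines; return None if the count is wrong."""
    parts: List[str] = []
    current: List[str] = []
    for line in response.splitlines():
        if line.strip() == DELIMITER:
            parts.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    parts.append("\n".join(current).strip())
    parts = [p for p in parts if p]  # drop empty segments
    return parts if len(parts) == expected else None


def translate_all(
    texts: List[str],
    call_model: Callable[[str], str],
    done: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Translate `texts`, skipping anything already present in `done`."""
    results = dict(done or {})
    pending = [t for t in texts if t not in results]  # resume: skip finished rows
    for batch in make_batches(pending):
        parsed = parse_response(call_model(build_prompt(batch)), len(batch))
        if parsed is None:
            # Fallback: retry each text in the batch on its own.
            parsed = [call_model(build_prompt([t])).strip() for t in batch]
        results.update(zip(batch, parsed))
    return results
```

In this sketch the count check is the only safeguard: any response that does not split into exactly as many segments as the batch contains is discarded, and the whole batch is retried one text at a time.
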

---

## Citation

If you use this dataset, please cite the original Pristine Twi dataset and acknowledge the Ghana NLP Community:

```bibtex
@dataset{owusu2024pristinetwieng,
  author    = {Owusu, Mich-Seth and {Ghana NLP Community}},
  title     = {Pristine Twi–English Parallel Dataset},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ghananlpcommunity/pristine-twi-english}
}
```

---

Made with ❤️ by [Mich-Seth Owusu](https://huggingface.co/ghananlpcommunity) for the **Ghana NLP Community**.
Provider:
fiifinketia