
ghananlpcommunity/pristine-twi

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/pristine-twi
Description:
---
language:
- tw
license: cc-by-4.0
task_categories:
- text-generation
- fill-mask
pretty_name: Pristine Twi
tags:
- twi
- akan
- ghana
- nlp
- africa
---

# Pristine Twi Dataset

A large-scale Twi language dataset containing clean, natural-sounding Twi text in four distinct styles: monologue, narrative, dialogue, and storytelling. The text was generated using real Ghanaian news topics as inspiration, which keeps it grounded in Ghanaian named entities and vocabulary.

This dataset was built to support the development of Twi language models, tokenizers, and other NLP tools for the Akan language family.

## Dataset Details

| Field        | Detail                 |
|--------------|------------------------|
| Language     | Twi (tw); Akan, Ghana  |
| License      | CC BY 4.0              |
| Author       | Mich-Seth Owusu        |
| Organization | Ghana NLP Community    |
| Format       | Parquet (columnar)     |

## Columns

| Column       | Type   | Description                              |
|--------------|--------|------------------------------------------|
| `twi`        | string | Twi text snippet                         |
| `style`      | string | Monologue, Narrative, Dialogue, Storyful |
| `char_count` | int    | Character count of the Twi text          |

## Text Styles

The dataset contains Twi text categorized into four distinct styles:

- **Monologue** (*ɔkasafo biako*): passionate, personal perspective
- **Narrative** (*abakɔsɛm*): journalistic, structured account
- **Dialogue** (*nkɔmmɔdie*): conversational exchange
- **Storyful** (*anansesɛm*): dramatic, metaphorical storytelling

## Generation

Text was generated using the Gemini API with Ghanaian news paragraphs as prompts, separated into stylistic chunks, then cleaned to remove speaker labels, markdown artifacts, and any responses that failed Twi language validation.

The code for generating the dataset can be found here: https://github.com/GhanaNLP/NLP-scripts/blob/main/text/generate_twi-gemini.py

---

*Made with ❤️ by Mich-Seth Owusu for the Ghana NLP Community*
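
The cleaning step described under Generation (removing speaker labels and markdown artifacts) can be illustrated with a minimal sketch. This is not the project's actual code, which lives in the linked script. The regular expressions, the function names `clean_snippet` and `make_row`, and the sample text are all hypothetical, and `char_count` is assumed to be the number of Unicode code points. Twi language validation is not shown.

```python
import re

# The four style labels listed in the Columns table.
STYLES = {"Monologue", "Narrative", "Dialogue", "Storyful"}

# Hypothetical patterns: the card does not specify the real cleaning rules.
# Bold/underscore markers, backticks, leading "#" headings, leading bullets.
MARKDOWN_MARKS = re.compile(r"\*\*|__|`|^#{1,6}\s+|^[-*]\s+")
# A short name followed by a colon at the start of a line, e.g. "Kofi: ".
SPEAKER_LABEL = re.compile(r"^[^\W\d_][\w .'-]{0,30}:\s+")


def clean_snippet(text: str) -> str:
    """Strip markdown marks and speaker labels, then join lines with spaces."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        line = MARKDOWN_MARKS.sub("", line)   # markdown first, so "**Kofi:**" becomes "Kofi:"
        line = SPEAKER_LABEL.sub("", line)    # then drop the leading "Kofi: "
        line = line.strip()
        if line:
            cleaned.append(line)
    return " ".join(cleaned)


def make_row(text: str, style: str) -> dict:
    """Build one record with the card's three columns: twi, style, char_count."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    twi = clean_snippet(text)
    # Assumption: char_count is len() of the cleaned string (code points).
    return {"twi": twi, "style": style, "char_count": len(twi)}


if __name__ == "__main__":
    # Made-up sample input, not taken from the dataset.
    raw = "**Kofi:** Maakye, wo ho te sɛn?\n**Ama:** Me ho yɛ, medaase."
    print(make_row(raw, "Dialogue"))
```

A pattern this simple also strips any short sentence opener that ends in a colon, so a real pipeline would likely need stricter rules or a list of known speaker names.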
Provider:
ghananlpcommunity