A Bilingual Corpus for Twi-English Translation
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/x3f8w84s7h
Description:
This dataset is a multi-domain Twi-English parallel corpus of 16,085 sentence pairs covering five thematic domains: casual text, depressed text, medical, toxic language, and agriculture. Unlike most existing Twi corpora, which draw primarily on religious or formal texts, our dataset captures everyday and specialized language use. Data were collected by native Twi-speaking household collectors and through a custom web-based Streamlit platform, with a three-tier quality control process yielding an acceptance rate of 95.3%. The Twi side contains 111,253 tokens with a vocabulary of 8,393 unique forms, while the English side contains 113,487 tokens and 7,427 unique forms. A sentence-length correlation of r = 0.7738 shows that Twi and English sentence lengths track each other closely, which is consistent with strong translational fidelity across the corpus. The dataset was developed to support research in machine translation, sentiment analysis, mental health text classification, and content moderation.
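The corpus statistics quoted above (token counts, vocabulary sizes, and the sentence-length correlation) could be recomputed along the following lines. This is only a minimal sketch under stated assumptions: the listing does not say how tokens were counted or which correlation coefficient was used, so whitespace tokenization and the Pearson coefficient are assumed here, and the function names and placeholder sentences are hypothetical rather than taken from the dataset.

```python
import math


def corpus_stats(sentences):
    """Return (token_count, vocabulary_size) for a list of sentences.

    Assumption: tokens are whitespace-separated; the dataset authors
    may have used a different tokenizer.
    """
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    return len(tokens), len(set(tokens))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_correlation(pairs):
    """Correlate source and target sentence lengths, measured in tokens."""
    src_lengths = [len(src.split()) for src, _ in pairs]
    tgt_lengths = [len(tgt.split()) for _, tgt in pairs]
    return pearson_r(src_lengths, tgt_lengths)


if __name__ == "__main__":
    # Placeholder pairs standing in for (Twi, English) sentences;
    # these are NOT drawn from the actual corpus.
    pairs = [("a b", "x y z"), ("a b c d", "x y z w v"), ("a", "x")]
    print(corpus_stats([src for src, _ in pairs]))
    print(length_correlation(pairs))
```

Applied to the full set of sentence pairs, the same functions would give figures comparable to those reported, although exact agreement depends on matching the authors' tokenization.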
Created:
2026-02-23



