A Bilingual Corpus for Twi-English Translation

NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://data.mendeley.com/datasets/x3f8w84s7h
Description:
This dataset is a multi-domain Twi-English parallel corpus of 16,085 sentence pairs, covering five thematic domains: casual text, depressed text, medical, toxic language, and agriculture. Unlike most existing Twi corpora, which draw primarily from religious or formal texts, our dataset captures everyday and specialized language use. Data was collected through native Twi-speaking household collectors and a custom web-based Streamlit platform, with a three-tier quality control process yielding an acceptance rate of 95.3%. The Twi side contains 111,253 tokens with a vocabulary of 8,393 unique forms, while the English side contains 113,487 tokens and 7,427 unique forms. A sentence-length correlation of r = 0.7738 between the Twi and English sides indicates strong translational fidelity across the corpus. The dataset was developed to support machine translation, sentiment analysis, mental health text classification, and content moderation research.
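
The corpus statistics quoted above (token counts, unique forms, and the sentence-length correlation) can be recomputed along the following lines. This is a minimal sketch under stated assumptions: the description does not say how the text was tokenized or which correlation coefficient was used, so whitespace tokenization and Pearson's r are assumed here, and the function names and the toy pairs are hypothetical placeholders, not data from the corpus.

```python
import math


def corpus_stats(sentences):
    """Return (token_count, unique_form_count) for one side of the corpus.

    Assumption: tokens are whitespace-separated; the dataset's own
    tokenizer is not specified and may give different counts.
    """
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    return len(tokens), len(set(tokens))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant lengths")
    return cov / math.sqrt(var_x * var_y)


def length_correlation(pairs):
    """Correlate source and target sentence lengths, measured in tokens."""
    src_lens = [len(src.split()) for src, _ in pairs]
    tgt_lens = [len(tgt.split()) for _, tgt in pairs]
    return pearson_r(src_lens, tgt_lens)


if __name__ == "__main__":
    # Placeholder strings standing in for (Twi, English) sentence pairs.
    toy_pairs = [("a b", "x y"), ("a", "x"), ("a b c", "x y z w")]
    print(corpus_stats([src for src, _ in toy_pairs]))
    print(round(length_correlation(toy_pairs), 3))
```

Run over the full set of sentence pairs, `length_correlation` gives a figure comparable to the reported r only if the assumed tokenization and coefficient match those used by the dataset authors.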
Created:
2026-02-23