Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT
Source: Hugging Face · Updated: 2025-12-10 · Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/Ghana-NLP/TWI_ENGLISH_PARALLEL_TEXT
Resource description:
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>GhanaNLP Dataset</title>
<!-- Link to Font Awesome for icons -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
<style>
/* Global Styles */
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 0;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
box-sizing: border-box;
}
.section {
display: flex;
flex-wrap: wrap;
width: 100%;
margin-bottom: 20px;
}
.column {
flex: 1;
padding: 10px;
box-sizing: border-box;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 10px;
border: 1px solid #ccc;
}
td, th {
border: 1px solid #ccc;
padding: 8px;
}
hr {
border: 1px solid #ccc;
margin: 20px 0;
}
</style>
</head>
<body>
<div class="container">
<!--hr -->
<!-- Title Section -->
<div class="section">
<div class="column">
<p style="font-size: 20px; font-weight: bold;">GhanaNLP Twi and English Parallel Data</p>
<p style="font-size: 12px;">
<a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" download style="text-decoration: none; color: #0073e6;">
Twi_to_English <i class="fas fa-download" style="font-size: 12px; margin-left: 4px;"></i>
</a>
<span style="color: #666;"> • 1 MB • XLS</span><br>
<a href="https://creating.com/" target="_blank" download style="text-decoration: none; color: #0073e6;">
English_to_Twi <i class="fas fa-download" style="font-size: 12px; margin-left: 4px;"></i>
</a>
<span style="color: #666;"> • 1 MB • XLS</span>
</p>
</div>
<div class="column">
<p>The GhanaNLP Twi dataset contains sentence pairs in Twi and English, designed to support translation models between these two languages. Twi is a Ghanaian local language that lacks extensive digital resources, making this dataset useful for language processing tools. The sentence pairs take context into consideration. This dataset aims to bridge the language barrier, enabling more inclusive AI applications and language technology solutions in Ghana. This dataset can help improve translation tools for Twi-English language pairs, enhance accessibility for Twi speakers, and provide learning materials for language learners interested in Twi. It may also aid in offering localized content to Twi speakers, thus promoting linguistic inclusivity.</p>
</div>
</div>
<hr>
<!-- Metadata Section -->
<div class="section">
<div class="column">
<strong>PUBLISHER(S)</strong><br>
<span style="font-size: 16px;">Google LLC</span>
</div>
<div class="column">
<strong>INDUSTRY TYPE</strong><br>
<span style="font-size: 16px;">Corporate - Tech</span>
</div>
<div class="column">
<strong>DATACARD AUTHORS</strong><br>
GhanaNLP: Co-Author, 2024 <br>
</div>
</div>
<div class="section">
<div class="column">
<strong>FUNDING</strong><br>
<span style="font-size: 16px;">Google LLC</span>
</div>
<div class="column">
<strong>FUNDING TYPE</strong><br>
<span style="font-size: 16px;">Private Funding</span>
</div>
<div class ="column">
<strong>DATASET CONTACT</strong><br>
<a href="mailto:natural.language.processing.gh@gmail.com" style="text-decoration: none; color: #0073e6;">
natural.language.processing.gh
</a>
</div>
</div>
<hr>
<div class = "section">
<div class = "column">
<strong>DATASET PURPOSE</strong><br>
<span style="font-size: 16px;">Research</span><br>
<span style="font-size: 16px;">Education</span><br>
<span style="font-size: 16px;">Machine translation models and applications for translation between Twi and English.</span>
</div>
<div class = "column">
<strong>KEY APPLICATIONS</strong><br>
<span style="font-size: 16px;">Machine Translation</span><br>
<strong>PRIMARY MOTIVATION</strong><br>
<span style="font-size: 16px;">Improve access to information through linguistic inclusion, and bridge language barriers for underrepresented Ghanaian languages such as Twi in digital and natural language resources.</span><br>
</div>
<div class ="column">
<strong>INTENDED/SUITABLE USE CASES</strong><br>
<span style="font-size: 16px;">Machine Translation and Localization</span><br>
<span style="font-size: 16px;">Education and e-learning</span><br>
<span style="font-size: 16px;">Voice Assistants and Chatbots</span><br>
<span style="font-size: 16px;">Cultural Preservation</span>
</div>
</div>
<hr>
<div class="section">
<div class="column">
<strong>DATA SUBJECT</strong><br>
<span style="font-size: 16px;">Non-sensitive data on the Twi language of Ghana</span>
</div>
<div class="column">
<strong>DATASET SNAPSHOT</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;">Size of Dataset:</td>
<td style="border: 1px solid #ccc;">1 MB</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Instances:</td>
<td style="border: 1px solid #ccc;">14875</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Fields:</td>
<td style="border: 1px solid #ccc;">4</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Labeled Classes:</td>
<td style="border: 1px solid #ccc;">1</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Labels:</td>
<td style="border: 1px solid #ccc;">14875</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Average Label per Instance:</td>
<td style="border: 1px solid #ccc;">1</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Algorithmic Labels:</td>
<td style="border: 1px solid #ccc;">14875</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Human Labels:</td>
<td style="border: 1px solid #ccc;">14875</td>
</tr>
</table>
<strong>DATASET SOURCE</strong><br>
<ul style="list-style-type: disc; padding-left: 20px;">
<li>
Source text: <a href="your-link-here" target="_blank">Wikipedia dataset-002</a>;<br>
<a href="your-link-here" target="_blank">Oliver Twist text document</a>
</li>
<li>
Target text: Professional translation
</li>
</ul>
</div>
</div>
<div class="column">
<strong>CONTENT DESCRIPTION</strong><br>
<span style="font-size: 16px;">This parallel dataset contains sentence pairs designed to support machine translation and natural language processing applications. Texts include conversational phrases, proverbs, and cultural expressions, reflecting the language use in everyday Ghanaian contexts. The dataset also considers dialectal diversity within Twi and provides tonal information critical for meaning, making it suitable for both translation and linguistic research.</span><br><br>
<span style="font-size: 16px;">The dataset is divided into two parts: English-to-Twi translation and Twi-to-English translation. This structure is designed to ensure high accuracy and maintain contextual integrity in translations. </span>
</div>
</div>
<hr>
<div class = "section">
<!-- DATA MODALITY Column -->
<div class = "column" style="flex: 1;">
<strong>PRIMARY DATA MODALITY</strong><br>
<span style="font-size: 16px;">Textual Data</span>
</div>
<!-- EXAMPLE Column -->
<div class = "column" style="flex: 2;">
<strong>EXAMPLE OF ACTUAL DATA POINT WITH DESCRIPTIONS</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;">Source Language:</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">Twi</td>
<td style="border: 1px solid #ccc;">Language of the original text data collected</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Target Language</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">En</td>
<td style="border: 1px solid #ccc;">Language text was translated to</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Document ID</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">Twi</td>
<td style="border: 1px solid #ccc;">ID of the sheet containing all text data for the Twi language</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Text ID</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">46676</td>
<td style="border: 1px solid #ccc;">Created to identify each data point in the dataset</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Source Text</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">Wᴐkyerԑ sԑ, ԑsԑ sԑ yԑkora nhoma ne biribiara a ԑboa ma yԑtumi nya egya de di dwuma wᴐ mframagya mu no, na yԑnam so abᴐ yԑn ho ban afiri atoyerԑnkyԑm ho.</td>
<td style="border: 1px solid #ccc;">Text curated from local sources, books, news outlets, local linguists</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Target Text:</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">They suggest that we should keep books and anything that helps us to get fire for use in the wind, and thus protect ourselves from disasters.</td>
<td style="border: 1px solid #ccc;">Translation of text curated from multiple sources</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Dataset URL</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">#www.wiroifnf.cominputlink</td>
<td style="border: 1px solid #ccc;">Link to collected and compiled data from Novels, Wikipedia and Bolingo</td>
</tr>
</table>
</div>
</div>
</div>
<hr>
<div class = "section">
<div class = "column">
<strong>LICENSE TYPE(S)</strong><br>
<span style="font-size: 16px;">N/A</span>
</div>
<div class = "column">
<strong>LICENSE BREAKDOWN</strong><br>
<span style="font-size: 16px;">N/A</span>
</div>
<div class = "column">
<strong>LICENSE PERMISSIONS</strong><br>
<li> Share - copy and share the dataset in any reusable format or medium.
</li>
<li> Adapt - transform or update the dataset for reuse for any purpose except commercial use.
</li>
<li> Attribution - give appropriate credit, provide a link to the license, and state any changes made.
</li>
<li> Non-Commercial - commercial use of the dataset is not allowed unless permission is granted.
</li>
<li> Share Alike - adaptations must be distributed under the same license terms.
</li>
</div>
</div>
<hr>
<div class = "section">
<!-- VERSION STATUS Column -->
<div class = "column">
<strong>VERSION STATUS</strong><br>
<strong><span style="font-size: 16px;">Limited Maintenance</span></strong>
</div>
<!-- DATASET STATUS Column -->
<div class = "column">
<strong>DATASET STATUS</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Version:</strong></td>
<td style="border: 1px solid #ccc;">1.0</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Last Modified:</strong></td>
<td style="border: 1px solid #ccc;">07/2023</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>First Released:</strong></td>
<td style="border: 1px solid #ccc;">07/2023</td>
</tr>
<tr>
<td colspan="2" style="border: 1px solid #ccc; background-color: #f0f0f0; text-align: center;">
<strong>Note:</strong> This dataset may be updated later to maintain the accuracy and relevance of the translations.
</td>
</tr>
</table>
</div>
</div>
<!-- MAINTENANCE PLAN Column -->
<div class = "column">
<strong>MAINTENANCE PLAN</strong><br>
<li> Updates are handled by a dedicated data team and engineering team. The maintenance process includes revisiting labels and ensuring data quality. The team will stay up to date on the latest trends, new vocabulary, and changes in language usage. </li>
<li> Data may be updated with more data points and comments for more clarity, insight and quality per data point.</li>
</div>
</div>
<hr>
<div class="section">
<!--COLLECTION METHOD column-->
<div class="column">
<strong>DATA COLLECTION METHOD(S)</strong><br>
<strong><span style="font-size: 16px;">Scraped from domestic sources and the internet</span></strong><br>
<strong><span style="font-size: 16px;">Independent paid professionals</span></strong>
</div>
<!--DATA SOURCES BY COLLECTION METHOD(S) column-->
<div class="column">
<strong>DATA SOURCES BY COLLECTION METHOD(S)</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Scraped </strong></td>
<td style="border: 1px solid #ccc;">Wikipedia, stories, novels, and Bolingo (source text)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Translation </strong></td>
<td style="border: 1px solid #ccc;">Human translations by independent paid professionals (target text)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Annotations </strong></td>
<td style="border: 1px solid #ccc;">Human added labels and metadata</td>
</tr>
</table>
</div>
</div>
<!--SUMMARIES OF DATA COLLECTION METHOD column-->
<div class="column">
<strong>SUMMARIES OF DATA COLLECTION METHOD</strong><br>
<li> <strong>Scraped: </strong>Sentences obtained from local text materials (source text).</li>
<li> <strong>Translation: </strong> Source text was professionally translated into the target language, with a focus on accurate, context-aware translation. </li>
<li> <strong>Annotations: </strong>Human-added labels such as id, target language, and source text aid comprehension of the dataset.</li><br>
<strong>DATA COLLECTION CRITERIA - SCRAPING</strong><br>
<li> Sentences were retrieved from the Wikipedia dataset 500 at a time and put into a CSV file. Each sentence must have a subject, a verb, and a complement.</li>
<li> Sentences were also extracted from stories/novels, Bolingo, and domestic sources. Text was cleaned and processed to remove URLs, empty lines, and special characters. </li>
<li> The subject of each sentence must not start with a pronoun. Short sentences (fewer than 4 words) were also ignored, as they often do not make sense.</li>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td colspan="2" style="border: 1px solid #ccc; background-color: #f0f0f0; text-align: center;">
<strong>Note:</strong> This dataset does not record the source of each individual data point; sources are given only for the dataset as a whole.
</td>
</tr>
</table>
</div>
</div>
</div>
<hr>
<div class="section">
<!--LABELING METHOD(S) column-->
<div class="column">
<strong>LABELING METHOD(S)</strong><br>
<strong><span style="font-size: 16px;">Human labels</span></strong><br>
<strong><span style="font-size: 16px;">Algorithmic labels</span></strong>
</div>
<!--LABEL TYPE(S) column-->
<div class="column">
<strong>LABEL TYPE(S)</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td colspan="2" style="border: 1px solid #ccc; text-align: left;">
<strong>Human labels</strong>
</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>label</strong></td>
<td style="border: 1px solid #ccc;">Source text translated into English by paid professionals </td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>comment</strong></td>
<td style="border: 1px solid #ccc;">Annotated to give insights and quality assessments. These may be updated in the future (target text)</td>
</tr>
<tr>
<td colspan="2" style="border: 1px solid #ccc; text-align: left;">
<strong>Algorithmic labels</strong>
</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>id</strong></td>
<td style="border: 1px solid #ccc;">Sequential number generated for each data point</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>text</strong></td>
<td style="border: 1px solid #ccc;">Extracted from Bolingo and other online sources</td>
</tr>
</table>
</div>
</div>
<!--LABELING PROCEDURE column-->
<div class="column">
<strong>LABELING PROCEDURE</strong><br>
<strong>Human Labels</strong><br>
Translations were made after reading the complete sentence, to ensure highly accurate, context-aware translations. Comments were created to give insight into each data point.<br>
<strong>Algorithmic Labels</strong><br>
<li> The <strong>id</strong> was generated sequentially for each row of data.</li>
<li> The <strong>text</strong> was extracted from sources such as the Oliver Twist text files, Bolingo, Wikipedia, and the Luganda text, among others.</li>
</div>
</div>
<hr>
<div class="section">
<!--SAMPLING METHODS column-->
<div class="column">
<strong>SAMPLING METHODS</strong><br>
<strong><span style="font-size: 16px;">Purposive Sampling</span></strong>
</div>
<div class="column">
<strong>SAMPLING BREAKDOWN </strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Total Data Sampled:</strong></td>
<td style="border: 1px solid #ccc;">90,000 entries</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Sample size</strong></td>
<td style="border: 1px solid #ccc;">14875 entries</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>SAMPLING CRITERIA</strong><br>
<li> <strong>Dialects:</strong> Includes Akwapim Twi, Asante Twi, and other varieties of Twi spoken among the Akan people of Ghana. </li>
<li> <strong>Minimum text length:</strong> At least four (4) words per sentence, to provide enough context. </li>
<li> <strong>Context and Domain:</strong> Spans different domains and culturally specific terms. </li>
<li> <strong>Data Quality:</strong> Each sentence must have a subject, verb, and complement and must not start with a pronoun, ensuring that the texts are relevant to the domains being studied and represent everyday language usage in Twi.<br> </li>
</div>
</div>
<hr>
<!-- Known application Section -->
<div class="section">
<div class="column">
<strong>ML APPLICATION(S)</strong><br>
<span style="font-size: 16px;">Machine Translation</span>
</div>
<div class="column">
<strong>EVALUATION RESULTS</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI </a></span><br>
<a href="https:/model.card/contact" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya-model-card </a><br>
Evaluation Results<br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation Method 1 </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation Method 2 </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation Method 3 </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>EVALUATION PROCESS(ES)</strong><br>
<strong>Evaluation Method Used:</strong> method summary <br>
<li><strong>Process:</strong> method summary </li>
<li><strong>Factors:</strong> method summary </li>
<li><strong>Considerations:</strong> method summary </li>
<li><strong>Results:</strong> method summary </li>
</div>
</div>
<!-- Known application Section continued-->
<div class="section">
<div class="column">
</div>
<div class="column">
<strong>DESCRIPTION(S) AND STATISTIC(S)</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI </a></span><br>
<a href="https:/model.card/contact" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya-model-card </a><br>
<span><strong>Model Description:</strong><a href="https://translate.ghananlp.org" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI </a> is a translation model, built by the GhanaNLP team, that focuses on cross-lingual information flow for low-resource languages, including Twi.</span><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse: collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Model size </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Model weight</strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Model layers </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>EXPECTED PERFORMANCE AND KNOWN CAVEATS</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org" target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI </a></span><br>
<span> <strong>Expected Performance: </strong> summary</span><br>
<span> <strong>Known Caveats: </strong> summary</span><br>
<span> <strong>Additional Notes if any: </strong> summary</span><br>
</div>
</div>
</div>
</body>
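The data collection criteria listed in the card (remove URLs, empty lines, and special characters; drop sentences with fewer than four words; drop sentences that start with a pronoun) can be illustrated with a short Python sketch. This is not the GhanaNLP team's actual pipeline. The pronoun list, the definition of "special characters", and all function names below are assumptions made for illustration, and the subject/verb/complement check is not covered because it would need a syntactic parser.

```python
import re

# Hypothetical pronoun list: the card does not say which words counted as pronouns.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

# Matches http(s) URLs and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Assumption: "special characters" means anything other than letters, digits,
# whitespace, and basic punctuation. \w is Unicode-aware, so Twi letters survive.
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?'\"-]")

MIN_WORDS = 4  # the card ignores sentences with fewer than 4 words


def clean_line(line: str) -> str:
    """Strip URLs and special characters, then collapse whitespace."""
    line = URL_RE.sub(" ", line)
    line = SPECIAL_RE.sub(" ", line)
    return " ".join(line.split())


def keep_sentence(sentence: str) -> bool:
    """Apply the minimum-length and no-leading-pronoun criteria."""
    words = sentence.split()
    if len(words) < MIN_WORDS:
        return False
    first = words[0].strip(".,;:!?'\"-").lower()
    return first not in PRONOUNS


def filter_sentences(lines):
    """Clean each line, drop empty ones, and keep sentences that pass the criteria."""
    cleaned = (clean_line(line) for line in lines)
    return [s for s in cleaned if s and keep_sentence(s)]


if __name__ == "__main__":
    raw = [
        "Visit https://example.com for more details today.",
        "",
        "They went home early yesterday.",
        "Too short.",
        "Accra is the capital of Ghana.",
    ]
    for sentence in filter_sentences(raw):
        print(sentence)
```

Keeping cleaning and filtering in separate functions makes it easy to swap in a different pronoun list (for example, Twi pronouns when the source text is Twi) without touching the cleaning step.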
Provider:
Ghana-NLP



