Ghana-NLP/GA_ENGLISH_PARALLEL_TEXT
Source: Hugging Face | Updated 2025-12-10 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/Ghana-NLP/GA_ENGLISH_PARALLEL_TEXT
Resource description:
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>GhanaNLP Dataset</title>
<!-- Link to Font Awesome for icons -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
<style>
/* Global Styles */
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 0;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
box-sizing: border-box;
}
.section {
display: flex;
flex-wrap: wrap;
width: 100%;
margin-bottom: 20px;
}
.column {
flex: 1;
padding: 10px;
box-sizing: border-box;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 10px;
border: 1px solid #ccc;
}
td, th {
border: 1px solid #ccc;
padding: 8px;
}
hr {
border: 1px solid #ccc;
margin: 20px 0;
}
</style>
</head>
<body>
<div class="container">
<hr>
<!-- Title Section -->
<div class="section">
<div class="column">
<p style="font-size: 20px; font-weight: bold;">GhanaNLP Ga and English
Data</p>
<p style="font-size: 12px;">
<a href="https://creativecommons.org/licenses/by/4.0/" target="_blank"
download style="text-decoration: none; color: #0073e6;">
Ga_to_English <i class="fas fa-download" style="font-size: 12px;
margin-left: 4px;"></i>
</a>
<span style="color: #666;"> • 1 MB • XLS</span><br>
</p>
</div>
<div class="column">
<p>The GhanaNLP Ga dataset contains sentence pairs in Ga and English,
designed to support translation models between the two languages.
Ga is a Ghanaian language with few digital resources, which makes
this dataset useful for building language processing tools. The sentence
pairs were translated with context in mind. The dataset aims to bridge
the language barrier and enable more inclusive AI applications and
language technology in Ghana. It can help improve Ga-English translation
tools, enhance accessibility for Ga speakers, and provide material for
learners of Ga. It may also help deliver localized content to Ga
speakers, promoting linguistic inclusivity.</p>
</div>
</div>
<hr>
<!-- Metadata Section -->
<div class="section">
<div class="column">
<strong>PUBLISHER</strong><br>
<span style="font-size: 16px;">Google LLC</span>
</div>
<div class="column">
<strong>INDUSTRY TYPE</strong><br>
<span style="font-size: 16px;">Corporate - Tech</span>
</div>
<div class="column">
<strong>DATASET AUTHORS</strong><br>
GhanaNLP: Co-Author, 2024 <br>
</div>
</div>
<hr>
<div class="section">
<div class="column">
<strong>FUNDING</strong><br>
<span style="font-size: 16px;">Google LLC</span>
</div>
<div class="column">
<strong>FUNDING TYPE</strong><br>
<span style="font-size: 16px;">Private Funding</span>
</div>
<div class ="column">
<strong>DATASET CONTACT</strong><br>
<a href="mailto:natural.language.processing.gh@gmail.com"
style="text-decoration: none; color: #0073e6;">
natural.language.processing.gh
</a>
</div>
</div>
<hr>
<div class = "section">
<div class = "column">
<strong>DATASET PURPOSE</strong><br>
<span style="font-size: 16px;">Research</span><br>
<span style="font-size: 16px;">Education</span><br>
<span style="font-size: 16px;">Machine translation models and
applications focused on cross-lingual translation between
Ga and English.</span>
</div>
<div class = "column">
<strong>KEY APPLICATIONS</strong><br>
<span style="font-size: 16px;">Machine Translation</span><br>
<strong>PRIMARY MOTIVATION</strong><br>
<span style="font-size: 16px;">Improve access to information through
linguistic inclusion, and bridge language barriers for
underrepresented Ghanaian languages such as Ga in digital and natural
language resources.</span><br>
</div>
<div class ="column">
<strong>INTENDED/SUITABLE USE CASES</strong><br>
<span style="font-size: 16px;">Machine Translation and Localization</span><br>
<span style="font-size: 16px;">Education and e-learning</span><br>
<span style="font-size: 16px;">Voice Assistants and Chatbots</span><br>
<span style="font-size: 16px;">Cultural Preservation</span>
</div>
</div>
<hr>
<div class="section">
<div class="column">
<strong>DATA SUBJECT</strong><br>
<span style="font-size: 16px;">Non-sensitive data on
the Ga language of Ghana</span>
</div>
<div class="column">
<strong>DATASET SNAPSHOT</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td style="border: 1px solid #ccc;">Size of Dataset:</td>
<td style="border: 1px solid #ccc;">1 MB</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Instances:</td>
<td style="border: 1px solid #ccc;">11652</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Fields:</td>
<td style="border: 1px solid #ccc;">4</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Labeled Classes:</td>
<td style="border: 1px solid #ccc;">1</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Number of Labels:</td>
<td style="border: 1px solid #ccc;">11652</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Average Labels per Instance:
</td>
<td style="border: 1px solid #ccc;">1</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Algorithmic Labels:</td>
<td style="border: 1px solid #ccc;">11652</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Human Labels:</td>
<td style="border: 1px solid #ccc;">11652</td>
</tr>
</table>
<strong>DATASET SOURCE</strong><br>
<ul style="list-style-type: disc; padding-left: 20px;">
<li>
Source text: <a href="your-link-here"
target="_blank">Wikipedia dataset-002</a>
</li>
<li>
Target text: <a href="your-link-here" target="_blank">Oliver Twist text
document</a>
</li>
<li>
Target text: Professional translation
</li>
</ul>
</div>
</div>
<div class="column">
<strong>CONTENT DESCRIPTION</strong><br>
<span style="font-size: 16px;">This dataset contains sentence
pairs designed to support machine translation and natural
language processing applications. Texts include conversational
phrases, proverbs, and cultural expressions, reflecting the language
use in everyday Ghanaian contexts. The dataset also considers
dialectal diversity within Ga and provides tonal information
critical for meaning, making it suitable for both translation and
linguistic research.</span><br><br>
</div>
</div>
<hr>
<div class = "section">
<!-- DATA MODALITY Column -->
<div class = "column" style="flex: 1;">
<strong>PRIMARY DATA MODALITY</strong><br>
<span style="font-size: 16px;">Textual Data</span>
</div>
<!-- EXAMPLE Column -->
<div class = "column" style="flex: 2;">
<strong>EXAMPLE OF ACTUAL DATA POINT WITH DESCRIPTIONS</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse:
collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;">Source Language</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">Ga
</td>
<td style="border: 1px solid #ccc;">Language of the original text
data collected</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Target Language</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">En
</td>
<td style="border: 1px solid #ccc;">Language text was translated
to</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Document ID</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">Ga
</td>
<td style="border: 1px solid #ccc;">ID of the sheet containing all
text data for the Ga language</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Text ID</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;
">61249</td>
<td style="border: 1px solid #ccc;">Created to identify each data
point in the dataset</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Source Text</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;
">Abobalɔi boteɔ maŋ mli daa afi.
</td>
<td style="border: 1px solid #ccc;">Text curated from local sources,
books, news outlets, and local linguists</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Target Text</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">
A big number of refugees enter the country every year.</td>
<td style="border: 1px solid #ccc;">Translation of text curated
from local sources</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;">Dataset URL</td>
<td style="border: 1px solid #ccc; background-color: #f0f0f0;">#www.
wiroifnf.cominputlink</td>
<td style="border: 1px solid #ccc;">Link to the data collected and
compiled from novels, Wikipedia, and Bolingo</td>
</tr>
</table>
</div>
</div>
</div>
<hr>
<div class = "section">
<div class = "column">
<strong>LICENSE TYPE(S)</strong><br>
<span style="font-size: 16px;">N/A</span>
</div>
<div class = "column">
<strong>LICENSE BREAKDOWN</strong><br>
<span style="font-size: 16px;">N/A</span>
</div>
<div class = "column">
<strong>LICENSE PERMISSIONS</strong><br>
<li> Share - create a copy and share in any reusable format or medium.
</li>
<li> Adapt - transform or update for reuse for any purpose other than
commercial use.
</li>
<li> Attribution - give appropriate credit, provide a link to the
license, and state the adjustments made.
</li>
<li> Non-Commercial - commercial use of the dataset is restricted
unless permission is granted.
</li>
<li> Share Alike - adaptations must be distributed under the same license
terms.
</li>
</div>
</div>
<hr>
<div class = "section">
<!-- VERSION STATUS Column -->
<div class = "column">
<strong>VERSION STATUS</strong><br>
<strong><span style="font-size: 16px;">Limited Maintenance</span></strong>
</div>
<!-- DATASET STATUS Column -->
<div class = "column">
<strong>DATASET STATUS</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%; border-collapse:
collapse; margin-top: 10px; border: 1px solid #ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Version:</strong></td>
<td style="border: 1px solid #ccc;">1.0</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Last Modified:</strong>
</td>
<td style="border: 1px solid #ccc;">07/2023</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>First Released:</strong></td>
<td style="border: 1px solid #ccc;">07/2023</td>
</tr>
<tr>
<td colspan="2" style="border: 1px solid #ccc; background-color:
#f0f0f0; text-align: center;">
<strong>Note:</strong> This dataset may be updated later to
maintain the accuracy and relevance of the translations.
</td>
</tr>
</table>
</div>
</div>
<!-- MAINTENANCE PLAN Column -->
<div class = "column">
<strong>MAINTENANCE PLAN</strong><br>
<li> Updates are handled by a dedicated data team and an engineering team.
The maintenance process includes revisiting labels and ensuring data
quality. The team will stay up to date on the latest trends, new
vocabulary, and changes in language usage. </li>
<li> Data may be updated with more data points and comments for more
clarity, insight and quality per data point.</li>
</div>
</div>
<hr>
<div class="section">
<!--COLLECTION METHOD column-->
<div class="column">
<strong>DATA COLLECTION METHOD(S)</strong><br>
<strong><span style="font-size: 16px;">Scraped from domestic sources
and the internet</span></strong><br>
<strong><span style="font-size: 16px;">Independent paid
professionals</span></strong>
</div>
<!--DATA SOURCES BY COLLECTION METHOD(S) column-->
<div class="column">
<strong>DATA SOURCES BY COLLECTION METHOD(S)</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Scraped
</strong></td>
<td style="border: 1px solid #ccc;">Wikipedia, stories,
novels, and Bolingo (source text)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Translation
</strong></td>
<td style="border: 1px solid #ccc;">Human translations by
independent paid professionals (target text)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Annotations
</strong></td>
<td style="border: 1px solid #ccc;">Human-added labels
and metadata</td>
</tr>
</table>
</div>
</div>
<!--SUMMARIES OF DATA COLLECTION METHOD column-->
<div class="column">
<strong>SUMMARIES OF DATA COLLECTION METHOD</strong><br>
<li> <strong>Scraped: </strong>Sentences obtained from local text
materials (source text).</li>
<li> <strong>Translation: </strong>Source text was professionally
translated into the target language, with a focus on accurate,
context-aware translation.</li>
<li> <strong>Annotations: </strong>Human-added labels such as id,
target language, and source text aid comprehension of the
dataset.</li><br>
<strong>DATA COLLECTION CRITERIA - SCRAPING</strong><br>
<li> Sentences were retrieved from the Wikipedia dataset 500 at a
time and written to a CSV file. Each sentence must have a subject, a verb,
and a complement.</li>
<li> Sentences were also extracted from stories and novels, Bolingo, and
domestic sources. The text was cleaned to remove URLs, empty lines,
and special characters.</li>
<li> The subject of each sentence must not start with a pronoun. Short
sentences (fewer than four words) were ignored because they often do
not make sense on their own.</li>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td colspan="2" style="border: 1px solid #ccc;
background-color: #f0f0f0; text-align: center;">
<strong>Note:</strong> This dataset does not record the source
of each individual data point; sources are documented
collectively.
</td>
</tr>
</table>
</div>
</div>
</div>
<hr>
<div class="section">
<!--LABELING METHOD(S) column-->
<div class="column">
<strong>LABELING METHOD(S)</strong><br>
<strong><span style="font-size: 16px;">Human labels</span>
</strong><br>
<strong><span style="font-size: 16px;">Algorithmic labels</span>
</strong>
</div>
<!--LABEL TYPE(S) column-->
<div class="column">
<strong>LABEL TYPE(S)</strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td colspan="2" style="border: 1px solid #ccc;
text-align: left;">
<strong>Human labels</strong>
</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>label
</strong></td>
<td style="border: 1px solid #ccc;">Source text translated
into English by paid professionals. </td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>comment
</strong></td>
<td style="border: 1px solid #ccc;">Annotated to give
insights and quality assessments. These may be updated in
the future (target text).</td>
</tr>
<tr>
<td colspan="2" style="border: 1px solid #ccc;
text-align: left;">
<strong>Algorithmic labels</strong>
</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>id</strong></td>
<td style="border: 1px solid #ccc;">Sequential number
generated for each data point.</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>text
</strong></td>
<td style="border: 1px solid #ccc;">Extracted from
Bolingo and other online sources.</td>
</tr>
</table>
</div>
</div>
<!--LABELING PROCEDURE column-->
<div class="column">
<strong>LABELING PROCEDURE</strong><br>
<strong>Human Labels</strong><br>
Translators read each complete sentence before translating it
to ensure accurate, context-aware translations. Comments
were added to give insight into each data point.<br>
<strong>Algorithmic Labels</strong><br>
<li> The <strong>id</strong> was generated sequentially for each row
of data.</li>
<li> The <strong>text</strong> was extracted from sources such as the
Oliver Twist text files, Bolingo, Wikipedia, and the Luganda text, among
others.
</li>
</div>
</div>
<hr>
<div class="section">
<!--SAMPLING METHODS column-->
<div class="column">
<strong>SAMPLING METHODS</strong><br>
<strong><span style="font-size: 16px;">Purposive Sampling </span>
</strong>
</div>
<div class="column">
<strong>SAMPLING BREAKDOWN </strong><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Total Data
Sampled:</strong></td>
<td style="border: 1px solid #ccc;">90,000 entries</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Sample size
</strong></td>
<td style="border: 1px solid #ccc;">11,652 entries</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>SAMPLING CRITERIA</strong><br>
<li> <strong>Dialects:</strong> Includes Shai Ga, Tema Ga, Teshie Ga,
La Ga, and other variations of Ga spoken among the people of Greater
Accra, Ghana.
</li>
<li> <strong>Minimum text length:</strong> At least four (4) words
per sentence, for sufficient context. </li>
<li> <strong>Context and Domain:</strong> Spans different domains
and culture-specific terms. </li>
<li> <strong>Data Quality:</strong> Each sentence must have a
subject, verb, and complement and must not start with a pronoun,
so that the texts are relevant to the domains being studied and
represent everyday language usage in Ga.<br></li>
</div>
</div>
<hr>
<!-- Known application Section -->
<div class="section">
<div class="column">
<strong>ML APPLICATION(S)</strong><br>
<span style="font-size: 16px;">Machine Translation</span>
</div>
<div class="column">
<strong>EVALUATION RESULTS</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org"
target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI
</a></span><br>
<a href="https:/model.card/contact" target="_blank"
style="text-decoration: none; color: #0073e6;"> Khaya-model-card </a><br>
Evaluation Results<br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation
Method 1 </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation
Method 2</strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Evaluation
Method 3 </strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>EVALUATION PROCESS(ES)</strong><br>
<strong>Evaluation Method Used:</strong> method summary <br>
<li><strong>Process:</strong> method summary </li>
<li><strong>Factors:</strong> method summary </li>
<li><strong>Considerations:</strong> method summary </li>
<li><strong>Results:</strong> method summary </li>
</div>
</div>
<!-- Known application Section continued-->
<div class="section">
<div class="column">
</div>
<div class="column">
<strong>DESCRIPTION(S) AND STATISTIC(S)</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org"
target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI
</a></span><br>
<a href="https:/model.card/contact" target="_blank"
style="text-decoration: none; color: #0073e6;"> Khaya-model-card </a><br>
<span><strong>Model Description:</strong> <a href="https://translate.ghananlp.org"
target="_blank" style="text-decoration: none; color:
#0073e6;"> Khaya AI </a> is a language translation model focused on
cross-lingual information flow for low-resource languages,
including Ga, built by the GhanaNLP team.</span><br>
<div class="intro-text">
<table class="snapshot-table" style="width: 100%;
border-collapse: collapse; margin-top: 10px; border: 1px solid
#ccc;">
<tr>
<td style="border: 1px solid #ccc;"><strong>Model size
</strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Model weight
</strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
<tr>
<td style="border: 1px solid #ccc;"><strong>Model layers
</strong></td>
<td style="border: 1px solid #ccc;">123(params)</td>
</tr>
</table>
</div>
</div>
<div class="column">
<strong>EXPECTED PERFORMANCE AND KNOWN CAVEATS</strong><br>
<span style="font-size: 16px;"><a href="https://translate.ghananlp.org"
target="_blank" style="text-decoration: none; color: #0073e6;"> Khaya AI
</a></span><br>
<span> <strong>Expected Performance: </strong> summary</span><br>
<span> <strong>Known Caveats: </strong> summary</span><br>
<span> <strong>Additional Notes if any: </strong> summary</span><br>
</div>
</div>
</div>
</body>
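The data collection criteria listed in the card (remove URLs, empty lines, and special characters; drop sentences with fewer than four words; drop sentences that start with a pronoun) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not GhanaNLP's actual pipeline: the pronoun list, the definition of "special characters", and every function name are hypothetical, and the subject/verb/complement check is left out because it would need a syntactic parser that the card does not describe.

```python
import re

# Hypothetical pronoun list; the card does not say which words counted as pronouns.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

# Assumption: a URL is anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Assumption: "special characters" means anything other than letters, digits,
# whitespace, and basic punctuation. \w is Unicode-aware, so Ga letters such
# as ɔ, ɛ, and ŋ are preserved.
SPECIAL_PATTERN = re.compile(r"[^\w\s.,;:!?'-]")

MIN_WORDS = 4  # the card ignores sentences with fewer than four words


def clean_line(line: str) -> str:
    """Remove URLs and special characters, then collapse whitespace."""
    line = URL_PATTERN.sub(" ", line)
    line = SPECIAL_PATTERN.sub(" ", line)
    return " ".join(line.split())


def keep_sentence(sentence: str) -> bool:
    """Return True if the sentence passes the length and pronoun criteria."""
    words = sentence.split()
    if len(words) < MIN_WORDS:
        return False
    first_word = words[0].strip(".,;:!?'-").lower()
    return first_word not in PRONOUNS


def filter_sentences(lines):
    """Clean each line, drop empty lines, and keep sentences that qualify."""
    cleaned = (clean_line(line) for line in lines)
    return [s for s in cleaned if s and keep_sentence(s)]
```

Calling `filter_sentences` on a list of raw lines returns only the cleaned sentences that satisfy these criteria; the thresholds and word lists would need to be adapted to the real source material.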
Provider:
Ghana-NLP



