ghananlpcommunity/ghanaian-english-words-corrected-transcriptions
Hugging Face · Updated 2026-02-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/ghanaian-english-words-corrected-transcriptions
Resource description:
---
dataset_info:
  features:
  - name: srt_file
    dtype: string
  - name: batch_index
    dtype: int64
  - name: original
    dtype: string
  - name: corrected
    dtype: string
  - name: length_original
    dtype: int64
  - name: length_corrected
    dtype: int64
  - name: word_count_original
    dtype: int64
  - name: word_count_corrected
    dtype: int64
  splits:
  - name: train
    num_bytes: 29585409.473585416
    num_examples: 342339
  - name: test
    num_bytes: 3287296.5264145834
    num_examples: 38038
  download_size: 15655187
  dataset_size: 32872706.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# Ghanaian English Transcript Corrections
A dataset of mistranscribed words and phrases from Ghanaian news media YouTube videos, corrected using Llama 3.1 405B.
## Source
- Extracted from YouTube transcripts of Ghanaian news channels
- Captures ASR errors typical of Ghanaian-accented English and local terminology
## Correction Process
- Raw transcripts reviewed for transcription errors
- Corrections generated using Llama 3.1 405B model
- Human-verified for accuracy
## Schema
| Column | Description |
|--------|-------------|
| `srt_file` | Source video identifier |
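| `original` | Mistranscribed word/phrase (YouTube transcription) |
| `corrected` | Corrected word/phrase |
| `batch_index` | Batch index (integer) |
| `length_original` | Length of `original` |
| `length_corrected` | Length of `corrected` |
| `word_count_original` | Word count of `original` |
| `word_count_corrected` | Word count of `corrected` |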
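
The dataset metadata also lists `length_*` and `word_count_*` features for each pair. The card does not say how they were computed, so the following is only a minimal sketch under the assumption that lengths are character counts and words are split on whitespace. The helper name `derive_stats` is hypothetical; the input pair is one of this card's examples.

```python
def derive_stats(original: str, corrected: str) -> dict:
    """Compute length and word-count fields for one original/corrected pair.

    Assumption: lengths are character counts and words are whitespace-separated.
    The dataset card does not document the actual computation.
    """
    return {
        "length_original": len(original),
        "length_corrected": len(corrected),
        "word_count_original": len(original.split()),
        "word_count_corrected": len(corrected.split()),
    }


# Example pair taken from this card ("on clause" corrected to "UNCLOS")
stats = derive_stats("on clause", "UNCLOS")
print(stats)
```

Comparing such recomputed values with the stored columns on a few rows would be one way to check whether the assumption holds.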
## Examples
| Original | Corrected | Context |
|----------|-----------|---------|
| CD | cedi | Currency |
| Galami | galamsey | Illegal mining |
| Mame | Mahmud | Name |
| on clause | UNCLOS | Acronym |
| provide a a | provides a | Grammar |
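
To see which words a correction actually changed, the pairs in the table above can be compared word by word. The following is a minimal sketch using Python's standard `difflib`. The helper name `word_diff` is hypothetical and not part of the dataset or its tooling.

```python
from difflib import SequenceMatcher


def word_diff(original: str, corrected: str) -> list:
    """Return (operation, original_words, corrected_words) for each changed span."""
    src, dst = original.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replaced, deleted, or inserted words
            changes.append((op, " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


# Pairs copied from the Examples table above
pairs = [("CD", "cedi"), ("on clause", "UNCLOS"), ("provide a a", "provides a")]
for original, corrected in pairs:
    print(f"{original!r} -> {corrected!r}: {word_diff(original, corrected)}")
```

Working at the word level keeps the output readable for short phrases; a character-level diff may be more useful for single-word corrections such as `Galami` to `galamsey`.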
## Usage
```python
from datasets import load_dataset

# Repository ID as given in this page's title and download link
dataset = load_dataset("ghananlpcommunity/ghanaian-english-words-corrected-transcriptions")
```
Provider:
ghananlpcommunity



