
ghananlpcommunity/ghanaian-english-words-corrected-transcriptions

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/ghanaian-english-words-corrected-transcriptions
Resource description:
---
dataset_info:
  features:
  - name: srt_file
    dtype: string
  - name: batch_index
    dtype: int64
  - name: original
    dtype: string
  - name: corrected
    dtype: string
  - name: length_original
    dtype: int64
  - name: length_corrected
    dtype: int64
  - name: word_count_original
    dtype: int64
  - name: word_count_corrected
    dtype: int64
  splits:
  - name: train
    num_bytes: 29585409.473585416
    num_examples: 342339
  - name: test
    num_bytes: 3287296.5264145834
    num_examples: 38038
  download_size: 15655187
  dataset_size: 32872706.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Ghanaian English Transcript Corrections

A dataset of mistranscribed words and phrases from YouTube videos of Ghanaian news media, corrected using Llama 3.1 405B.

## Source

- Extracted from YouTube transcripts of Ghanaian news channels
- Covers ASR errors that are common with Ghanaian English accents and local terminology

## Correction Process

- Raw transcripts were reviewed for transcription errors
- Corrections were generated using the Llama 3.1 405B model
- Corrections were human-verified for accuracy

## Schema

| Column | Description |
|--------|-------------|
| `srt_file` | Source video identifier |
| `original` | Mistranscribed word/phrase (YouTube transcription) |

## Examples

| Original | Corrected | Context |
|----------|-----------|---------|
| CD | cedi | Currency |
| Galami | galamsey | Illegal mining |
| Mame | Mahmud | Name |
| on clause | UNCLOS | Acronym |
| provide a a | provides a | Grammar |

## Usage

```python
from datasets import load_dataset

# Load the train and test splits from the Hugging Face Hub.
dataset = load_dataset("ghananlpcommunity/ghanaian-english-words-corrected-transcriptions")
```
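As a rough illustration of how the `original` and `corrected` columns might be used, the sketch below builds a phrase lookup from rows and applies it to a transcript string. The function names `build_correction_map` and `apply_corrections` are hypothetical and not part of the dataset or of any library. The sketch assumes each row is a mapping with string `original` and `corrected` fields, as listed in `dataset_info`. The sample rows are copied from the Examples table so the code runs without downloading anything.

```python
import re
from typing import Dict, Iterable, Mapping


def build_correction_map(rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each lower-cased `original` phrase to its `corrected` form.

    Hypothetical helper: assumes every row has string `original` and
    `corrected` fields. The first correction seen for a phrase is kept.
    """
    corrections: Dict[str, str] = {}
    for row in rows:
        original = row["original"].strip()
        corrected = row["corrected"].strip()
        if original and original != corrected:
            corrections.setdefault(original.lower(), corrected)
    return corrections


def apply_corrections(text: str, corrections: Mapping[str, str]) -> str:
    """Replace whole-word, case-insensitive matches of known phrases."""
    if not corrections:
        return text
    # Longest phrases first, so multi-word entries win over shorter ones.
    phrases = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(p) for p in phrases) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: corrections[m.group(1).lower()], text)


# Sample rows taken from the Examples table above.
sample_rows = [
    {"original": "CD", "corrected": "cedi"},
    {"original": "Galami", "corrected": "galamsey"},
    {"original": "on clause", "corrected": "UNCLOS"},
    {"original": "provide a a", "corrected": "provides a"},
]

lookup = build_correction_map(sample_rows)
print(apply_corrections("The CD fell as Galami spread", lookup))
# -> The cedi fell as galamsey spread
```

With the real data, `dataset["train"]` could be passed in place of `sample_rows`, since iterating a split yields one dict per row. Blind substitution ignores context (for example, "CD" may legitimately mean a compact disc), so a lookup like this is better suited to exploring the corrections than to production post-processing.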
Provider:
ghananlpcommunity