
ghananlpcommunity/ghana-english-corrected-transcriptions

Source: Hugging Face. Updated 2026-02-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ghananlpcommunity/ghana-english-corrected-transcriptions
Description:
---
dataset_info:
  features:
  - name: original_text
    dtype: string
  - name: corrected_text
    dtype: string
  splits:
  - name: train
    num_bytes: 144958537
    num_examples: 770430
  download_size: 98533457
  dataset_size: 144958537
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Ghana English Corrected Transcriptions Dataset

A semi-synthetic dataset of original and corrected transcriptions from Ghanaian news media, designed for training models that correct automatic speech recognition (ASR) and text-to-speech (TTS) output.

## Dataset Description

This dataset contains paired examples of transcribed Ghanaian English news content. The "original" transcriptions are synthetically corrupted versions of clean text, while the "corrected" versions are the ground truth. This approach allows large-scale generation of training data while preserving linguistic authenticity.

The corruption process uses real-world error patterns extracted from parallel word pairs (common ASR/TTS mistranscriptions in Ghanaian English), so the synthetic errors reflect actual system failures encountered in the Ghanaian linguistic context.

## Dataset Generation Process

### 1. Data Sources

- **Clean corpus**: High-quality corrected transcriptions from Ghanaian news media (`parallel_transcripts_merged-tts.csv`).
- **Error patterns**: Real-world word-level mistranscription pairs (`parralel_words_merged.csv`).

### 2. Synthetic Corruption Pipeline

The dataset is generated through a controlled injection process:

1. **Filtering**: Clean sentences that already contain known error patterns are removed to prevent "double-corruption."
2. **Error Injection**: For sentences containing words present in the error dictionary:
   - Valid words are randomly replaced with their common mistranscriptions.
   - Multiple errors can be injected per sentence to simulate high-noise environments.
   - Case and punctuation are preserved to maintain syntactic structure.
3. **Clean Samples**: Approximately 10% of sentences are kept error-free to make models more robust against over-correction.

### 3. Example Transformation

| Component               | Example                                              |
| ----------------------- | ---------------------------------------------------- |
| **Original clean text** | "The government announced new agricultural policies" |
| **Injected errors**     | "The govament announce new agric policies"           |
| **Corrected text**      | "The government announced new agricultural policies" |

## Dataset Structure

| Column           | Description                                                       |
| ---------------- | ----------------------------------------------------------------- |
| `original_text`  | Synthetically corrupted transcription (simulating ASR/TTS output) |
| `corrected_text` | Ground truth corrected transcription                              |

## Statistics

- **Total examples**: 770,430 rows
- **With injected errors**: ~90% of dataset
- **Clean (no errors)**: ~10% of dataset

## Usage

```
from datasets import load_dataset

# Load the dataset (the repository name ends in "transcriptions")
dataset = load_dataset("ghananlpcommunity/ghana-english-corrected-transcriptions")

# Access the train split, the only split this dataset defines
train_data = dataset["train"]

# Print the first corrupted/corrected pair
example = train_data[0]
print(f"Original (corrupted): {example['original_text']}")
print(f"Corrected: {example['corrected_text']}")
```

## Intended Use Cases

- Training seq2seq models (T5, BART, etc.) for Ghanaian English text correction.
- Fine-tuning Large Language Models (LLMs) for ASR post-processing.
- Developing TTS output refinement systems specific to West African accents.
- Benchmarking correction model performance on Ghanaian English dialectal variations.

## Limitations

- **Semi-synthetic nature**: While error patterns are derived from real mistranscriptions, the specific combinations are artificially generated.
- **Domain specific**: Optimized for news media; may not fully generalize to conversational or informal "Pidgin" Ghanaian English.
- **Error distribution**: Synthetic errors may not match the distribution of errors that real-world ASR systems produce.

## Citation

If you use this dataset in your research, please cite it as follows:

```
@dataset{ghana_english_corrected_transcription,
  author = {Owusu, Mich-Seth},
  title = {Ghana English Corrected Transcriptions Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/ghananlpcommunity/ghana-english-corrected-transcriptions}
}
```

## Acknowledgments

This dataset was created by Mich-Seth Owusu for the Ghana NLP Community.
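
## Illustrative Corruption Sketch

The three pipeline steps described in this card (filtering, error injection, clean samples) can be sketched roughly as below. This is a minimal illustration and not the author's actual generation script: the `ERROR_PATTERNS` dictionary is a hypothetical stand-in built from the card's example row instead of `parralel_words_merged.csv`, and the function names, `replace_prob`, and `clean_rate` parameters are assumptions introduced here.

```python
import random
import re

# Hypothetical error dictionary: correct word -> known mistranscriptions.
# Taken from the card's example row, NOT loaded from the real CSV.
ERROR_PATTERNS = {
    "government": ["govament"],
    "announced": ["announce"],
    "agricultural": ["agric"],
}

# Words are runs of letters/apostrophes; everything else (spaces,
# punctuation) is left untouched by the substitution.
WORD_RE = re.compile(r"[A-Za-z']+")


def match_case(source, replacement):
    """Copy the capitalisation style of `source` onto `replacement`."""
    if len(source) > 1 and source.isupper():
        return replacement.upper()
    if source[0].isupper():
        return replacement.capitalize()
    return replacement


def is_already_corrupted(sentence, patterns):
    """Filtering step: True if the sentence contains a known mistranscription."""
    bad_forms = {bad for variants in patterns.values() for bad in variants}
    return any(word.lower() in bad_forms for word in WORD_RE.findall(sentence))


def corrupt(sentence, patterns, rng, replace_prob=0.5, clean_rate=0.1):
    """Return a possibly corrupted copy of `sentence`."""
    # Clean-sample step: keep roughly `clean_rate` of sentences unchanged.
    if rng.random() < clean_rate:
        return sentence

    def swap(match):
        word = match.group(0)
        variants = patterns.get(word.lower())
        # Injection step: each dictionary word is replaced independently,
        # so one sentence can receive several errors.
        if variants and rng.random() < replace_prob:
            return match_case(word, rng.choice(variants))
        return word

    return WORD_RE.sub(swap, sentence)


def build_pairs(clean_sentences, patterns, seed=0):
    """Produce rows shaped like the dataset's two columns."""
    rng = random.Random(seed)
    pairs = []
    for sentence in clean_sentences:
        if is_already_corrupted(sentence, patterns):
            continue  # avoid "double-corruption"
        pairs.append({
            "original_text": corrupt(sentence, patterns, rng),
            "corrected_text": sentence,
        })
    return pairs


# Force every dictionary word to be replaced so the result is deterministic.
demo = corrupt(
    "The government announced new agricultural policies.",
    ERROR_PATTERNS,
    random.Random(0),
    replace_prob=1.0,
    clean_rate=0.0,
)
print(demo)  # The govament announce new agric policies.
```

Substituting through a regular expression callback is one simple way to leave punctuation and spacing untouched while only dictionary words change; the real pipeline may handle tokenisation and sampling differently.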
Provider:
ghananlpcommunity