soynade-research/Bambara-Speech-Translation-Data
收藏Hugging Face2026-02-22 更新2026-04-05 收录
下载链接:
https://hf-mirror.com/datasets/soynade-research/Bambara-Speech-Translation-Data
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
- translation
language:
- bm
- en
tags:
- bambara
- African-Next-Voices
- ANV
- RobotsMali
- afvoices
- asr
pretty_name: Robots Backtranslated
---
# AfVoices-Translated (Bambara-English)
This is a Bambara speech translation dataset, which is built on the **African Next Voices (AfVoices)** Bambara ASR corpus. It provides English translations for the **human-corrected subset** of the original collection, creating a parallel corpus for Bambara-English machine translation and speech-to-text tasks.
## Methodology
We machine-translated the human-validated transcriptions from AfVoices using the [Oolel-translator](https://github.com/soynade-research/oolel-translator) repository.
- **Inference Engine**: [ms-swift](https://github.com/modelscope/ms-swift) with **vLLM**.
- **Source Data**: `human-corrected` subset (~159 hours / 260k samples).
- **Status**: Machine-translated; human expert validation is the next step.
### Translation Prompt
To ensure technical consistency with the original ASR data, the following prompt was used:
> *"As an expert translator, provide only the natural English translation of the following Bambara text while preserving all tags ([um], [cs], [noise], [?], [pause]) exactly as they appear without any additional commentary"*
## Transcription Tags
This dataset preserves the original acoustic event tags to maintain synchronization with the audio:
- `[um]`: Vocalized pauses/fillers
- `[cs]`: Code-switching or foreign words
- `[noise]`: Background noise
- `[?]`: Inaudible or overlapped speech
- `[pause]`: Long silence (>3-5 seconds)
## Credits \& Acknowledgments
We would like to credit **[RobotsMali](https://huggingface.co/RobotsMali)** and the **African Next Voices (ANV)** project for the original data collection and human-corrected transcriptions. This work would not be possible without their efforts to build open-source resources for underrepresented African languages.
### Original Citation
If you use this dataset, please cite the original AfVoices paper:
```bibtex
@misc{diarra2025dealinghardfactslowresource,
title={Dealing with the Hard Facts of Low-Resource African NLP},
author={Yacouba Diarra and Nouhoum Souleymane Coulibaly and Panga Azazia Kamaté and Madani Amadou Tall and Emmanuel Élisé Koné and Aymane Dembélé and Michael Leventhal},
year={2025},
eprint={2511.18557},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.18557},
}
```
## Disclaimer
These translations are currently **machine-generated**. Users should be aware that while the Bambara source is human-corrected, the English "translation" column may contain model-specific errors or hallucinations until the final human validation phase is complete.
---
license: 知识共享署名-相同方式共享4.0(CC BY-SA 4.0)
task_categories:
- 自动语音识别(Automatic Speech Recognition, ASR)
- 机器翻译
language:
- 班巴拉语(bm)
- 英语(en)
tags:
- 班巴拉语(Bambara)
- African-Next-Voices
- ANV
- RobotsMali
- afvoices
- 自动语音识别(ASR)
pretty_name: 机器人回译(Robots Backtranslated)
---
# 阿福语音翻译数据集(班巴拉语-英语)
本数据集为班巴拉语语音翻译数据集,基于**阿福语音(African Next Voices, AfVoices)**班巴拉语自动语音识别语料库构建,为原始语料库的**人工校正子集**提供英语译文,从而构建出可用于班巴拉语-英语机器翻译以及语音转文字任务的平行语料库。
## 构建方法
研究团队借助[Oolel翻译器(Oolel-translator)](https://github.com/soynade-research/oolel-translator)代码库,对阿福语音语料库中的人工校验转录文本进行机器翻译。
- **推理引擎**:搭载**vLLM**的[ms-swift](https://github.com/modelscope/ms-swift)框架。
- **源数据**:`人工校正`子集(约159小时/26万条样本)。
- **当前状态**:已完成机器翻译;下一步将开展人工专家校验工作。
### 翻译提示词
为确保与原始自动语音识别数据的技术一致性,本次翻译采用如下提示词:
> *"作为专业翻译人员,请仅对下述班巴拉语文本生成自然流畅的英语译文,且需严格保留所有标签([um]、[cs]、[noise]、[?]、[pause])的原始格式,不得添加任何额外注释。"*
## 转录标签
本数据集保留原始声学事件标签,以确保与音频文件的时序同步:
- `[um]`:带声停顿/填充词
- `[cs]`:语码转换或外来词汇
- `[noise]`:背景噪音
- `[?]`:无法听清或重叠语音
- `[pause]`:长静音(时长超过3-5秒)
## 致谢与贡献声明
衷心感谢**[RobotsMali](https://huggingface.co/RobotsMali)**与**阿福语音(African Next Voices, ANV)**项目团队完成原始数据采集与人工校正转录工作。若没有他们为弱势非洲语言构建开源资源的努力,本研究无法顺利开展。
### 原始文献引用
若您使用本数据集,请引用阿福语音项目的原始论文:
bibtex
@misc{diarra2025dealinghardfactslowresource,
title={Dealing with the Hard Facts of Low-Resource African NLP},
author={Yacouba Diarra and Nouhoum Souleymane Coulibaly and Panga Azazia Kamaté and Madani Amadou Tall and Emmanuel Élisé Koné and Aymane Dembélé and Michael Leventhal},
year={2025},
eprint={2511.18557},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.18557},
}
## 免责声明
当前生成的译文均为**机器自动产出**。请注意:尽管班巴拉语源文本已完成人工校正,但英语译文列仍可能存在模型特定的错误或幻觉内容,直至最终人工校验阶段完成后方可修正。
提供机构:
soynade-research



