
soynade-research/Bambara-Speech-Translation-Data

Source: Hugging Face. Updated 2026-02-22; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/soynade-research/Bambara-Speech-Translation-Data
---
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
- translation
language:
- bm
- en
tags:
- bambara
- African-Next-Voices
- ANV
- RobotsMali
- afvoices
- asr
pretty_name: Robots Backtranslated
---

# AfVoices-Translated (Bambara-English)

This is a Bambara speech translation dataset, which is built on the **African Next Voices (AfVoices)** Bambara ASR corpus. It provides English translations for the **human-corrected subset** of the original collection, creating a parallel corpus for Bambara-English machine translation and speech-to-text tasks.

## Methodology

We machine-translated the human-validated transcriptions from AfVoices using the [Oolel-translator](https://github.com/soynade-research/oolel-translator) repository.

- **Inference Engine**: [ms-swift](https://github.com/modelscope/ms-swift) with **vLLM**.
- **Source Data**: `human-corrected` subset (~159 hours / 260k samples).
- **Status**: Machine-translated; human expert validation is the next step.

### Translation Prompt

To ensure technical consistency with the original ASR data, the following prompt was used:

> *"As an expert translator, provide only the natural English translation of the following Bambara text while preserving all tags ([um], [cs], [noise], [?], [pause]) exactly as they appear without any additional commentary"*

## Transcription Tags

This dataset preserves the original acoustic event tags to maintain synchronization with the audio:

- `[um]`: Vocalized pauses/fillers
- `[cs]`: Code-switching or foreign words
- `[noise]`: Background noise
- `[?]`: Inaudible or overlapped speech
- `[pause]`: Long silence (>3-5 seconds)

## Credits & Acknowledgments

We would like to credit **[RobotsMali](https://huggingface.co/RobotsMali)** and the **African Next Voices (ANV)** project for the original data collection and human-corrected transcriptions. This work would not be possible without their efforts to build open-source resources for underrepresented African languages.

### Original Citation

If you use this dataset, please cite the original AfVoices paper:

```bibtex
@misc{diarra2025dealinghardfactslowresource,
  title={Dealing with the Hard Facts of Low-Resource African NLP},
  author={Yacouba Diarra and Nouhoum Souleymane Coulibaly and Panga Azazia Kamaté and Madani Amadou Tall and Emmanuel Élisé Koné and Aymane Dembélé and Michael Leventhal},
  year={2025},
  eprint={2511.18557},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.18557},
}
```

## Disclaimer

These translations are currently **machine-generated**. Users should be aware that while the Bambara source is human-corrected, the English "translation" column may contain model-specific errors or hallucinations until the final human validation phase is complete.
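
Because the translations are machine-generated and the prompt asks the model to keep every tag exactly as it appears, one cheap sanity check before human validation is to compare tag counts between a Bambara source string and its English translation. The following is a minimal sketch of such a check, not part of the dataset's tooling: the function names `tag_counts` and `tags_preserved` are hypothetical, and the check compares counts rather than positions on the assumption that word order can legitimately change in translation.

```python
import re
from collections import Counter

# Tag inventory copied from the "Transcription Tags" section of the card.
TAGS = ("[um]", "[cs]", "[noise]", "[?]", "[pause]")

# Escape each tag so that "[", "]" and "?" are matched literally.
TAG_PATTERN = re.compile("|".join(re.escape(tag) for tag in TAGS))


def tag_counts(text: str) -> Counter:
    """Count how often each known tag occurs in a string (hypothetical helper)."""
    return Counter(TAG_PATTERN.findall(text))


def tags_preserved(source: str, translation: str) -> bool:
    """Return True when both strings carry the same tags the same number of times.

    Only counts are compared, not order: this is an assumption, chosen because
    a translation may move a tag relative to the surrounding words.
    """
    return tag_counts(source) == tag_counts(translation)
```

A row whose translation drops or duplicates a tag fails the check and can be queued for review. The strings below are placeholders, not real dataset rows:

```python
tags_preserved("aaa [um] bbb [noise]", "xxx [um] yyy [noise]")  # tags match
tags_preserved("aaa [um] bbb", "xxx yyy")                        # [um] was dropped
```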

Provider:
soynade-research