
omnilingual-asr-corpus

ModelScope community. Updated 2026-01-06; indexed 2025-11-15.
Download link:
https://modelscope.cn/datasets/facebook/omnilingual-asr-corpus
Description:
# Meta Omnilingual ASR Corpus

The Omnilingual ASR Corpus is a collection of spontaneous speech recordings and their transcriptions for 348 under-served languages. The corpus was collected as part of Meta FAIR’s Omnilingual ASR project ([blog](https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition/), [model](https://github.com/facebookresearch/omnilingual-asr), [paper](https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/)) for the purposes of training automatic speech recognition (ASR) and spoken language identification models.

## Data schema

```json
{
  "language": "lij_Latn",
  "iso_639_3": "lij",
  "iso_15924": "Latn",
  "glottocode": "geno1240",
  "prompt_id": "C086",
  "prompt": "What was the last thing you ate? Can you describe how it is made?",
  "speaker_id": "spk02",
  "segment_id": "s01",
  "audio": "<Audio data in FLAC format>",
  "raw_text": "Me son tòsto fæto un panetto co-o formaggio, ma quello a-a catalaña, saiva à dî con o pan un pittin brustolio e pöi a tomata sciaccâ in çimma, tanto euio e un pittin de sâ, e dapeu se ghe mette o companægo, into mæ caxo o formaggio."
}
```

## Language codes

Language codes in the `language` column follow the format `{lang}_{script}`, where `{lang}` is an ISO 639-3 three-letter language code, and `{script}` is an ISO 15924 four-letter script code. To allow for greater granularity when warranted, we provide the additional `glottocode` column, containing [Glottolog](http://glottolog.org/) languoid codes.

## Special tags

The following special tags were used in transcriptions (`raw_text` field) to mark laughter, fillers and other types of non-verbal content:

| Tag | Purpose |
|--------------------|-------------|
| `<laugh>` | The sound of laughter. |
| `<hesitation>` | A hesitation sound, often used by speakers while thinking of the next thing to say. In English, some common hesitation sounds are “err”, “um”, “huh”, etc. |
| `<unintelligible>` | A word or sequence of words that cannot be understood. |
| `<noise>` | Any other type of noise, such as the speaker coughing or clearing their throat, a car honking, the sound of something hitting the microphone, a phone buzzing, etc. |

## Disfluencies

Spontaneous speech naturally contains false starts, where only a fragment of a full word is produced. False starts were transcribed as they appeared in the recording and a hyphen was attached at the end of the word fragment (-), e.g.:

> His name is Jo- Jona- Jonathan.

Repeated words were also faithfully transcribed, e.g.:

> And then I went to the the the bed- the bedroom

## License

This corpus is released under CC-BY-4.0.

## Citation

If you make use of this dataset in your work, please cite:

```bibtex
@misc{omnilingualasr2025,
  title={{Omnilingual ASR}: Open-Source Multilingual Speech Recognition for 1600+ Languages},
  author={{Omnilingual ASR Team} and Keren, Gil and Kozhevnikov, Artyom and Meng, Yen and Ropers, Christophe and Setzler, Matthew and Wang, Skyler and Adebara, Ife and Auli, Michael and Balioglu, Can and Chan, Kevin and Cheng, Chierh and Chuang, Joe and Droof, Caley and Duppenthaler, Mark and Duquenne, Paul-Ambroise and Erben, Alexander and Gao, Cynthia and Mejia Gonzalez, Gabriel and Lyu, Kehan and Miglani, Sagar and Pratap, Vineel and Sadagopan, Kaushik Ram and Saleem, Safiyyah and Turkatenko, Arina and Ventayol-Boada, Albert and Yong, Zheng-Xin and Chung, Yu-An and Maillard, Jean and Moritz, Rashel and Mourachko, Alexandre and Williamson, Mary and Yates, Shireen},
  year={2025},
  eprint={2511.09690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.09690},
}
```
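
## Example: working with the transcription conventions

The conventions described under Language codes, Special tags and Disfluencies can be handled with a few lines of plain Python. The following is a minimal sketch, not part of any official tooling for this corpus: the function names (`split_language_code`, `strip_special_tags`, `find_false_starts`) are hypothetical, and it assumes that tags appear literally as listed in the table and that false starts are whitespace-delimited tokens ending in a hyphen.

```python
import re

# The four tags listed in the "Special tags" table of this card.
SPECIAL_TAGS = ("<laugh>", "<hesitation>", "<unintelligible>", "<noise>")
_TAG_RE = re.compile("|".join(re.escape(tag) for tag in SPECIAL_TAGS))


def split_language_code(code: str) -> tuple[str, str]:
    """Split a `{lang}_{script}` code into its ISO 639-3 and ISO 15924 parts."""
    lang, sep, script = code.partition("_")
    # Only the shape is checked (3 letters + 4 letters), not code validity.
    if not sep or len(lang) != 3 or len(script) != 4:
        raise ValueError(f"not a {{lang}}_{{script}} code: {code!r}")
    return lang, script


def strip_special_tags(text: str) -> str:
    """Remove the special tags and collapse the whitespace they leave behind."""
    return " ".join(_TAG_RE.sub(" ", text).split())


def find_false_starts(text: str) -> list[str]:
    """Return tokens that end in a hyphen, i.e. transcribed word fragments."""
    return [tok for tok in text.split() if len(tok) > 1 and tok.endswith("-")]


if __name__ == "__main__":
    print(split_language_code("lij_Latn"))
    print(strip_special_tags("<laugh> I think <hesitation> yes <noise>"))
    print(find_false_starts("His name is Jo- Jona- Jonathan."))
```

Word-internal hyphens, such as `co-o` in the sample record, are left untouched because only a trailing hyphen is treated as a false-start marker. A fragment immediately followed by punctuation would be missed by this simple check.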

Provider: maas
Created: 2025-11-11