OLMoASR-Mix
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-25.
Download link:
https://huggingface.co/datasets/allenai/OLMoASR-Mix
Overview:
OLMoASR-Mix is a web-scale audio-text dataset curated from OLMoASR-Pool, which was collected from the public Internet. It contains approximately 1 million hours of audio covering diverse speaking styles, accents, and recording settings, including news segments, podcasts, outdoor environments, crowds, speeches, commentary, and interviews. OLMoASR-Mix contains only English content and is designed for training English automatic speech recognition (ASR) models. It has been used to train the OLMoASR series of English ASR models, which show strong generalization and robustness. The dataset is applicable to research on speech recognition, conversational data, audio understanding, speaker diarization, and voice detection. It is licensed under ODC-BY, follows Ai2's responsible use guidelines, and is intended solely for research and educational use.
Provider:
Allen Institute for AI
Created:
2026-03-24



