allenai/OLMoASR-Mix
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/allenai/OLMoASR-Mix
Description:
---
license: odc-by
---
OLMoASR-Mix is the curated version of OLMoASR-Pool, a web-scale audio-text dataset collected from the public internet. The dataset consists of approximately 1M hours of audio.
We used OLMoASR-Mix, curated from OLMoASR-Pool, to train OLMoASR 💬🎙️, a series of English speech recognition models, and observed strong generalization and robustness!
# Content
The dataset spans approximately 1M hours of audio.
It also covers a variety of speaking styles, accents, and audio setups, such as news segments 📰, podcasts 🎙️, outdoors 🌳🏙️, crowds 🧑🤝🧑, speeches 🎤, commentary 🗣️, interviews 🤳 and more!
OLMoASR-Mix is English-only as it has been curated for training English speech recognition models.
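For planning purposes, the "approximately 1M hours" figure can be turned into a rough storage estimate. The sketch below assumes uncompressed 16 kHz, 16-bit, mono PCM. That format is an assumption for illustration, not something this card states, and compressed audio would take considerably less space. The helper name `estimate_pcm_terabytes` is hypothetical.

```python
def estimate_pcm_terabytes(
    hours: float,
    sample_rate_hz: int = 16_000,  # assumed sample rate, not stated in the card
    bytes_per_sample: int = 2,     # assumed 16-bit samples
    channels: int = 1,             # assumed mono audio
) -> float:
    """Estimate uncompressed PCM size in decimal terabytes (1 TB = 1e12 bytes)."""
    total_bytes = hours * 3600 * sample_rate_hz * bytes_per_sample * channels
    return total_bytes / 1e12


estimate_tb = estimate_pcm_terabytes(1_000_000)
print(f"{estimate_tb:.1f} TB")
```

Under these assumptions the estimate comes to roughly 115 TB, so it is worth checking available disk space before downloading the audio.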
# Usage
Download the ID list from Hugging Face:
1. Retrieve an HF access token from your Hugging Face account to gain access to the dataset.
2. Run `pip install "huggingface_hub[cli]"`.
3. Run `huggingface-cli login` in your terminal and paste the HF access token to log in.
4. Use the code below to access the IDs:
```
from datasets import load_dataset
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=True)
print(dataset) # features: ['id']
print(next(iter(dataset['train'])))
```
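When streaming, it can be handy to inspect a handful of IDs without iterating over the whole split. The following is a minimal sketch using `itertools.islice`. The helper `take_ids` is hypothetical, and the generator `sample_stream` is a stand-in for `dataset['train']` so the example runs offline. It assumes each record is a dict with an `id` key, as the `features: ['id']` comment above suggests.

```python
from itertools import islice
from typing import Iterable


def take_ids(records: Iterable[dict], n: int) -> list[str]:
    """Return the `id` field of the first n records, reading no further."""
    return [record["id"] for record in islice(records, n)]


# Stand-in for dataset["train"]; a lazy generator mimics a streamed split.
sample_stream = ({"id": f"example-{i}"} for i in range(1_000_000))

first_ids = take_ids(sample_stream, 3)
print(first_ids)
```

With the real dataset, `take_ids(dataset["train"], 3)` should behave the same way, since a streamed split is iterable.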
To download all the IDs instead of streaming them, run the code below:
```
from datasets import load_dataset
# Placeholder: replace with the directory you want to download the IDs to.
cache_dir = "path/to/cache"
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=False, cache_dir=cache_dir)
```
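Once the IDs are available locally, a plain-text list (one ID per line) is a convenient input for downstream download scripts. The sketch below shows one way to write such a file. The helper `write_id_list` and the file name `ids.txt` are illustrative assumptions, and a small in-memory list stands in for `dataset['train']` so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_id_list(records: Iterable[dict], out_path: Path) -> int:
    """Write each record's `id` on its own line and return the number written."""
    count = 0
    with out_path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(f"{record['id']}\n")
            count += 1
    return count


# Stand-in records; with the real dataset, pass dataset["train"] instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "ids.txt"
    written = write_id_list([{"id": "a"}, {"id": "b"}], out_path)
    saved_lines = out_path.read_text(encoding="utf-8").splitlines()

print(written, saved_lines)
```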
Next, download the audio and transcript files using the ID information.
Finally, preprocess the audio and transcript files by following the instructions in the OLMoASR repo.
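Given the size of the collection, the download step may be easier to manage if the ID list is partitioned across several workers. The sketch below shows a simple round-robin split. The function `shard_ids` is hypothetical and is not part of the OLMoASR tooling; the actual download and preprocessing commands are described in the OLMoASR repo.

```python
from typing import Sequence


def shard_ids(ids: Sequence[str], num_shards: int) -> list[list[str]]:
    """Split ids into num_shards lists by taking every num_shards-th element."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [list(ids[offset::num_shards]) for offset in range(num_shards)]


# Illustrative IDs only; real IDs come from the dataset's `id` field.
shards = shard_ids(["a", "b", "c", "d", "e"], 2)
print(shards)
```

Round-robin assignment keeps shard sizes within one element of each other, which helps balance work when each ID takes a similar amount of time to process.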
# Uses
The collection was used to train a speech recognition model, but it can also be used in research areas such as conversational data, audio understanding, speaker diarization, voice detection and more.
# License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.
# Reference
```
@misc{ngo2025olmoasropenmodelsdata,
title={OLMoASR: Open Models and Data for Training Robust Speech Recognition Models},
author={Huong Ngo and Matt Deitke and Martijn Bartelds and Sarah Pratt and Josh Gardner and Matt Jordan and Ludwig Schmidt},
year={2025},
eprint={2508.20869},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2508.20869},
}
```
# Contact
If you have any questions regarding the dataset, please contact Huong Ngo at zoengo2002@gmail.com.
Provider: allenai



