allenai/OLMoASR-Mix

Name: allenai/OLMoASR-Mix
Creator: allenai
Published: 2026-03-23 20:15:14
License: ODC-BY

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/allenai/OLMoASR-Mix


Description:

---
license: odc-by
---

OLMoASR-Mix is the curated version of OLMoASR-Pool, a web-scale audio-text dataset collected from the public internet. The dataset consists of approximately 1M hours of audio. With OLMoASR-Mix, derived from OLMoASR-Pool, we trained OLMoASR 💬🎙️, a series of English speech recognition models, and observed strong generalization and robust capabilities!

# Content

The dataset spans approximately 1M hours of audio. It also covers a variety of speaking styles, accents and audio setups such as news segments 📰, podcasts 🎙️, outdoors 🌳🏙️, crowds 🧑‍🤝‍🧑, speeches 🎤, commentary 🗣️, interviews 🤳 and more! OLMoASR-Mix is English-only, as it has been curated for training English speech recognition models.

# Usage

Download from HuggingFace:

1. Retrieve an HF access token from your Hugging Face account settings to gain access to the dataset.
2. Run `pip install huggingface_hub[cli]`.
3. Run `huggingface-cli login` in your CLI and paste the HF access token to log in.
4. Use the code below to access the IDs:

```
from datasets import load_dataset

# Stream the dataset so that nothing is downloaded up front
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=True)
print(dataset)  # features: ['id']
print(next(iter(dataset['train'])))
```

If you're downloading all the IDs, you can run the code below:

```
from datasets import load_dataset

# Replace the placeholder path with where you want to download the IDs to
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=False, cache_dir="path/to/cache")
```

Next, download the audio and transcript files using the ID information, then preprocess the audio and transcript files. Follow the instructions at the OLMoASR repo.

# Uses

The collection was used to train a speech recognition model, but it can also be used in research areas such as conversational data, audio understanding, speaker diarization, voice detection and more.

# License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

# Reference

```
@misc{ngo2025olmoasropenmodelsdata,
  title={OLMoASR: Open Models and Data for Training Robust Speech Recognition Models},
  author={Huong Ngo and Matt Deitke and Martijn Bartelds and Sarah Pratt and Josh Gardner and Matt Jordan and Ludwig Schmidt},
  year={2025},
  eprint={2508.20869},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2508.20869},
}
```

# Contact

If you have any questions regarding the dataset, please contact Huong Ngo at zoengo2002@gmail.com.
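
Returning to the ID-access step under Usage: because the dataset exposes only an `id` feature, a download job will usually want those IDs grouped into manageable files first. The following is a minimal sketch of one way to do that, not the OLMoASR repo's own tooling. The helper names (`iter_id_batches`, `write_id_shards`, `stream_train_ids`), the `ids-00000.txt` file naming, and the batch sizes are all hypothetical choices made for illustration; only the `load_dataset` call, the `train` split, and the `id` field come from the card.

```python
from itertools import islice
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Optional


def iter_id_batches(rows: Iterable[Dict[str, str]], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` IDs from rows that each carry an 'id' field."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = [row["id"] for row in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def write_id_shards(
    rows: Iterable[Dict[str, str]],
    out_dir: str,
    batch_size: int,
    max_batches: Optional[int] = None,
) -> List[Path]:
    """Write each batch of IDs to its own text file, one ID per line (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    # islice with stop=None passes every batch through; an integer caps the number written.
    batches = islice(iter_id_batches(rows, batch_size), max_batches)
    for index, batch in enumerate(batches):
        path = out / f"ids-{index:05d}.txt"
        path.write_text("\n".join(batch) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def stream_train_ids():
    """Open the 'train' split in streaming mode (requires network access and an HF login)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset("allenai/OLMoASR-Mix", streaming=True)["train"]
```

Once logged in, a call such as `write_id_shards(stream_train_ids(), "id_shards", batch_size=10_000, max_batches=3)` should write three small ID files without pulling the whole list. Streaming keeps memory use flat, and capping `max_batches` makes it cheap to try the downstream download step on a small sample first.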

