
allenai/OLMoASR-Mix

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/allenai/OLMoASR-Mix
Description:
---
license: odc-by
---

OLMoASR-Mix is the curated version of OLMoASR-Pool, a web-scale audio-text dataset collected from the public internet. The dataset consists of approximately 1M hours of audio.

Using OLMoASR-Mix, curated from OLMoASR-Pool, we trained OLMoASR 💬🎙️, a series of English speech recognition models, and observed strong generalization and robust capabilities!

# Content

The dataset spans approximately 1M hours of audio. It covers a variety of speaking styles, accents and audio setups such as news segments 📰, podcasts 🎙️, outdoors 🌳🏙️, crowds 🧑‍🤝‍🧑, speeches 🎤, commentary 🗣️, interviews 🤳 and more!

OLMoASR-Mix is English-only, as it has been curated for training English speech recognition models.

# Usage

1. Download from HuggingFace
   - Retrieve an HF access token to gain access to the dataset.
   - Run `pip install huggingface_hub[cli]`
   - Run `huggingface-cli login` in your CLI and paste the HF access token to log in
   - Use the code below to access the IDs

```
from datasets import load_dataset

# Stream the dataset so the IDs are fetched lazily instead of downloaded up front
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=True)
print(dataset)  # features: ['id']
print(next(iter(dataset['train'])))
```

   - If you're downloading all the IDs, you can run the code below

```
from datasets import load_dataset

# Replace with the directory where you want to download the IDs to
cache_dir = "path/to/cache"
dataset = load_dataset("allenai/OLMoASR-Mix", streaming=False, cache_dir=cache_dir)
```

2. Download the audio and transcript files from ID information.
3. Preprocess the audio and transcript files.

For steps 2 and 3, follow the instructions at the OLMoASR repo.

# Uses

The collection was used to train a speech recognition model, but it can also be used in research areas such as conversational data, audio understanding, speaker diarization, voice detection and more.

# License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

# Reference

```
@misc{ngo2025olmoasropenmodelsdata,
  title={OLMoASR: Open Models and Data for Training Robust Speech Recognition Models},
  author={Huong Ngo and Matt Deitke and Martijn Bartelds and Sarah Pratt and Josh Gardner and Matt Jordan and Ludwig Schmidt},
  year={2025},
  eprint={2508.20869},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2508.20869},
}
```

# Contact

If you have any questions regarding the dataset, please contact Huong Ngo at zoengo2002@gmail.com.
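As a supplement to the Usage section, the sketch below shows one way to peek at a handful of IDs from the streamed dataset without iterating over the whole split. It assumes, as the card states, that each record exposes a single `id` feature and that a `train` split exists. The helper names `take_ids` and `stream_train_ids` are hypothetical and not part of the OLMoASR tooling.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping


def take_ids(records: Iterable[Mapping[str, str]], n: int) -> list[str]:
    """Return the 'id' field of the first n records (hypothetical helper).

    islice stops after n items, so a streamed dataset is not read to the end.
    """
    return [record["id"] for record in islice(records, n)]


def stream_train_ids() -> Iterator[Mapping[str, str]]:
    """Lazily iterate over the train split of allenai/OLMoASR-Mix.

    Requires network access and a prior `huggingface-cli login`.
    """
    # Imported here so take_ids stays usable without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("allenai/OLMoASR-Mix", streaming=True)
    return iter(dataset["train"])
```

Calling `take_ids(stream_train_ids(), 5)` would fetch only the first five IDs. Because `take_ids` accepts any iterable of mappings, it can also be tried offline against a small hand-made list before touching the real dataset.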

Provider: allenai