allenai/OLMoASR-Pool

Name: allenai/OLMoASR-Pool
Creator: allenai
Published: 2026-03-20 21:56:29
License: No description available

Hugging Face: updated 2026-03-20, indexed 2025-10-25

Download link:

https://hf-mirror.com/datasets/allenai/OLMoASR-Pool

Description:

OLMoASR-Pool is a web-scale audio-text dataset collected from the public internet, consisting of approximately 3 million hours of audio and 17 million transcripts. It covers a variety of speaking styles, accents, and audio settings, including news segments, podcasts, outdoor and crowd settings, speeches, commentary, and interviews. The dataset is multilingual, but an English-only subset can be obtained through audio-text language alignment. It is suitable for training speech recognition models and for research on conversational data, audio understanding, speaker diarization, and voice detection, among other areas.

Provided by:

allenai
