allenai/OLMoASR-Pool
Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/allenai/OLMoASR-Pool
Overview:
OLMoASR-Pool is a web-scale audio-text dataset collected from the public internet, consisting of approximately 3 million hours of audio and 17 million transcripts. It covers a variety of speaking styles, accents, and audio settings, including news segments, podcasts, outdoor and crowd recordings, speeches, commentary, and interviews, among others. The dataset is multilingual, but an English-only subset of audio/transcript pairs can be obtained through audio-text language alignment. It is suitable for training speech recognition models and for research on conversational data, audio understanding, speaker diarization, voice detection, and related tasks.
Provider:
allenai



