
youtube-commons-asr-eval

ModelScope community · Updated 2025-12-05 · Indexed 2025-02-01
Download link:
https://modelscope.cn/datasets/mobiuslabsgmbh/youtube-commons-asr-eval
Description:
# Dataset Card for youtube-commons-asr-eval

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)

### Dataset Summary

This evaluation dataset is a subset of YouTube-Commons [PleIAs/YouTube-Commons], created by selecting English YouTube videos and their corresponding English subtitles.

### Supported Tasks and Leaderboards

This dataset is primarily useful for automatic speech recognition (ASR) evaluation, for example with hf-audio/open_asr_leaderboard.

### Languages

This subset is intended for English-language evaluation.

## Dataset Structure

The dataset consists of 94 video links with their transcriptions and normalized transcriptions, covering around 38 hours of age-appropriate audio. Each transcription has a minimum word count of 300; at a normal speaking rate of 2.5 words per second, this corresponds to a minimum duration of 2 minutes. The shortest file in the dataset is 128 seconds long and the longest is 2 hours 8 minutes. The average duration per file is a little over 24 minutes, with a standard deviation of 25 minutes. This notable variability in audio duration mirrors typical real-time environments.

### Data Fields

Each row in the JSON file has the following fields:

- `link`: link to the YouTube video
- `text`: transcription
- `norm_text`: normalized transcription
- `duration`: duration of the video
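The row layout and the duration figures above can be checked with a few lines of Python. The following is a minimal sketch, not the dataset authors' tooling: it assumes the file is stored as JSON Lines (one JSON object per line) and that `duration` is a number of seconds, neither of which the card states. The sample rows, placeholder links, and helper names (`parse_rows`, `summarize_durations`, `min_duration_seconds`) are hypothetical.

```python
import json
import statistics

# Field names taken from the dataset card.
REQUIRED_FIELDS = {"link", "text", "norm_text", "duration"}

# Made-up sample rows standing in for the real file (assumption: JSON Lines,
# `duration` in seconds). The links are placeholders, not real videos.
SAMPLE_JSONL = "\n".join([
    json.dumps({"link": "https://www.youtube.com/watch?v=PLACEHOLDER_1",
                "text": "Hello, world!", "norm_text": "hello world",
                "duration": 128.0}),
    json.dumps({"link": "https://www.youtube.com/watch?v=PLACEHOLDER_2",
                "text": "Good morning.", "norm_text": "good morning",
                "duration": 7680.0}),
])


def parse_rows(jsonl_text):
    """Parse JSON Lines text and check that every row has the card's fields."""
    rows = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            raise ValueError(f"row {line_number} is missing {sorted(missing)}")
        rows.append(row)
    return rows


def summarize_durations(rows):
    """Return count, total hours, min/max seconds, and mean/stdev in minutes."""
    durations = [float(row["duration"]) for row in rows]
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {
        "count": len(durations),
        "total_hours": sum(durations) / 3600,
        "min_seconds": min(durations),
        "max_seconds": max(durations),
        "mean_minutes": statistics.mean(durations) / 60,
        "stdev_minutes": stdev / 60,
    }


def min_duration_seconds(min_words=300, words_per_second=2.5):
    """Minimum duration implied by a word-count floor at a given speaking rate."""
    return min_words / words_per_second


if __name__ == "__main__":
    print(summarize_durations(parse_rows(SAMPLE_JSONL)))
    print(min_duration_seconds())
```

With the card's own numbers, `min_duration_seconds(300, 2.5)` gives the 2-minute floor quoted above. Running `summarize_durations` on the real rows would be one way to confirm the reported total, mean, and standard deviation.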
### Data Splits

The dataset contains evaluation data only.

## Dataset Creation

Normalization is done with EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].

The dataset is created by selecting the first 100 files from YouTube-Commons that have a minimum of 300 transcription words and age-appropriate content. Three files were manually removed owing to high transcription error rates observed during visual inspection, which were also confirmed by high word error rates (WER) across different ASR implementations.

### Licensing Information

All transcripts come from videos shared under a CC-BY license on YouTube. The licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].
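The WER check mentioned under Dataset Creation can be illustrated with a small, self-contained sketch. It is not the procedure the authors used. `simple_normalize` is a crude, hypothetical stand-in and is not equivalent to EnglishTextNormalizer, and `word_error_rate` is a plain word-level edit-distance implementation written for illustration.

```python
import re


def simple_normalize(text):
    """Crude stand-in normalizer: lowercase, drop punctuation, collapse spaces.

    Assumption for illustration only; NOT equivalent to EnglishTextNormalizer.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    reference = simple_normalize("The cat sat on the mat.")
    hypothesis = simple_normalize("the cat sat on mat")
    print(word_error_rate(reference, hypothesis))
```

In practice, the reference would be a row's `norm_text` and the hypothesis an ASR system's output passed through the same normalizer. A file whose WER stays high across several ASR systems is a candidate for the kind of manual review described above.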

Provider:
maas
Created:
2025-01-25