youtube-commons-asr-eval

Name: youtube-commons-asr-eval
Creator: maas
Published: 2025-12-05 16:21:40
License: No description provided

ModelScope community. Updated 2025-12-05. Indexed 2025-02-01.

Download link:

https://modelscope.cn/datasets/mobiuslabsgmbh/youtube-commons-asr-eval

Resource description:

# Dataset Card for youtube-commons-asr-eval

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)

### Dataset Summary

This evaluation dataset is created from a subset of YouTube-Commons [PleIAs/YouTube-Commons] by selecting English YouTube videos and their corresponding English subtitles.

### Supported Tasks and Leaderboards

This dataset is primarily useful for automatic speech recognition (ASR) evaluation, for example on hf-audio/open_asr_leaderboard.

### Languages

This subset is for English-language evaluation.

## Dataset Structure

The dataset consists of 94 video links, transcriptions, and normalized transcriptions (around 38 hours) of age-appropriate audio, each with a minimum word count of 300. At a normal speaking rate of 2.5 words per second, 300 words corresponds to a minimum duration of 2 minutes (300 / 2.5 = 120 seconds). The shortest file in the dataset is 128 seconds long and the longest is 2 hours 8 minutes. The average duration per file is a little over 24 minutes, with a standard deviation of 25 minutes. This wide variability in audio duration mirrors typical real-world conditions.

### Data Fields

Each row in the JSON file has the following fields:

- link: link to the YouTube video
- text: transcription
- norm_text: normalized transcription
- duration: duration of the video
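As a minimal sketch of how rows with these four fields might be read and checked, the Python below parses a small in-memory sample. Several things here are assumptions rather than facts from the card: that the file is stored one JSON object per line, that duration is expressed in seconds, and the names `SAMPLE_JSONL`, `parse_rows`, and `min_duration_seconds`. The sample rows and their values are invented for illustration and are not taken from the dataset.

```python
import json

# Figures quoted in the dataset card.
WORDS_PER_SECOND = 2.5
MIN_WORDS = 300

# Field names listed in the card.
REQUIRED_FIELDS = {"link", "text", "norm_text", "duration"}

# Hypothetical sample rows (invented values, placeholder links).
# Assumption: one JSON object per line, duration in seconds.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "link": "https://www.youtube.com/watch?v=EXAMPLE_1",
        "text": "Hello, world! " * 200,
        "norm_text": "hello world " * 200,
        "duration": 180.0,
    }),
    json.dumps({
        "link": "https://www.youtube.com/watch?v=EXAMPLE_2",
        "text": "Good morning, everyone. " * 150,
        "norm_text": "good morning everyone " * 150,
        "duration": 240.0,
    }),
])


def parse_rows(jsonl_text):
    """Parse JSON Lines text and verify each row has the four card fields."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            raise ValueError(f"row is missing fields: {sorted(missing)}")
        rows.append(row)
    return rows


def min_duration_seconds(word_count, words_per_second=WORDS_PER_SECOND):
    """Shortest plausible duration for a transcript at a given speaking rate."""
    return word_count / words_per_second


rows = parse_rows(SAMPLE_JSONL)
for row in rows:
    n_words = len(row["norm_text"].split())
    print(row["link"], n_words, "words,", row["duration"], "s")
```

With the card's numbers, `min_duration_seconds(MIN_WORDS)` evaluates 300 / 2.5, which is the 2-minute lower bound quoted in the Dataset Structure section.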

### Data Splits

The dataset contains evaluation data only.

## Dataset Creation

Normalization is done with EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].

The dataset is created by selecting the first 100 files from YouTube-Commons that have at least 300 transcription words and age-appropriate content. Three files were manually removed because visual inspection showed a high number of transcription errors, which was confirmed by high WER across different ASR implementations.

### Licensing Information

All transcripts belong to videos shared under a CC-BY license on YouTube. The licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].
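
The WER check mentioned under Dataset Creation can be illustrated with a short, dependency-free sketch. It computes word error rate as the word-level edit distance divided by the number of reference words. The function `simple_normalize` is a deliberately simplified stand-in (lowercasing and punctuation stripping) and is not the EnglishTextNormalizer referenced above; both function names are hypothetical, and this is not necessarily how the dataset authors computed WER.

```python
import re


def simple_normalize(text):
    """Simplified stand-in normalizer (NOT EnglishTextNormalizer):
    lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = simple_normalize("The cat sat on the mat.")
hypothesis = simple_normalize("the cat sit on mat")
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In an actual evaluation, the reference would be the norm_text field of a row and the hypothesis would be an ASR system's output passed through the same normalizer, so that casing and punctuation differences do not count as errors.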

Provider:

maas

Created:

2025-01-25

Collection summary

Dataset introduction