youtube-commons-asr-eval
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-02-01.
Download link:
https://modelscope.cn/datasets/mobiuslabsgmbh/youtube-commons-asr-eval
Description:
# Dataset Card for youtube-commons-asr-eval
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
### Dataset Summary
This evaluation dataset is a subset of YouTube-Commons [PleIAs/YouTube-Commons], created by selecting English YouTube videos and their corresponding English subtitles.
### Supported Tasks and Leaderboards
This dataset is primarily useful for evaluating automatic speech recognition (ASR) systems, for example on hf-audio/open_asr_leaderboard.
### Languages
This subset is intended for English-language evaluation.
## Dataset Structure
The dataset consists of 94 video links with transcriptions and normalized transcriptions, covering around 38 hours of age-appropriate audio. Each transcription contains at least 300 words; at a normal speaking rate of 2.5 words per second, this corresponds to a minimum duration of 2 minutes. The shortest audio is 128 seconds long and the longest is 2 hours 8 minutes. The average duration per file is a little over 24 minutes, with a standard deviation of 25 minutes. This wide spread in audio duration mirrors typical real-world conditions.
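The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers stated in this card; the function names are illustrative and not part of any dataset tooling.

```python
# Numbers taken from the dataset card.
MIN_WORDS = 300          # minimum transcript length
WORDS_PER_SECOND = 2.5   # speaking rate assumed in the card
NUM_FILES = 94           # number of video links
MEAN_MINUTES = 24        # "a little over 24 minutes", so treat this as a lower bound


def min_duration_seconds(min_words: int, words_per_second: float) -> float:
    """Shortest expected audio length implied by the word-count threshold."""
    return min_words / words_per_second


def total_hours(num_files: int, mean_minutes: float) -> float:
    """Approximate total audio length from the file count and mean duration."""
    return num_files * mean_minutes / 60


print(min_duration_seconds(MIN_WORDS, WORDS_PER_SECOND))  # 120.0 seconds = 2 minutes
print(round(total_hours(NUM_FILES, MEAN_MINUTES), 1))     # 37.6 hours, i.e. around 38
```

The observed minimum of 128 seconds sits just above the 120-second estimate, which is consistent with the 300-word threshold.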
### Data Fields
Each row in the JSON file has four fields: link (link to the YouTube video), text (transcription), norm_text (normalized transcription), and duration (duration of the video).
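As an illustration of this row layout, here is a minimal sketch that parses rows and checks that the four fields are present. It assumes the file stores one JSON object per line (JSON Lines) and that duration is a number of seconds; neither is confirmed by the card. The sample row is invented for demonstration and is not taken from the dataset.

```python
import json

# Field names as listed in the dataset card.
REQUIRED_FIELDS = ("link", "text", "norm_text", "duration")


def parse_rows(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text and verify that every row has the expected fields."""
    rows = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"row {line_no} is missing fields: {missing}")
        rows.append(row)
    return rows


# Hypothetical sample row (placeholder link, made-up text, duration assumed in seconds).
sample = (
    '{"link": "https://www.youtube.com/watch?v=EXAMPLE", '
    '"text": "Hello, world!", "norm_text": "hello world", "duration": 128}'
)
rows = parse_rows(sample)
print(rows[0]["norm_text"])
```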
### Data Splits
The dataset contains evaluation data only.
## Dataset Creation
Normalization is done via EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].
The dataset is created by selecting the first 100 files from YouTube-Commons that have at least 300 transcription words and age-appropriate content. Three files were removed manually because visual inspection revealed a high error rate in their transcriptions, which was confirmed by a high WER across different ASR implementations.
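To make the word-count threshold and the WER check concrete, here is a minimal sketch written from scratch. It is not the authors' pipeline: `has_min_words` and `word_error_rate` are illustrative names, word counting by whitespace splitting is an assumption, and both strings passed to `word_error_rate` are assumed to be normalized already (for example, the norm_text field and a normalized ASR output).

```python
def has_min_words(text: str, min_words: int = 300) -> bool:
    """Return True if the transcript has at least `min_words` whitespace-separated words."""
    return len(text.split()) >= min_words


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```

A transcript that scores a high WER against several independent ASR systems is more likely to be a faulty reference than a hard audio sample, which is the reasoning behind the manual removals described above.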
### Licensing Information
All transcripts belong to videos shared under a CC-BY license on YouTube. The licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].
Provider:
maas
Created:
2025-01-25



