mobiuslabsgmbh/youtube-commons-asr-eval
Source: Hugging Face. Updated 2024-05-02; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/mobiuslabsgmbh/youtube-commons-asr-eval
Resource description:
---
license: cc-by-4.0
language:
- en
---
# Dataset Card for youtube-commons-asr-eval
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
## Dataset Description
### Dataset Summary
This evaluation dataset was created from a subset of YouTube-Commons [PleIAs/YouTube-Commons] by selecting English YouTube videos and their corresponding English subtitles.
### Supported Tasks and Leaderboards
This dataset is primarily useful for evaluating automatic speech recognition (ASR) systems, for example on hf-audio/open_asr_leaderboard.
### Languages
This subset is for English language evaluations.
## Dataset Structure
The dataset consists of 94 video links with their transcriptions and normalized transcriptions, covering around 38 hours of age-appropriate audio. Each transcription has a minimum word count of 300; at a normal speaking rate of 2.5 words per second, this corresponds to a minimum duration of 2 minutes. The shortest file is 128 seconds long and the longest is 2 hours 8 minutes. The average duration per file is a little over 24 minutes, with a standard deviation of 25 minutes. This notable variability in audio duration mirrors typical real-world conditions.
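The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted in the paragraph above; the mean is rounded down to 24 minutes, so the total is an approximation rather than an exact value.

```python
# Figures quoted in the dataset card.
MIN_WORDS = 300           # minimum transcript length, in words
WORDS_PER_SECOND = 2.5    # speaking rate assumed by the card
NUM_FILES = 94            # number of video links
MEAN_MINUTES = 24         # "a little over 24 minutes", so this is a lower bound

# Minimum duration implied by the word-count threshold.
min_duration_s = MIN_WORDS / WORDS_PER_SECOND

# Rough total duration from file count and mean duration.
approx_total_hours = NUM_FILES * MEAN_MINUTES / 60

# Longest file, 2 hours 8 minutes, expressed in seconds.
max_duration_s = 2 * 3600 + 8 * 60

print(f"implied minimum duration: {min_duration_s:.0f} s")
print(f"approximate total duration: {approx_total_hours:.1f} h")
print(f"longest file: {max_duration_s} s")
```

The implied minimum of 120 seconds sits just below the observed minimum of 128 seconds, and the approximate total is in line with the stated figure of around 38 hours.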
### Data Fields
Each row in the JSON file has four fields: `link` (link to the YouTube video), `text` (transcription), `norm_text` (normalized transcription), and `duration` (duration of the video).
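As a minimal sketch of reading such rows, the snippet below parses JSON text and checks that each row carries the four fields. The card does not state the file name, whether the file is a JSON array or JSON Lines, or the unit and type of `duration`; the helper `load_rows`, the sample records, and the numeric durations are all assumptions made for illustration.

```python
import json

# Field names as listed in the dataset card.
REQUIRED_FIELDS = ("link", "text", "norm_text", "duration")


def load_rows(raw):
    """Parse JSON text (an array or JSON Lines) and verify the expected fields.

    Hypothetical helper: the actual file layout is not specified in the card.
    """
    raw = raw.strip()
    if raw.startswith("["):
        rows = json.loads(raw)  # a single JSON array of objects
    else:
        # One JSON object per non-empty line (JSON Lines).
        rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
    for i, row in enumerate(rows):
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            raise ValueError(f"row {i} is missing fields: {missing}")
    return rows


# Made-up sample rows; links, texts, and durations are placeholders.
sample = """
{"link": "https://www.youtube.com/watch?v=EXAMPLE_ID_1", "text": "Hello, world!", "norm_text": "hello world", "duration": 128}
{"link": "https://www.youtube.com/watch?v=EXAMPLE_ID_2", "text": "Good morning.", "norm_text": "good morning", "duration": 300}
"""

rows = load_rows(sample)
for row in rows:
    print(row["link"], len(row["norm_text"].split()), "normalized words")
```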
### Data Splits
Evaluation data
## Dataset Creation
Normalization is done with EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].

The dataset was created by selecting the first 100 files from YouTube-Commons that have at least 300 transcription words and age-appropriate content. Three files were manually removed because visual inspection showed a high rate of transcription errors, which was confirmed by high word error rate (WER) across different ASR implementations.
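To illustrate how normalized text and WER fit together, here is a minimal sketch. The function `simple_normalize` is a deliberately crude stand-in (lowercasing and punctuation stripping) and is not the EnglishTextNormalizer referenced above; `word_error_rate` is a plain word-level edit distance divided by the reference length, written here for illustration rather than taken from any particular evaluation toolkit.

```python
import re


def simple_normalize(text):
    """Crude stand-in normalizer (NOT EnglishTextNormalizer):
    lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example strings.
reference = simple_normalize("Hello, world! This is a test.")
hypothesis = simple_normalize("hello word this is test")
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

In this example the hypothesis has one substitution and one deletion against a six-word reference. A transcript whose WER stays high across several ASR systems is a plausible sign that the reference itself is faulty, which is the kind of check described above.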
## Additional Information
### Licensing Information
Every transcript belongs to a video shared under a CC-BY license on YouTube. All licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].
Provider:
mobiuslabsgmbh
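Summary of Original Information
Dataset Overview
Dataset Description
Dataset Summary
- This evaluation dataset was created from a subset of YouTube-Commons by selecting English YouTube videos and their corresponding English subtitles.
Supported Tasks and Leaderboards
- The dataset is mainly intended for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.
Languages
- This subset is for English-language evaluation.
Dataset Structure
Data Fields
- Each row in the JSON file has the following fields: link (link to the YouTube video), text (transcription), norm_text (normalized transcription), and duration (duration of the video).
Data Splits
- The dataset is evaluation data.
Dataset Creation
- The dataset was created from the first 100 files selected from YouTube-Commons, requiring at least 300 transcription words and age-appropriate content. Three of these files were manually removed because of high transcription error rates.
- Normalization is done with EnglishTextNormalizer from open_asr_eval.
Licensing Information
- All transcripts belong to videos shared on YouTube under a CC-BY license. All licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].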



