mobiuslabsgmbh/youtube-commons-asr-eval

Name: mobiuslabsgmbh/youtube-commons-asr-eval
Creator: mobiuslabsgmbh
Published: 2024-05-02 13:51:51
License: cc-by-4.0

Hugging Face · Updated 2024-05-02 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/mobiuslabsgmbh/youtube-commons-asr-eval

Resource description:

---
license: cc-by-4.0
language:
- en
---

# Dataset Card for youtube-commons-asr-eval

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)

### Dataset Summary

This evaluation dataset is created from a subset of Youtube-Commons [PleIAs/YouTube-Commons] by selecting English YouTube videos and their corresponding English subtitles.

### Supported Tasks and Leaderboards

This dataset is primarily useful for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.

### Languages

This subset is for English language evaluations.

## Dataset Structure

The dataset consists of 94 video links, transcriptions, and normalized transcriptions (around 38 hours) of age-appropriate audio with a minimum word count of 300. With a normal speaking rate of 2.5 words per second, this corresponds to a minimum duration of 2 minutes. The minimum file duration is 128 seconds and the maximum is 02:08 hours. The average duration per file is a little over 24 minutes and the standard deviation is 25 minutes. The notable variability in audio duration, as indicated by the standard deviation, mirrors typical real-time environments.

### Data Fields

Each row in the JSON file has the following fields:

- `link`: link to the YouTube video
- `text`: transcription
- `norm_text`: normalized transcription
- `duration`: duration of the video

### Data Splits

Evaluation data.

## Dataset Creation

Normalization is done via EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].

The dataset is created by selecting the first 100 files from Youtube-Commons with a minimum of 300 transcription words and age-appropriate content. Three files were manually removed because visual inspection showed a high number of transcription errors, which was also confirmed by high WER on different ASR implementations.

### Licensing Information

All the transcripts are part of a video shared under a CC-BY license on YouTube. All the licensing terms are the same as the original dataset [PleIAs/YouTube-Commons].
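The word-count rule under Dataset Creation and the minimum-duration estimate under Dataset Structure can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, words are counted by plain whitespace splitting (the card does not say how words were counted), and the age-appropriateness check and the manual review are not modeled.

```python
MIN_WORDS = 300          # minimum transcription word count stated in the card
WORDS_PER_SECOND = 2.5   # speaking rate assumed in the card


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumption: the card does not define 'word')."""
    return len(text.split())


def implied_min_duration_seconds(min_words: int = MIN_WORDS,
                                 rate: float = WORDS_PER_SECOND) -> float:
    """Duration implied by a word count at a fixed speaking rate."""
    return min_words / rate


def select_candidates(transcripts, limit=100, min_words=MIN_WORDS):
    """Return the first `limit` transcripts that have at least `min_words` words."""
    selected = []
    for text in transcripts:
        if word_count(text) >= min_words:
            selected.append(text)
            if len(selected) == limit:
                break
    return selected


# 300 words at 2.5 words per second gives 120 seconds, i.e. the 2 minutes quoted in the card.
print(implied_min_duration_seconds())
```

The estimate is a lower bound derived from the word count alone; the card reports an actual minimum file duration of 128 seconds.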

Provider:

mobiuslabsgmbh

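Summary of original information

Dataset overview

Dataset description

Dataset summary

This evaluation dataset was created from a subset of Youtube-Commons by selecting English YouTube videos and their corresponding English subtitles.

Supported tasks and leaderboards

This dataset is mainly intended for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.

Languages

This subset is for English-language evaluation.

Dataset structure

Data fields

Each row of the JSON file contains the following fields: link (link to the YouTube video), text (transcription), norm_text (normalized transcription), and duration (duration of the video).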
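A minimal sketch of reading such rows follows. It assumes the file is in JSON Lines form (one JSON object per line) and that `duration` is a number of seconds; the dataset card states neither, so both are assumptions. The `load_rows` helper and the sample rows are hypothetical and are not taken from the dataset.

```python
import json
import statistics

REQUIRED_FIELDS = ("link", "text", "norm_text", "duration")


def load_rows(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text and check that every row has the four documented fields."""
    rows = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            raise ValueError(f"line {line_number} is missing fields: {missing}")
        rows.append(row)
    return rows


# Hypothetical sample rows, for illustration only.
SAMPLE = "\n".join([
    json.dumps({"link": "https://www.youtube.com/watch?v=EXAMPLE_1",
                "text": "Hello, world!", "norm_text": "hello world",
                "duration": 128.0}),
    json.dumps({"link": "https://www.youtube.com/watch?v=EXAMPLE_2",
                "text": "Good morning.", "norm_text": "good morning",
                "duration": 1440.0}),
])

rows = load_rows(SAMPLE)
durations = [row["duration"] for row in rows]
print(len(rows), statistics.mean(durations), statistics.stdev(durations))
```

The same mean and standard deviation calls could be used to reproduce the duration statistics quoted in the card once the real file is loaded.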

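Data splits

The dataset consists of evaluation data.

Dataset creation

The dataset was created by selecting the first 100 files from Youtube-Commons that have at least 300 transcription words and age-appropriate content. Three of these files were manually removed because of high transcription error rates.
Normalization is done with EnglishTextNormalizer from open_asr_eval.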
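
To show how normalized transcripts feed into a WER score, here is a minimal sketch. `simple_normalize` is a deliberately simplified stand-in (lowercasing and punctuation stripping), not the EnglishTextNormalizer itself, which handles many more cases. `word_error_rate` is a plain word-level edit distance divided by the reference length. Both helper names are hypothetical.

```python
import re


def simple_normalize(text: str) -> str:
    """Simplified stand-in for a text normalizer: lowercase and drop punctuation."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # keep letters, digits, apostrophes
    return " ".join(text.split())              # collapse repeated whitespace


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = simple_normalize("The cat sat on the mat.")
hypothesis = simple_normalize("the cat sit on mat")
print(word_error_rate(reference, hypothesis))
```

For an actual evaluation, the leaderboard's own normalizer should be applied to both reference and hypothesis so that scores are comparable with published results.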

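Licensing information

All transcripts belong to videos shared on YouTube under a CC-BY license. All licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].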
