
mobiuslabsgmbh/youtube-commons-asr-eval

Hugging Face | Updated 2024-05-02 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/mobiuslabsgmbh/youtube-commons-asr-eval
Resource description:
---
license: cc-by-4.0
language:
- en
---

# Dataset Card for youtube-commons-asr-eval

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)

### Dataset Summary

This evaluation dataset is created from a subset of YouTube-Commons [PleIAs/YouTube-Commons] by selecting English YouTube videos and their corresponding English subtitles.

### Supported Tasks and Leaderboards

This dataset is primarily useful for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.

### Languages

This subset is for English-language evaluation.

## Dataset Structure

The dataset consists of 94 video links with transcriptions and normalized transcriptions (around 38 hours of age-appropriate audio), each with a minimum word count of 300. At a normal speaking rate of 2.5 words per second, this corresponds to a minimum duration of 2 minutes (300 / 2.5 = 120 seconds). The shortest file is 128 seconds and the longest is 02:08 hours. The average duration per file is a little over 24 minutes, with a standard deviation of 25 minutes. This notable variability in audio duration mirrors typical real-time environments.

### Data Fields

Each row in the JSON file has `link` (link to the YouTube video), `text` (transcription), `norm_text` (normalized transcription) and `duration` (duration of the video) fields.

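The fields described above could be read and summarized roughly as follows. This is a minimal sketch, not the dataset authors' tooling: it assumes the file is stored as JSON Lines (one object per row) and that `duration` is a number of seconds, neither of which the card confirms. The sample rows, their links, and the helper names `parse_rows`, `summarize_durations`, and `min_duration_seconds` are hypothetical.

```python
import json
import statistics

# Hypothetical sample rows: placeholder links and values, not real dataset entries.
SAMPLE_JSONL = """\
{"link": "https://www.youtube.com/watch?v=EXAMPLE_1", "text": "Hello, world!", "norm_text": "hello world", "duration": 128.0}
{"link": "https://www.youtube.com/watch?v=EXAMPLE_2", "text": "Second clip.", "norm_text": "second clip", "duration": 1500.0}
"""

# Field names taken from the Data Fields description.
REQUIRED_FIELDS = ("link", "text", "norm_text", "duration")


def parse_rows(jsonl_text):
    """Parse JSON Lines text and check that every row has the four fields."""
    rows = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"row {line_number} is missing fields: {missing}")
        rows.append(row)
    return rows


def summarize_durations(rows):
    """Summarize durations, assuming `duration` is expressed in seconds."""
    durations = [float(row["duration"]) for row in rows]
    return {
        "count": len(durations),
        "total_hours": sum(durations) / 3600.0,
        "min_seconds": min(durations),
        "max_seconds": max(durations),
        "mean_minutes": statistics.mean(durations) / 60.0,
    }


def min_duration_seconds(min_words=300, words_per_second=2.5):
    """Shortest expected duration for a transcript with `min_words` words."""
    return min_words / words_per_second


rows = parse_rows(SAMPLE_JSONL)
summary = summarize_durations(rows)
```

With the card's own figures, `min_duration_seconds()` gives 120.0 seconds, which matches the stated 2-minute minimum for 300 words at 2.5 words per second.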
### Data Splits

Evaluation data

## Dataset Creation

Normalization is done via EnglishTextNormalizer from open_asr_eval [https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py].

The dataset is created by selecting the first 100 files from YouTube-Commons that have a minimum of 300 transcription words and age-appropriate content. Three files were manually removed because visual inspection revealed many transcription errors, which was also verified by high WER on different ASR implementations.

### Licensing Information

All transcripts belong to videos shared under a CC-BY license on YouTube. All licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].
Provided by:
mobiuslabsgmbh
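Summary of original information

Dataset overview

Dataset description

Dataset summary

  • This evaluation dataset is created from a subset of YouTube-Commons by selecting English YouTube videos and their corresponding English subtitles.

Supported tasks and leaderboards

  • The dataset is primarily intended for automatic speech recognition evaluation tasks such as hf-audio/open_asr_leaderboard.

Languages

  • This subset is for English-language evaluation.

Dataset structure

Data fields

  • Each row in the JSON file contains the following fields: link (link to the YouTube video), text (transcription), norm_text (normalized transcription), and duration (duration of the video).

Data splits

  • The dataset is evaluation data.

Dataset creation

  • The dataset is created by selecting the first 100 files from YouTube-Commons that have at least 300 transcription words and age-appropriate content. Three files were manually removed because of high transcription error rates.
  • Normalization is done with EnglishTextNormalizer from open_asr_eval.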
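
The normalize-then-score step behind the WER check mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `simple_normalize` is a hypothetical stand-in that only lowercases and strips punctuation, whereas the EnglishTextNormalizer named in the card applies many more rules, and `word_error_rate` is a plain word-level edit distance written for this example.

```python
import re


def simple_normalize(text):
    """Hypothetical stand-in normalizer: lowercase, drop punctuation,
    collapse whitespace. NOT the EnglishTextNormalizer named in the card."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up reference and hypothesis strings, for illustration only.
reference = simple_normalize("Hello, world! This is a test.")
hypothesis = simple_normalize("hello world this is test")
wer = word_error_rate(reference, hypothesis)
```

Here the hypothesis drops one of the six reference words, so the sketch yields a WER of 1/6. Normalizing both sides first keeps casing and punctuation differences from being counted as recognition errors.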

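Licensing information

  • All transcripts belong to videos shared on YouTube under a CC-BY license. All licensing terms are the same as those of the original dataset [PleIAs/YouTube-Commons].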