mesolitica/pseudolabel-malaysian-youtube-whisper-large-v3
Source: Hugging Face. Updated 2024-01-01. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mesolitica/pseudolabel-malaysian-youtube-whisper-large-v3
Resource description:
---
language:
- ms
task_categories:
- automatic-speech-recognition
---
# Pseudolabel Malaysian YouTube videos using Whisper Large V3
The original dataset is at https://huggingface.co/datasets/malaysia-ai/crawl-youtube. Pseudolabelling was distributed across 4x A100 GPUs using the
script at https://github.com/mesolitica/malaysian-dataset/tree/master/speech-to-text-semisupervised/pseudolabel-whisper
1. Each audio clip is 30 seconds long.
2. Each audio clip is saved at a 16 kHz sample rate.
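
The two format constraints above imply a fixed clip length in samples: 30 s × 16,000 samples/s = 480,000 samples per mono clip. The sketch below is a minimal illustration of that arithmetic, not code from the dataset's repository. The helper names `pad_or_trim` and `duration_seconds` are hypothetical, and zero-padding of short clips is an assumption, since the card does not say how clips shorter than 30 seconds are handled.

```python
SAMPLE_RATE = 16_000   # Hz, the "16k sample rate" stated in the card
CLIP_SECONDS = 30      # clip length stated in the card
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS  # 480,000 samples per mono clip


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Return the duration in seconds of a mono clip with `num_samples` samples."""
    return num_samples / sample_rate


def pad_or_trim(samples, target=CLIP_SAMPLES):
    """Return a new list with exactly `target` samples.

    Hypothetical helper: input longer than `target` is truncated, and
    shorter input is zero-padded at the end (an assumption, not something
    the dataset card specifies).
    """
    out = list(samples[:target])
    out.extend([0.0] * (target - len(out)))
    return out
```

A quick check such as `duration_seconds(len(clip)) == 30.0` can flag clips that do not match the stated format before they are fed to a model.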
Provider:
mesolitica
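Summary of original information
Dataset overview
Language
- Malay (ms)
Task category
- Automatic speech recognition (automatic-speech-recognition)
Dataset description
- The dataset is drawn from Malaysian YouTube videos and pseudolabelled with the Whisper Large V3 model.
- Original dataset: malaysia-ai/crawl-youtube
- Pseudolabelling script: mesolitica/malaysian-dataset
Data format
- Each audio clip is 30 seconds long.
- Each audio clip is saved at a 16 kHz sample rate.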



