
FlorianD/metavoice

Source: Hugging Face. Updated 2023-12-06. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/FlorianD/metavoice
Description:
# Data Engineer: Take home project

## Introduction

The goal of this project is to evaluate your knowledge and skills in the design and implementation of a scalable data pre-processing pipeline.

## Problem statement

The pipeline should:

- Read audio data being populated by the Metavoice product, `Studio`, into a CloudFlare R2 bucket
- Run two data transformation steps on the audio files:
  - Transcription: use [Whisper](https://github.com/openai/whisper)
  - Tokenisation: use mock code [here](https://gist.github.com/sidroopdaska/364e9f493d8dd9584eb9e1e9cae5715c)
- Store the results using the example schema below.

```
<id - relative path of audio file>, <transcription>, <token array>
```

## Requirements

- Install `ffmpeg` by following the instructions [here](https://www.hostinger.com/tutorials/how-to-install-ffmpeg)
- Use pipenv to install the required packages:

```
pipenv install
```

- Go to the directory containing `main.py` and run:

```
python main.py
```

## Notes

For scalability, I read each audio file with a given `chunk_size` and preprocess it in chunks. This avoids memory issues when dealing with large audio files.

The script fails after running for a while (probably on an audio file it cannot decode), showing:

```
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
```

I think there is a better solution. I did not implement it for lack of time, and I am not 100% sure it is feasible. It would be to:

- Create a HuggingFace [loading-script](https://huggingface.co/docs/datasets/audio_dataset#loading-script)
- Use the HF Dataset API to load the audio files and preprocess them
- For the Whisper model, use the functions HF provides to [preprocess](https://huggingface.co/learn/audio-course/chapter1/preprocessing) the audio:

```
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
...
```
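One way to keep a run from stopping on a single undecodable file, as described in the notes, is to catch the decode error per file and record the failure instead of letting it propagate. The following is a minimal sketch under stated assumptions, not the project's `main.py`: `process_all` and `process_one` are hypothetical names, and the exception type is passed in as a parameter so the example runs without pydub installed. With pydub, the tuple passed as `skip_on` would presumably contain `pydub.exceptions.CouldntDecodeError`.

```python
from typing import Any, Callable, Iterable, List, Tuple, Type


def process_all(
    paths: Iterable[str],
    process_one: Callable[[str], Any],
    skip_on: Tuple[Type[BaseException], ...],
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, str]]]:
    """Apply process_one to every path; skip files that raise one of skip_on.

    Returns (results, failures), where failures pairs each skipped path
    with the error message, so bad files can be inspected later.
    """
    results: List[Tuple[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            results.append((path, process_one(path)))
        except skip_on as exc:
            # Record the failure and continue with the next file.
            failures.append((path, str(exc)))
    return results, failures


if __name__ == "__main__":
    # Hypothetical stand-ins for the real decoder and its error type.
    class FakeDecodeError(Exception):
        pass

    def fake_transcribe(path: str) -> str:
        if path.endswith(".bad"):
            raise FakeDecodeError("Decoding failed")
        return "transcript of " + path

    ok, failed = process_all(
        ["a.wav", "b.bad", "c.wav"], fake_transcribe, (FakeDecodeError,)
    )
    print(ok)
    print(failed)
```

Only the listed exception types are skipped, so unexpected errors still surface rather than being silently swallowed.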
Provider:
FlorianD
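Summary of original information

Dataset overview

Problem statement

  • Read the audio data that the Metavoice product Studio populates into a CloudFlare R2 bucket.
  • Run two data transformation steps on the audio files:
    • Transcription: use Whisper
    • Tokenisation: use the mock code at https://gist.github.com/sidroopdaska/364e9f493d8dd9584eb9e1e9cae5715c
  • Store the results using the following example schema: <id - relative path of audio file>, <transcription>, <token array>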
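As an illustration of the example schema, the sketch below writes one row per audio file as CSV, with the token array serialised as JSON so that its commas do not collide with the column separator. This is an assumption-laden sketch: the source does not say which storage format was used, the names `write_records` and `read_records` are hypothetical, and the tokens are assumed to be integers.

```python
import csv
import io
import json
from typing import Iterable, List, TextIO, Tuple

# (id, transcription, token array); integer tokens are an assumption.
Record = Tuple[str, str, List[int]]


def write_records(records: Iterable[Record], out: TextIO) -> None:
    """Write rows as: <id>, <transcription>, <token array as JSON>."""
    writer = csv.writer(out)
    for audio_id, transcription, tokens in records:
        writer.writerow([audio_id, transcription, json.dumps(tokens)])


def read_records(src: TextIO) -> List[Record]:
    """Read rows back, decoding the JSON token array."""
    return [(row[0], row[1], json.loads(row[2])) for row in csv.reader(src)]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_records([("audio/a.wav", "hello, world", [1, 2, 3])], buffer)
    buffer.seek(0)
    print(read_records(buffer))
```

The `csv` module quotes any field that contains a comma, so both the transcription and the JSON-encoded tokens survive a round trip.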

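Requirements

  • Install ffmpeg by following the instructions at https://www.hostinger.com/tutorials/how-to-install-ffmpeg
  • Use pipenv to install the required packages: pipenv install
  • Go to the directory containing main.py and run: python main.py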

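Notes

  • For scalability, audio files are read with a given chunk_size and preprocessed chunk by chunk, to avoid memory problems when handling large audio files.
  • The script fails after running for a while (probably on an audio file it cannot decode), showing the error: pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
  • Suggested solution:
    • Create a HuggingFace loading script
    • Use the HF Dataset API to load and preprocess the audio files.
    • For the Whisper model, HF provides useful preprocessing functions:
        from transformers import WhisperFeatureExtractor
        feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
        ...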
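The chunked preprocessing mentioned in the notes can be sketched as span arithmetic over the duration of a file. This is a hedged illustration only: `chunk_spans` is a hypothetical helper, the unit (milliseconds) is an assumption, and the original script's actual chunking logic is not shown in the source. With pydub, an `AudioSegment` can be sliced by milliseconds, so each `(start, end)` pair could be used as `audio[start:end]`.

```python
from typing import Iterator, Tuple


def chunk_spans(total_ms: int, chunk_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans covering [0, total_ms) in chunk_ms steps.

    The last span is shorter when total_ms is not a multiple of chunk_ms.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, total_ms, chunk_ms):
        yield start, min(start + chunk_ms, total_ms)


if __name__ == "__main__":
    # A 2.5 second file processed in 1 second chunks.
    for start, end in chunk_spans(2500, 1000):
        print(start, end)
```

Each chunk can then be transcribed and tokenised on its own, with the per-chunk results concatenated in order.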