FlorianD/metavoice
Source: Hugging Face; updated 2023-12-06; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/FlorianD/metavoice
Resource description:
# Data Engineer: Take home project
## Introduction
The goal of this project is to evaluate your knowledge and skills in the design and implementation of a scalable data pre-processing pipeline.
## Problem statement
Design and implement a pipeline that:
- Reads the audio data that the Metavoice product, `Studio`, writes into a Cloudflare R2 bucket
- Runs two data transformation steps on the audio files:
  - Transcription - use [Whisper](https://github.com/openai/whisper)
  - Tokenisation - use the mock code [here](https://gist.github.com/sidroopdaska/364e9f493d8dd9584eb9e1e9cae5715c)
- Stores the results using the example schema below:
```<id - relative path of audio file>, <transcription>, <token array>```
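As an illustration of the schema above, here is a minimal sketch of how one result row could be built and serialised. The field names (`id`, `transcription`, `tokens`), the JSON Lines format, and the sample values are assumptions made for this example; the project statement only fixes the three columns.

```python
import json
from typing import List


def make_record(audio_path: str, transcription: str, tokens: List[int]) -> dict:
    """Build one output row: id (relative path), transcription, token array."""
    return {"id": audio_path, "transcription": transcription, "tokens": tokens}


def to_jsonl_line(record: dict) -> str:
    """Serialise a record as one JSON Lines row (non-ASCII text kept as-is)."""
    return json.dumps(record, ensure_ascii=False)


# Hypothetical values standing in for the Whisper and tokeniser outputs.
record = make_record("speaker_01/clip_0001.wav", "hello world", [101, 7592, 102])
line = to_jsonl_line(record)
```

Writing one row per line keeps the output appendable, so results can be flushed file by file instead of being held in memory until the end of the run.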
## Requirements
- Install `ffmpeg` by following instructions [here](https://www.hostinger.com/tutorials/how-to-install-ffmpeg)
- Use pipenv to install the required packages:
```pipenv install```
- Change to the directory containing `main.py` and run:
```python main.py```
## Notes
For scalability, I read each audio file with a given `chunk_size` and preprocess it chunk by chunk.
This avoids memory issues when dealing with large audio files.
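The chunked-reading idea can be sketched with the standard-library `wave` module. This is only an illustration under the assumption of uncompressed WAV input; the actual script appears to rely on pydub and ffmpeg, and the helper names `iter_wav_chunks` and `make_silent_wav` are hypothetical.

```python
import io
import wave
from typing import BinaryIO, Iterator, Union


def iter_wav_chunks(source: Union[str, BinaryIO], chunk_frames: int) -> Iterator[bytes]:
    """Yield raw PCM bytes from a WAV file, at most `chunk_frames` frames at a time."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    with wave.open(source, "rb") as wav:
        while True:
            frames = wav.readframes(chunk_frames)
            if not frames:
                break
            yield frames


def make_silent_wav(n_frames: int, sample_rate: int = 16000) -> io.BytesIO:
    """Build a small in-memory mono 16-bit WAV of silence for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


# 2500 mono 16-bit frames read 1000 frames at a time: two full chunks, one partial.
chunk_sizes = [len(c) for c in iter_wav_chunks(make_silent_wav(2500), chunk_frames=1000)]
```

Because the generator holds only one chunk at a time, peak memory depends on `chunk_frames` rather than on the length of the file.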
The script breaks after a while (probably on an audio file it cannot decode), showing:
```pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1```
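One way to keep a run alive when a single file fails is to catch the error per file, log it, and continue. The sketch below is not the original script: `process_all`, `fake_process`, and `FakeDecodeError` are hypothetical names, and the fake exception stands in for `pydub.exceptions.CouldntDecodeError` so the example has no external dependencies.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple, Type

logger = logging.getLogger(__name__)


def process_all(
    paths: Iterable[str],
    process_one: Callable[[str], dict],
    skip_on: Tuple[Type[BaseException], ...],
) -> Tuple[List[dict], Dict[str, str]]:
    """Run `process_one` on each path; record and skip files that raise `skip_on`."""
    results: List[dict] = []
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results.append(process_one(path))
        except skip_on as exc:
            logger.warning("Skipping %s: %s", path, exc)
            failures[path] = str(exc)
    return results, failures


class FakeDecodeError(Exception):
    """Stand-in for pydub.exceptions.CouldntDecodeError in this demo."""


def fake_process(path: str) -> dict:
    # Hypothetical processing step that fails on one specific file.
    if path.endswith("broken.mp3"):
        raise FakeDecodeError("Decoding failed. ffmpeg returned error code: 1")
    return {"id": path}


results, failures = process_all(
    ["a.wav", "broken.mp3", "b.wav"], fake_process, (FakeDecodeError,)
)
```

In the real pipeline, `skip_on` would be `(pydub.exceptions.CouldntDecodeError,)`, and the `failures` mapping gives a list of files to inspect afterwards.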
I think there is a better solution, although I lacked the time to implement it and am not 100% sure it is feasible:
- Create a Hugging Face [loading script](https://huggingface.co/docs/datasets/audio_dataset#loading-script)
- Then use the HF Dataset API to load and preprocess the audio files.
- For the Whisper model, HF provides useful functions to [preprocess](https://huggingface.co/learn/audio-course/chapter1/preprocessing) the audio:
```
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
...
```
Provider:
FlorianD