FlorianD/metavoice

Name: FlorianD/metavoice
Creator: FlorianD
Published: 2023-12-06 15:36:38
License: No description available

Hugging Face | Updated 2023-12-06 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/FlorianD/metavoice

Resource overview:

# Data Engineer: Take home project

## Introduction

The goal of this project is to evaluate your knowledge and skills in the design and implementation of a scalable data pre-processing pipeline.

## Problem statement

- Reads audio data being populated by the Metavoice product, `Studio`, into a CloudFlare R2 bucket
- Runs two data transformation steps on the audio files:
  - Transcription - use [Whisper](https://github.com/openai/whisper)
  - Tokenisation - use mock code [here](https://gist.github.com/sidroopdaska/364e9f493d8dd9584eb9e1e9cae5715c)
- Stores the results using the example schema below.

```
<id - relative path of audio file>, <transcription>, <token array>
```

## Requirements

- Install `ffmpeg` by following the instructions [here](https://www.hostinger.com/tutorials/how-to-install-ffmpeg)
- Use pipenv to install the required packages:

```
pipenv install
```

- Go to where the `main.py` file is located and run:

```
python main.py
```

## Notes

For scalability, I decided to read each audio file in chunks of a given `chunk_size` and to preprocess it chunk by chunk. This avoids memory issues when dealing with large audio files.

The script crashes after a while (probably on an audio file it cannot decode) with:

```
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
```

I think there is a better solution, which I did not pursue for lack of time and because I am not 100% sure it is feasible:

- Create a HuggingFace [loading-script](https://huggingface.co/docs/datasets/audio_dataset#loading-script)
- Then use the HF Dataset API to load the audio files and preprocess them.
- For the Whisper model, HF provides useful functions to [preprocess](https://huggingface.co/learn/audio-course/chapter1/preprocessing) the audio:

```
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
...
```

Provided by:

FlorianD

Summary of original information

Dataset overview

Problem statement

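Read the audio data that the Metavoice product, Studio, writes into a CloudFlare R2 bucket.
Run two data transformation steps on the audio files:
- Transcription: use Whisper (https://github.com/openai/whisper).
- Tokenisation: use the mock code at https://gist.github.com/sidroopdaska/364e9f493d8dd9584eb9e1e9cae5715c.
Store the results using the following example schema: <id - relative path of audio file>, <transcription>, <token array>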
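
The schema above can be illustrated with a minimal sketch. It assumes the results are stored as CSV and that the token array is JSON-encoded in its column; neither choice is stated in the source. The function name `write_records` and the sample record are hypothetical.

```python
import csv
import io
import json


def write_records(records, out):
    """Write (id, transcription, tokens) rows as CSV; tokens are JSON-encoded."""
    writer = csv.writer(out)
    writer.writerow(["id", "transcription", "tokens"])
    for audio_path, transcription, tokens in records:
        # The id is the relative path of the audio file, as in the schema.
        writer.writerow([audio_path, transcription, json.dumps(tokens)])


# Hypothetical sample record, for illustration only.
buffer = io.StringIO()
write_records([("studio/a.wav", "hello, world", [1, 2, 3])], buffer)
print(buffer.getvalue())
```

JSON-encoding the token array keeps each record on one row even though the array itself contains commas.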

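Requirements

Install ffmpeg by following the instructions at https://www.hostinger.com/tutorials/how-to-install-ffmpeg.
Use pipenv to install the required packages: pipenv install
Go to where the main.py file is located and run: python main.py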

Notes

For scalability, each audio file is read in chunks of a given chunk_size and preprocessed chunk by chunk, which avoids memory issues with large audio files.
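
A rough sketch of chunked reading follows. The original script appears to rely on pydub; this example swaps in Python's standard `wave` module so that it runs without extra dependencies, and it assumes the chunk size is counted in frames. The helper name `iter_wav_chunks` and the in-memory demo file are hypothetical.

```python
import io
import wave


def iter_wav_chunks(source, chunk_frames):
    """Yield successive byte chunks of at most `chunk_frames` frames from a WAV file."""
    with wave.open(source, "rb") as wav:
        while True:
            frames = wav.readframes(chunk_frames)
            if not frames:
                break
            yield frames


# Build a tiny in-memory WAV (10 frames of 16-bit mono silence) for demonstration.
demo = io.BytesIO()
with wave.open(demo, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 10)
demo.seek(0)

chunk_sizes = [len(chunk) for chunk in iter_wav_chunks(demo, chunk_frames=4)]
print(chunk_sizes)
```

Because the generator holds only one chunk at a time, peak memory depends on `chunk_frames` and not on the length of the file.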
The script crashes after running for a while (probably on an audio file it cannot decode) with the error: pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
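
One possible way to keep a single undecodable file from stopping the whole run is to catch the error per file and record the failure. This is a sketch and not the author's implementation: `process_all` and `fake_process` are hypothetical names, and a stand-in exception class is defined in case pydub is not importable.

```python
try:
    from pydub.exceptions import CouldntDecodeError
except ImportError:
    # Stand-in so the sketch runs without pydub installed.
    class CouldntDecodeError(Exception):
        pass


def process_all(paths, process_one):
    """Run `process_one` on each path, collecting failures instead of stopping."""
    results, failures = [], []
    for path in paths:
        try:
            results.append((path, process_one(path)))
        except CouldntDecodeError as exc:
            failures.append((path, str(exc)))
    return results, failures


# Hypothetical processing step that rejects one file, for demonstration only.
def fake_process(path):
    if path.endswith(".bad"):
        raise CouldntDecodeError("Decoding failed. ffmpeg returned error code: 1")
    return "ok"


results, failures = process_all(["a.wav", "b.bad", "c.wav"], fake_process)
print(results)
print(failures)
```

The list of failures can then be inspected after the run to find out which files ffmpeg could not decode.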
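A possibly better approach (not implemented for lack of time, and not confirmed to be feasible):
- Create a HuggingFace loading script (https://huggingface.co/docs/datasets/audio_dataset#loading-script).
- Use the HF Dataset API to load and preprocess the audio files.
- For the Whisper model, HF provides useful preprocessing functions: `from transformers import WhisperFeatureExtractor`, then `feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")` ...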
