Download link:

https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark

Resource description:

# README

## Introduction

This repository hosts Ming-Freeform-Audio-Edit, the benchmark test set for evaluating the downstream editing tasks of the Ming-UniAudio model.

The test set covers 8 distinct editing tasks, grouped as follows:

+ Semantic Editing (3 tasks):
  + Free-form Deletion
  + Free-form Insertion
  + Free-form Substitution
+ Acoustic Editing (5 tasks):
  + Time-stretching
  + Pitch Shifting
  + Dialect Conversion
  + Emotion Conversion
  + Volume Conversion

The audio samples are sourced from well-known open-source datasets, including seed-tts-eval, LibriTTS, and GigaSpeech.

## Dataset statistics

### Semantic Editing

All cells give the number of samples per instruction type, language, and task.

#### full version

| Instruction type | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| ---------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based      |         186 |          180 |              36 |         138 |          100 |              67 |
| Content-based    |          95 |          110 |             289 |          62 |           99 |             189 |
| Total            |         281 |          290 |             325 |         200 |          199 |             256 |

#### basic version

| Instruction type | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| ---------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based      |          92 |           65 |              29 |          47 |           79 |              29 |
| Content-based    |          78 |          105 |             130 |         133 |           81 |             150 |
| Total            |         170 |          170 |             159 |         180 |          160 |             179 |

*Index-based* instruction: specifies an operation on the content at positions *i* to *j* (e.g. delete the characters or words from index 3 to 12).

*Content-based* instruction: targets specific characters or words for editing (e.g. insert 'hello' before 'world').

### Acoustic Editing

All cells give the number of samples per task and language.

| Task               |   Zh |   En |
| ------------------ | ---: | ---: |
| Time-stretching    |   50 |   50 |
| Pitch Shifting     |   50 |   50 |
| Dialect Conversion |  250 |  --- |
| Emotion Conversion |   84 |   72 |
| Volume Conversion  |   50 |   50 |

## Evaluation Metrics

### Environment Preparation

```bash
git clone https://github.com/inclusionAI/Ming-Freeform-Audio-Edit.git
cd Ming-Freeform-Audio-Edit
pip install -r requirements.txt
```

**Note**: Please download the audio and meta files from [HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/tree/main) or [ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/files) and put the `wavs` and `meta` directories under `Ming-Freeform-Audio-Edit`.

### Semantic Editing

For the deletion, insertion, and substitution tasks, we evaluate performance using four key metrics:

+ Word Error Rate (WER) of the Edited Region (wer)
+ Word Error Rate (WER) of the Non-edited Region (wer.noedit)
+ Edit Operation Accuracy (acc)
+ Speaker Similarity (sim)

1. If the directories that contain your edited waveforms are organized as below:

```
eval_path
├── del
│   └── edit_del_basic
│       └── tts/    # This is the actual directory that contains the edited wavs
├── ins
│   └── edit_ins_basic
│       └── tts/    # This is the actual directory that contains the edited wavs
└── sub
    └── edit_sub_basic
        └── tts/    # This is the actual directory that contains the edited wavs
```

Then you can run the following command to get those metrics:

```bash
cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_semantic.sh eval_path \
    whisper_path \
    paraformer_path \
    wavlm_path \
    eval_mode \
    lang
```

Here is a brief description of the parameters for the script above:

+ `eval_path`: The top-level directory containing the subdirectories for each editing task.
+ `whisper_path`: Path to the Whisper model, which is used to calculate WER for English audio. You can download it from [here](https://huggingface.co/openai/whisper-large-v3).
+ `paraformer_path`: Path to the Paraformer model, which is used to calculate WER for Chinese audio. You can download it from [here](https://huggingface.co/funasr/paraformer-zh).
+ `wavlm_path`: Path to the WavLM model, which is used to calculate speaker similarity. You can download it from [here](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
+ `eval_mode`: Specifies which version of the evaluation set to use. Choose between `basic` and `open`.
+ `lang`: The language to evaluate. Choose between `zh` and `en`.

2. If your directory for the edited audio is not organized in the format described above, you can run the following commands:

```bash
cd eval_scripts

# get wer, wer.noedit
bash cal_wer_edit.sh meta_file \
    wav_dir \
    lang \
    num_jobs \
    res_dir \
    task_type \
    eval_mode \
    whisper_path \
    paraformer_path \
    edit_cat  # use `semantic` here

# get sim
bash cal_sim_edit.sh meta_file \
    wav_dir \
    wavlm_path \
    num_jobs \
    res_dir \
    lang
```

Here is a brief description of the parameters for the scripts above:

+ `meta_file`: The absolute path to the meta file for the corresponding task (e.g., `meta_en_deletion_basic.csv` or `meta_en_deletion.csv`).
+ `wav_dir`: The directory containing the edited audio files (the WAV files should be located directly in this directory).
+ `lang`: `zh` or `en`.
+ `num_jobs`: The number of parallel processes.
+ `res_dir`: The directory in which to save the metric results.
+ `task_type`: `del`, `ins`, or `sub`.
+ `eval_mode`: Same as above.
+ `whisper_path`: Same as above.
+ `paraformer_path`: Same as above.
+ `edit_cat`: `semantic` or `acoustic`.

### Acoustic Editing

For the acoustic editing tasks, we use WER and SPK-SIM as the primary evaluation metrics.

1. If the directory of edited audio follows the layout that `run_eval_acoustic.sh` expects, you can run the following command:

```bash
cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_acoustic.sh eval_path \
    whisper_path \
    paraformer_path \
    wavlm_path \
    eval_mode \
    lang
```

2. Otherwise, you can run commands similar to those for the semantic tasks, with the `edit_cat` parameter set to `acoustic`.

Additionally, for the dialect and emotion conversion tasks, we assess conversion accuracy with a large language model (LLM) accessed through API calls; refer to `eval_scripts/run_eval_acoustic.sh` for more details.
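As a convenience for the semantic-editing workflow, the directory layout that `run_eval_semantic.sh` is described as expecting can be prepared before copying edited audio into it. The following is a minimal sketch only: the `EVAL_PATH` variable and its temporary-directory default are assumptions made for illustration, and only the `basic` directory names are created because those are the only names shown in this README.

```bash
#!/usr/bin/env bash
# Sketch: create <eval_path>/<task>/edit_<task>_basic/tts/ for each semantic task.
# EVAL_PATH is a hypothetical variable name; it defaults to a fresh temp directory.
set -euo pipefail

EVAL_PATH="${EVAL_PATH:-$(mktemp -d)}"

for task in del ins sub; do
  # Directory names follow the tree shown in the Semantic Editing section.
  mkdir -p "${EVAL_PATH}/${task}/edit_${task}_basic/tts"
done

# List the leaf directories that should receive the edited WAV files.
find "${EVAL_PATH}" -type d -name tts | sort
```

After copying the edited WAV files into the three `tts/` directories, the resulting `EVAL_PATH` can be passed as the `eval_path` argument described above.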


Application scenarios: