Ming-Freeform-Audio-Edit-Benchmark

Source: ModelScope community (updated 2026-04-28, indexed 2025-10-04)
Download link: https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark
Description:
# README

## Introduction

This repository hosts Ming-Freeform-Audio-Edit, the benchmark test set for evaluating the downstream editing tasks of the Ming-UniAudio model.

This test set covers 8 distinct editing tasks, categorized as follows:

+ Semantic Editing (3 tasks):
  + Free-form Deletion
  + Free-form Insertion
  + Free-form Substitution
+ Acoustic Editing (5 tasks):
  + Time-stretching
  + Pitch Shifting
  + Dialect Conversion
  + Emotion Conversion
  + Volume Conversion

The audio samples are sourced from well-known open-source datasets, including seed-tts eval, LibriTTS, and Gigaspeech.

## Dataset statistics

### Semantic Editing

Each cell gives the number of samples for that instruction type, language, and task.

#### full version

| Instruction type | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| ---------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based      |         186 |          180 |              36 |         138 |          100 |              67 |
| Content-based    |          95 |          110 |             289 |          62 |           99 |             189 |
| Total            |         281 |          290 |             325 |         200 |          199 |             256 |

#### basic version

| Instruction type | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| ---------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based      |          92 |           65 |              29 |          47 |           79 |              29 |
| Content-based    |          78 |          105 |             130 |         133 |           81 |             150 |
| Total            |         170 |          170 |             159 |         180 |          160 |             179 |

+ *Index-based* instruction: specifies an operation on the content at positions *i* to *j* (e.g. delete the characters or words from index 3 to 12).
+ *Content-based* instruction: targets specific characters or words for editing (e.g. insert 'hello' before 'world').

### Acoustic Editing

Each cell gives the number of samples for that task and language.

| Task               |   Zh |   En |
| ------------------ | ---: | ---: |
| Time-stretching    |   50 |   50 |
| Pitch Shifting     |   50 |   50 |
| Dialect Conversion |  250 |  --- |
| Emotion Conversion |   84 |   72 |
| Volume Conversion  |   50 |   50 |

## Evaluation Metrics

### Environment Preparation

```bash
git clone https://github.com/inclusionAI/Ming-Freeform-Audio-Edit.git
cd Ming-Freeform-Audio-Edit
pip install -r requirements.txt
```

**Note**: Please download the audio and meta files from [HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/tree/main) or [ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/files) and put the `wavs` and `meta` directories under `Ming-Freeform-Audio-Edit`.

### Semantic Editing

For the deletion, insertion, and substitution tasks, we evaluate performance using four key metrics:

+ Word Error Rate (WER) of the Edited Region (wer)
+ Word Error Rate (WER) of the Non-edited Region (wer.noedit)
+ Edit Operation Accuracy (acc)
+ Speaker Similarity (sim)

1. If the directories containing the edited waveforms are organized as follows:

   ```
   eval_path
   |
   ├── del
   │   └── edit_del_basic
   │       └── tts/  # This is the actual directory that contains the edited wavs
   ├── ins
   │   └── edit_ins_basic
   │       └── tts/  # This is the actual directory that contains the edited wavs
   └── sub
       └── edit_sub_basic
           └── tts/  # This is the actual directory that contains the edited wavs
   ```

   then you can run the following command to get those metrics:

   ```bash
   cd Ming-Freeform-Audio-Edit/eval_scripts
   bash run_eval_semantic.sh eval_path \
       whisper_path \
       paraformer_path \
       wavlm_path \
       eval_mode \
       lang
   ```

   Here is a brief description of the parameters for the script above:

   + `eval_path`: The top-level directory containing subdirectories for each editing task.
   + `whisper_path`: Path to the Whisper model, which is used to calculate WER for English audio. You can download it from [here](https://huggingface.co/openai/whisper-large-v3).
   + `paraformer_path`: Path to the Paraformer model, which is used to calculate WER for Chinese audio. You can download it from [here](https://huggingface.co/funasr/paraformer-zh).
   + `wavlm_path`: Path to the WavLM model, which is used to calculate speaker similarity. You can download it from [here](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
   + `eval_mode`: Specifies which version of the evaluation set to use. Choose between `basic` and `open`.
   + `lang`: The language to evaluate. Choose between `zh` and `en`.

2. If your directory for the edited audio is not organized in the format described above, you can run the following commands instead:

   ```bash
   cd eval_scripts
   # get wer, wer.noedit
   bash cal_wer_edit.sh meta_file \
       wav_dir \
       lang \
       num_jobs \
       res_dir \
       task_type \
       eval_mode \
       whisper_path \
       paraformer_path \
       edit_cat # use `semantic` here
   # get sim
   bash cal_sim_edit.sh meta_file \
       wav_dir \
       wavlm_path \
       num_jobs \
       res_dir \
       lang
   ```

   Here is a brief description of the parameters for the scripts above:

   + `meta_file`: The absolute path to the meta file for the corresponding task (e.g., `meta_en_deletion_basic.csv` or `meta_en_deletion.csv`).
   + `wav_dir`: The directory containing the edited audio files (the WAV files should be located directly in this directory).
   + `lang`: `zh` or `en`.
   + `num_jobs`: The number of parallel processes.
   + `res_dir`: The directory in which to save the metric results.
   + `task_type`: `del`, `ins`, or `sub`.
   + `eval_mode`: The same as above.
   + `whisper_path`: The same as above.
   + `paraformer_path`: The same as above.
   + `edit_cat`: `semantic` or `acoustic`.

### Acoustic Editing

For the acoustic editing tasks, we use WER and speaker similarity (SPK-SIM) as the primary evaluation metrics.

1. If the directory for the edited audio follows the structure the script expects, you can run the following command:

   ```bash
   cd Ming-Freeform-Audio-Edit/eval_scripts
   bash run_eval_acoustic.sh eval_path \
       whisper_path \
       paraformer_path \
       wavlm_path \
       eval_mode \
       lang
   ```

2. Otherwise, you can run commands similar to those for the semantic tasks, with the `edit_cat` parameter set to `acoustic`.

Additionally, for the dialect and emotion conversion tasks, we assess conversion accuracy with a large language model (LLM) accessed through API calls. Refer to `eval_scripts/run_eval_acoustic.sh` for more details.
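
As a convenience for the semantic evaluation above, the expected `eval_path` layout can be created and checked with a small helper before running `run_eval_semantic.sh`. This is only a sketch under stated assumptions: the function names `make_semantic_layout` and `count_edited_wavs` are hypothetical and not part of the repository, the `basic` suffix in the example tree is assumed to follow `eval_mode`, and edited files are assumed to use the `.wav` extension.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create <eval_path>/<task>/edit_<task>_<eval_mode>/tts for del, ins, and sub.
# ASSUMPTION: the "basic" suffix shown in the README tree tracks eval_mode.
make_semantic_layout() {
    local eval_path="$1"
    local eval_mode="${2:-basic}"
    local task
    for task in del ins sub; do
        mkdir -p "${eval_path}/${task}/edit_${task}_${eval_mode}/tts"
    done
}

# Print "<task> <count>" for each task, counting files directly inside tts/.
# ASSUMPTION: edited audio files end in ".wav".
count_edited_wavs() {
    local eval_path="$1"
    local eval_mode="${2:-basic}"
    local task dir n
    for task in del ins sub; do
        dir="${eval_path}/${task}/edit_${task}_${eval_mode}/tts"
        n=$(find "$dir" -maxdepth 1 -type f -name '*.wav' | wc -l | tr -d ' ')
        echo "${task} ${n}"
    done
}
```

Calling `make_semantic_layout ./eval_path basic`, copying the edited audio into each `tts/` directory, and then running `count_edited_wavs ./eval_path basic` gives a quick sanity check that every task directory is populated before the evaluation scripts are invoked.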

Provider: maas
Created: 2025-09-29