Ming-Freeform-Audio-Edit-Benchmark
Source: ModelScope community (updated 2026-04-28, indexed 2025-10-04)
Download link:
https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark
# README
## Introduction
This repository hosts Ming-Freeform-Audio-Edit, the benchmark test set for evaluating the downstream editing tasks of the Ming-UniAudio model.
This test set covers 8 distinct editing tasks, categorized as follows:
+ Semantic Editing (3 tasks):
  + Free-form Deletion
  + Free-form Insertion
  + Free-form Substitution
+ Acoustic Editing (5 tasks):
  + Time-stretching
  + Pitch Shifting
  + Dialect Conversion
  + Emotion Conversion
  + Volume Conversion
The audio samples are sourced from well-known open-source datasets, including seed-tts-eval, LibriTTS, and GigaSpeech.
## Dataset statistics
### Semantic Editing
#### full version
| Instruction type \ Task (# samples) | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| -------------------------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based | 186 | 180 | 36 | 138 | 100 | 67 |
| Content-based | 95 | 110 | 289 | 62 | 99 | 189 |
| Total | 281 | 290 | 325 | 200 | 199 | 256 |
#### basic version
| Instruction type \ Task (# samples) | Zh deletion | Zh insertion | Zh substitution | En deletion | En insertion | En substitution |
| -------------------------------- | ----------: | -----------: | --------------: | ----------: | -----------: | --------------: |
| Index-based | 92 | 65 | 29 | 47 | 79 | 29 |
| Content-based | 78 | 105 | 130 | 133 | 81 | 150 |
| Total | 170 | 170 | 159 | 180 | 160 | 179 |
*Index-based* instructions specify an operation on the content at positions *i* to *j* (e.g., delete the characters or words from index 3 to 12).
*Content-based* instructions target specific characters or words for editing (e.g., insert 'hello' before 'world').
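
The two instruction styles can be illustrated on plain text. The snippet below is only a sketch and not part of the benchmark tooling: the sample sentences are made up, and the 1-based, inclusive word indexing is an assumption, since the README does not state which indexing convention the benchmark uses.

```bash
# Illustrative only: apply both instruction styles to a text transcript.
# Assumption: word indices are 1-based and inclusive.
text="the quick brown fox jumps over the lazy dog"

# Index-based: "delete the words from index 3 to 4".
index_based=$(printf '%s\n' "$text" | awk -v i=3 -v j=4 '{
    out = ""
    for (k = 1; k <= NF; k++)
        if (k < i || k > j) out = out (out == "" ? "" : " ") $k
    print out
}')

# Content-based: "insert 'hello' before 'world'".
content_based=$(printf '%s\n' "welcome to the world" | sed 's/world/hello world/')

echo "$index_based"     # the quick jumps over the lazy dog
echo "$content_based"   # welcome to the hello world
```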
### Acoustic Editing
| Task (# samples) \ Language | Zh | En |
| -------------------------------- | ---: | ---: |
| Time-stretching | 50 | 50 |
| Pitch Shifting | 50 | 50 |
| Dialect Conversion | 250 | --- |
| Emotion Conversion | 84 | 72 |
| Volume Conversion | 50 | 50 |
## Evaluation Metrics
### Environment Preparation
```bash
git clone https://github.com/inclusionAI/Ming-Freeform-Audio-Edit.git
cd Ming-Freeform-Audio-Edit
pip install -r requirements.txt
```
**Note**: Please download the audio and meta files from [HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/tree/main) or [ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark/files) and put the `wavs` and `meta` directories under the `Ming-Freeform-Audio-Edit` directory.
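
Before running any evaluation script, it may help to confirm that the downloaded directories landed in the right place. The check below is a sketch rather than something shipped with the repository: `check_layout` is a hypothetical helper, and the only facts it relies on are the directory names `wavs` and `meta` from the note above.

```bash
# Hypothetical helper (not part of the repository): print the names of the
# required directories that are missing under the given repository root.
check_layout() {
    repo_root="$1"
    missing=""
    for d in wavs meta; do
        if [ ! -d "$repo_root/$d" ]; then
            missing="$missing $d"
        fi
    done
    # Strip the leading space; empty output means nothing is missing.
    printf '%s\n' "${missing# }"
}

# Demo against a throwaway directory so the snippet runs anywhere.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/wavs"
missing_dirs=$(check_layout "$demo_root")
echo "missing: ${missing_dirs:-none}"
```

In practice you would call `check_layout` on the path of your `Ming-Freeform-Audio-Edit` clone instead of the demo directory.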
### Semantic Editing
For the deletion, insertion, and substitution tasks, we evaluate the performance using four key metrics:
+ Word Error Rate (WER) of the Edited Region (wer)
+ Word Error Rate (WER) of the Non-edited Region (wer.noedit)
+ Edit Operation Accuracy (acc)
+ Speaker Similarity (sim)
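
For reference, WER is conventionally defined as shown below. This is the generic definition; how the evaluation scripts restrict the computation to the edited or non-edited region is not described in this README, so it should not be read as the scripts' exact implementation.

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Here $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in the minimum-edit-distance alignment between the ASR transcript and the reference text, and $N$ is the number of words in the reference.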
1. If the directories containing the edited waveforms are organized as shown below:
```
eval_path
├── del
│   └── edit_del_basic
│       └── tts/    # directory that actually contains the edited wavs
├── ins
│   └── edit_ins_basic
│       └── tts/    # directory that actually contains the edited wavs
└── sub
    └── edit_sub_basic
        └── tts/    # directory that actually contains the edited wavs
```
Then you can run the following command to get those metrics:
```bash
cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_semantic.sh eval_path \
whisper_path \
paraformer_path \
wavlm_path \
eval_mode \
lang
```
Here is a brief description of the parameters for the script above:
+ `eval_path`: The top-level directory containing the subdirectories for each editing task.
+ `whisper_path`: Path to the Whisper model, which is used to calculate WER for English audio. You can download it from [here](https://huggingface.co/openai/whisper-large-v3).
+ `paraformer_path`: Path to the Paraformer model, which is used to calculate WER for Chinese audio. You can download it from [here](https://huggingface.co/funasr/paraformer-zh).
+ `wavlm_path`: Path to the WavLM model, which is used to calculate speaker similarity. You can download it from [here](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
+ `eval_mode`: Specifies which version of the evaluation set to use. Choose between `basic` and `open`.
+ `lang`: The language to evaluate. Choose between `zh` and `en`.
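
As a sketch of how these pieces fit together, the snippet below scaffolds the expected directory layout and prints the resulting command as a dry run. All paths are placeholders, and the `edit_<task>_<eval_mode>` naming of the middle directory is an assumption extrapolated from the `basic` names in the tree above; the README does not show the names used for `open`.

```bash
# Placeholders: replace these with real locations on your machine.
eval_path=$(mktemp -d)                     # throwaway directory for this demo
whisper_path="/path/to/whisper-large-v3"
paraformer_path="/path/to/paraformer-zh"
wavlm_path="/path/to/wavlm_checkpoint"
eval_mode="basic"
lang="zh"

# Create <eval_path>/<task>/edit_<task>_<eval_mode>/tts for each task.
# Assumption: the middle directory follows the pattern seen for `basic`.
for task in del ins sub; do
    mkdir -p "$eval_path/$task/edit_${task}_${eval_mode}/tts"
done

# Dry run: print the command instead of executing it.
cmd="bash run_eval_semantic.sh $eval_path $whisper_path $paraformer_path $wavlm_path $eval_mode $lang"
echo "$cmd"
```

Copy your edited wavs into the `tts/` directories, then run the printed command from `eval_scripts`.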
2. If your directory for the edited audio is not organized in the format described above, you can run the following commands.
```bash
cd eval_scripts
# get wer, wer.noedit
bash cal_wer_edit.sh meta_file \
wav_dir \
lang \
num_jobs \
res_dir \
task_type \
eval_mode \
whisper_path \
paraformer_path \
edit_cat # use `semantic` here
# get sim
bash cal_sim_edit.sh meta_file \
wav_dir \
wavlm_path \
num_jobs \
res_dir \
lang
```
Here is a brief description of the parameters for the scripts above:
+ `meta_file`: The absolute path to the meta file for the corresponding task (e.g., `meta_en_deletion_basic.csv` or `meta_en_deletion.csv`).
+ `wav_dir`: The directory containing the edited audio files (the WAV files should be located directly in this directory).
+ `lang`: `zh` or `en`.
+ `num_jobs`: The number of parallel processes.
+ `res_dir`: The directory in which to save the metric results.
+ `task_type`: `del`, `ins`, or `sub`.
+ `eval_mode`: Same as above.
+ `whisper_path`: Same as above.
+ `paraformer_path`: Same as above.
+ `wavlm_path`: Same as above.
+ `edit_cat`: `semantic` or `acoustic`.
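
When all three semantic tasks need to be evaluated this way, the two commands can be generated in a loop. The sketch below only prints the commands (a dry run). Every path is a placeholder, the one-subdirectory-per-task layout of `out_root` is hypothetical, and the meta file names for insertion and substitution are assumed to follow the same pattern as the `meta_en_deletion_basic.csv` example.

```bash
# Placeholders (hypothetical paths); argument order follows the commands above.
meta_dir="/abs/path/to/Ming-Freeform-Audio-Edit/meta"
out_root="/abs/path/to/edited_wavs"   # hypothetical: one subdirectory per task
res_dir="/abs/path/to/results"
lang="en"
eval_mode="basic"
num_jobs=4
whisper_path="/path/to/whisper-large-v3"
paraformer_path="/path/to/paraformer-zh"
wavlm_path="/path/to/wavlm_checkpoint"

plan=""
for pair in del:deletion ins:insertion sub:substitution; do
    task_type=${pair%%:*}   # short name passed to the script
    task_name=${pair#*:}    # long name used in the meta file name
    # Assumption: all basic meta files are named meta_<lang>_<task>_basic.csv.
    meta_file="$meta_dir/meta_${lang}_${task_name}_${eval_mode}.csv"
    wav_dir="$out_root/$task_type"
    wer_cmd="bash cal_wer_edit.sh $meta_file $wav_dir $lang $num_jobs $res_dir $task_type $eval_mode $whisper_path $paraformer_path semantic"
    sim_cmd="bash cal_sim_edit.sh $meta_file $wav_dir $wavlm_path $num_jobs $res_dir $lang"
    plan="$plan$wer_cmd
$sim_cmd
"
done

# Dry run: print the six commands instead of executing them.
printf '%s' "$plan"
```

Check the printed meta file paths against the contents of your `meta` directory before running the commands for real.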
### Acoustic Editing
For the acoustic editing tasks, we use WER and SPK-SIM as the primary evaluation metrics.
1. If the directory for the edited audio is organized in the structure expected by `run_eval_acoustic.sh`, you can run the following command.
```bash
cd Ming-Freeform-Audio-Edit/eval_scripts
bash run_eval_acoustic.sh eval_path \
whisper_path \
paraformer_path \
wavlm_path \
eval_mode \
lang
```
2. Otherwise, you can run commands similar to the one for the semantic tasks, with the `edit_cat` parameter set to `acoustic`.
Additionally, for the dialect and emotion conversion tasks, we assess the conversion accuracy by calling a large language model (LLM) through its API. Refer to `eval_scripts/run_eval_acoustic.sh` for more details.
Provider: maas
Created: 2025-09-29



