CommonPhone-SE
ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/BSC-LT/CommonPhone-SE
# CommonPhone-SE
<!-- Provide a quick summary of the dataset. -->
A multilingual, age- and gender-balanced subset of CommonPhone for benchmarking speech enhancement.
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
CommonPhone-SE is a benchmark dataset derived from CommonPhone. It contains audio samples in 7 languages from speakers aged 18 to 80. It aims to provide a speaker-diverse dataset for benchmarking speech enhancement algorithms in real-world conditions.
- **Curated by:** LangTech Lab members from the speech team.
- **Language(s) (NLP):** CA, DE, EN, IT, FR, RU, ES
- **License:** cc0-1.0
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
CommonPhone-SE is a subset of CommonPhone.
- **Repository:** https://zenodo.org/records/5846137
- **Paper:** https://arxiv.org/abs/2201.05912
### Languages
Catalan (CA), German (DE), English (EN), Italian (IT), French (FR), Russian (RU), Spanish (ES)
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
The goal of this dataset is to evaluate how well speech enhancement models generalize on a real-world, multilingual, and diverse dataset.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
The dataset consists of a single split that provides audio, transcriptions, and demographic information:
```
Dataset({
features: ['filename', 'gender', 'age', 'language', 'text', 'audio'],
num_rows: 5242
})
```
Each data point is structured as:
```
{'filename': 'common_voice_ca_31498257',
'gender': 'female',
'age': 'fifties',
'language': 'ca',
'text': 'lieutenant monroe va resultar ferit durant la batalla i va servir posteriorment al congrés',
'audio': {'path': 'Commonphone-SE/common_voice_ca_31498257.wav',
'array': array([0., 0., 0., ..., 0., 0., 0.]),
'sampling_rate': 16000}}
```
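As a minimal sketch of how these fields might be used, the snippet below counts records per (language, age, gender) group and derives a clip's duration from `audio['array']` and `audio['sampling_rate']`. The records are small hand-made stand-ins shaped like the example above, not real dataset rows, and the helper names `group_counts` and `duration_seconds` are hypothetical.

```python
from collections import Counter

# Hand-made stand-in records shaped like the example above (NOT real rows).
# Plain lists replace the NumPy array so the sketch has no dependencies.
records = [
    {"filename": "clip_a", "gender": "female", "age": "fifties", "language": "ca",
     "text": "placeholder", "audio": {"array": [0.0] * 32000, "sampling_rate": 16000}},
    {"filename": "clip_b", "gender": "female", "age": "fifties", "language": "ca",
     "text": "placeholder", "audio": {"array": [0.0] * 16000, "sampling_rate": 16000}},
    {"filename": "clip_c", "gender": "male", "age": "twenties", "language": "de",
     "text": "placeholder", "audio": {"array": [0.0] * 48000, "sampling_rate": 16000}},
]


def group_counts(rows):
    """Count rows per (language, age, gender) combination."""
    return Counter((r["language"], r["age"], r["gender"]) for r in rows)


def duration_seconds(row):
    """Clip length in seconds: number of samples divided by the sampling rate."""
    audio = row["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


counts = group_counts(records)
total_seconds = sum(duration_seconds(r) for r in records)
```

On the real split, the same per-group counts give a quick check of how evenly the demographic groups are represented.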
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
The sampling rationale was to select audio that remains difficult for state-of-the-art enhancement models, in terms of both speech quality metrics and content preservation. We therefore selected the 40 worst examples with respect to UTMOS, SCOREQ, and WIL for each language, age band, and gender. Finally, duplicates were dropped, yielding a final evaluation dataset of 8.24 hours.
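The selection described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: it assumes the 40 worst examples are taken separately per metric within each group and then merged, that higher is better for UTMOS and SCOREQ while lower is better for WIL, and that scores are stored in hypothetical `utmos`, `scoreq`, and `wil` fields.

```python
from collections import defaultdict

# Assumed metric directions (not stated in this card): higher is better for
# UTMOS and SCOREQ, lower is better for WIL.
HIGHER_IS_BETTER = {"utmos": True, "scoreq": True, "wil": False}


def select_hard_examples(rows, k=40):
    """Union of the k worst rows per metric within each (language, age, gender) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["language"], row["age"], row["gender"])].append(row)

    selected = {}  # keyed by filename, so duplicates collapse automatically
    for members in groups.values():
        for metric, higher_is_better in HIGHER_IS_BETTER.items():
            # Sort so the worst rows come first: ascending when higher is
            # better, descending when lower is better.
            ranked = sorted(members, key=lambda r: r[metric],
                            reverse=not higher_is_better)
            for row in ranked[:k]:
                selected[row["filename"]] = row
    return list(selected.values())


# Toy scores for one group (hypothetical values, for illustration only).
toy_rows = [
    {"filename": "a", "language": "ca", "age": "fifties", "gender": "female",
     "utmos": 1.5, "scoreq": 2.0, "wil": 0.1},
    {"filename": "b", "language": "ca", "age": "fifties", "gender": "female",
     "utmos": 3.0, "scoreq": 3.5, "wil": 0.6},
    {"filename": "c", "language": "ca", "age": "fifties", "gender": "female",
     "utmos": 4.0, "scoreq": 4.2, "wil": 0.2},
]
hard = select_hard_examples(toy_rows, k=1)
hard_names = sorted(r["filename"] for r in hard)
```

With `k=1`, clip `a` is the worst on both quality metrics and clip `b` is the worst on WIL, so the merged selection holds two clips rather than three. That overlap is why a deduplication step is needed.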
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
Crowdsourced audio recorded by volunteers for Common Voice and selected for inclusion in the CommonPhone dataset.
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU).
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
As with Common Voice, you must not attempt to identify the speakers who contributed to CommonPhone-SE.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The dataset was built to mitigate bias with respect to gender and age; however, it may still be biased towards the degradations found in the Common Voice corpus. Although the dataset is highly diverse, it contains only read speech.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
If you use the dataset, please cite:
**BibTeX:**
```
@inproceedings{giraldo25_interspeech,
  title = {{Evaluating Speech Enhancement Performance Across Demographics and Language}},
  author = {Jose Giraldo and Alex Peiró-Lilja and Carme Armentano-Oller and Rodolfo Zevallos and Cristina España-Bonet},
  year = {2025},
  booktitle = {Interspeech 2025},
  pages = {1353--1357},
  doi = {10.21437/Interspeech.2025-1760},
  issn = {2958-1796},
}
```
## Dataset Card Authors
## Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/), and by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia, funded by the EU (NextGenerationEU), within the framework of the ILENIA project, reference 2022/TL22/00215337.
## Dataset Card Contact
langtech@bsc.es
Provider:
maas
Created:
2025-10-28



