Multilingual Speaker Anonymization Trials for CommonVoice and Multilingual LibriSpeech

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/12801026

Description:

This dataset contains the speaker verification trial files of the evaluation data splits for Multilingual LibriSpeech (MLS) and CommonVoice (CV) that we propose in our paper "Probing the Feasibility of Multilingual Speaker Anonymization". The actual audio files are not included and have to be obtained separately, following the licenses of the respective corpus creators. The files in this dataset only contain the audio file IDs that can be used to prepare the data for the evaluation.

Data

All files contain the utterance IDs of the original MLS and CV corpora, which makes it possible to align them to the audio files as provided by the dataset creators. If you use these datasets, you need to cite the original sources. We do not claim any rights to the audio.

The dataset does not include all IDs of the corpora. For MLS, "dev" and "test" correspond to the dev and test splits as provided in the MLS corpus. For CV, the corpus was divided randomly into dev and test while ensuring no speaker overlap between the two splits. We use version 16.1 of the CV corpus and restrict it to validated audio where the user specified their gender as either female or male. Further information about the data restrictions can be found in the paper.

Please note that we use the "client ID" of CV to distinguish between speakers. We acknowledge that the same speaker can appear in the corpus multiple times under different names if they were assigned different client IDs. Based on our results for the original (non-anonymized) data, we believe this has only a small effect on our evaluation data.

The following languages are included in this dataset: English (en), German (de), Dutch (nl), French (fr), Spanish (es), Italian (it), Portuguese (pt), Polish (pl), and Russian (ru).

Structure

The directory contains separate folders for MLS and CV, with subfolders for each language.
Each language subfolder contains 7 files:

- dev_enrolls
- dev_trials_f
- dev_trials_m
- test_enrolls
- test_trials_f
- test_trials_m
- utt2spk

The file structures follow the evaluation data of the Voice Privacy Challenges (https://www.voiceprivacychallenge.org). The "enrolls" files contain the list of utterance IDs (i.e., audio files) that are used for enrollment of the speaker verification model. "trials_f" and "trials_m" are the trial files for female and male speakers, respectively. Each line in a trial file consists of three fields separated by spaces: "enrollment speaker" "trial utterance" "target/nontarget". The last field indicates whether the trial utterance was originally (i.e., before the anonymization) spoken by the enrollment speaker (target) or not (nontarget). The utt2spk file contains the true mapping between utterance and original speaker. This file is especially important for the CV corpus, where we created new speaker names to replace the long client IDs and where the speaker assignment is not visible in the file name.

Creation Process

For the division of the data into enrollment and trial subsets, we tried to follow the dev and test files of the Voice Privacy Challenge 2022. This results in far more nontarget than target trials, and in additional speakers in the trial set that are not contained in the enrollment set.

MLS

The MLS corpus already comes with a split into train / dev / test, which we reuse here. The data is further divided into 8 languages: en, de, nl, fr, es, it, pt, and pl. We only take the dev and test sets for 6 languages (de, nl, fr, es, it, and pt). For en, we already have an alternative from the Voice Privacy Challenges based on the monolingual English LibriSpeech, so there is no need for another part from MLS. For pl, the MLS part only contains 2 speakers per gender and dev / test split, which is too few for effective speaker verification.
(Sidenote: Dutch is not significantly bigger, with only 3 speakers per gender and split, but we decided to keep it anyway.) We do not further restrict the number of utterances or speakers in each language and dev / test set, which results in a large imbalance across languages. However, as given in the original MLS splits, the languages themselves are balanced in terms of gender.

CV

Mozilla's CommonVoice data collection is significantly bigger than MLS, so we could select more speakers per language and make sure that the datasets per language were more or less balanced. We take the data from CV version 16.1. Please note that users who donated their voice to the data collection can opt out of being included in the data at any point. This means that some speakers or utterances contained in our trial data might be missing in future downloads of the CV corpus.

We further want to mention that we had to use the client ID field in the CV corpus for speaker assignment, which is not fully accurate. The same speaker might end up with several client IDs if they record their utterances in different sessions. However, for the purpose of speaker anonymization, this issue is not as relevant as for pure speaker recognition.

We use CV for all of our 9 languages. CV does not come with a division into train / dev / test splits, so we randomly sample speakers and utterances from it. For this sampling, we consider only speakers whose gender is annotated as either female or male and who have recorded at least 50 validated utterances. We further make sure that we have the same number of female and male speakers. Because most languages have far more male than female speakers in CV, this reduces the size considerably for several languages. The effect is strongest for pl, for which only 14 speakers per gender and dev / test split are available.
We randomly select at most 20 speakers per gender for each of dev and test in each language, and randomly choose up to 70 utterances per speaker. As our lower bound for speaker selection was 50 utterances per speaker, this results in 50-70 utterances per speaker.

Separation into Enrollment and Trials

In MLS, we use all speakers for both enrollment and trial. The only exception is de, for which more speakers are available. In the MLS-de data, we reserve 5 speakers per gender and split for trial only, which creates some unseen distractor speakers in the trial data.

In CV, we use 15 speakers per gender and split for enrollment (except for the smaller pl part, for which it is only 10), and also reserve up to 5 speakers per gender and split for trial only. Naturally, all enrollment speakers are also used in trial.

For each enrollment speaker, 15% of their utterances (at least 5) are used as enrollment utterances and the rest for trial. All trial utterances are paired with each enrollment speaker of the respective gender. If an enrollment speaker is the actual speaker of that utterance, the trial is labeled target, otherwise nontarget. During the trials, the enrollment speaker is modeled as the average of the speaker embeddings of all enrollment utterances of that speaker.

Statistics

The following section displays the statistics for each dataset, language, and dev / test split. In these statistics, female and male speakers are not distinguished, but the numbers are balanced for each subset.
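The trial counts reported in this section follow from the pairing rule described above: every trial utterance is paired with every enrollment speaker of the same gender. The following is a minimal sketch of that rule for a single gender. It is an illustration only, not the preparation script from the repository; the function name `build_trials` and all speaker and utterance IDs are hypothetical.

```python
from collections import Counter


def build_trials(enroll_speakers, trial_utts):
    """Pair every trial utterance with every enrollment speaker of one gender.

    enroll_speakers: list of speaker IDs used for enrollment
    trial_utts: dict mapping trial utterance ID -> true speaker ID
    Returns a list of (enrollment speaker, trial utterance, label) tuples.
    """
    trials = []
    for utt, true_spk in sorted(trial_utts.items()):
        for spk in enroll_speakers:
            label = "target" if spk == true_spk else "nontarget"
            trials.append((spk, utt, label))
    return trials


# Toy data (hypothetical IDs): two enrollment speakers and one trial-only speaker.
enroll_speakers = ["spk_a", "spk_b"]
trial_utts = {
    "utt_1": "spk_a",
    "utt_2": "spk_a",
    "utt_3": "spk_b",
    "utt_4": "spk_c",  # trial-only speaker: produces nontarget trials only
}

trials = build_trials(enroll_speakers, trial_utts)
counts = Counter(label for _, _, label in trials)
print(counts["target"], counts["nontarget"])
```

The same arithmetic can be checked against the MLS fr development row: 2,062 trial utterances paired with 9 enrollment speakers per gender (18 speakers, half female and half male) give 2,062 × 9 = 18,558 trials, of which 2,062 are target and 16,496 are nontarget.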
The following information is given for each dataset and language:

- # speakers: number of speakers used for both enrollment and trial (50% female / 50% male)
- # add. trial speakers: number of speakers additionally used only in trial
- # enroll utts: total number of utterances used in enrollment (across all speakers)
- # trial utts: total number of utterances used in trials (across all speakers)
- # target trials: number of target trials (enrollment speaker == trial speaker)
- # nontarget trials: number of nontarget trials (enrollment speaker != trial speaker)
- # words: total number of words across all trial utterances (the WER is computed based on them)
- avg. utt length: average length of all utterances in the dataset, in seconds

Development Data

Total dataset statistics:

| Dataset | Lang | # speakers | # add. trial speakers | # enroll utts | # trial utts | # target trials | # nontarget trials | # words | avg. utt length (s) |
|---|---|---|---|---|---|---|---|---|---|
| MLS | de | 20 | 10 | 333 | 3,136 | 1,936 | 29,424 | 111,245 | 14.90 |
| MLS | fr | 18 | 0 | 354 | 2,062 | 2,062 | 16,496 | 73,007 | 15.03 |
| MLS | it | 10 | 0 | 183 | 1,065 | 1,065 | 4,260 | 34,636 | 14.88 |
| MLS | es | 20 | 0 | 349 | 2,059 | 2,059 | 18,531 | 74,782 | 14.95 |
| MLS | pt | 10 | 0 | 119 | 707 | 707 | 2,828 | 24,733 | 15.90 |
| MLS | nl | 6 | 0 | 461 | 2,634 | 2,634 | 5,268 | 110,384 | 14.83 |
| CV | en | 30 | 10 | 279 | 2,306 | 1,691 | 32,899 | 22,394 | 4.96 |
| CV | de | 30 | 9 | 289 | 2,396 | 1,738 | 34,202 | 21,421 | 5.16 |
| CV | fr | 30 | 10 | 293 | 2,432 | 1,761 | 34,719 | 23,240 | 4.93 |
| CV | it | 30 | 10 | 286 | 2,421 | 1,722 | 34,593 | 23,745 | 5.69 |
| CV | es | 30 | 10 | 290 | 2,401 | 1,735 | 34,280 | 22,569 | 5.23 |
| CV | pt | 30 | 10 | 284 | 2,398 | 1,708 | 34,262 | 16,411 | 3.97 |
| CV | nl | 30 | 10 | 288 | 2,401 | 1,725 | 34,290 | 22,108 | 4.51 |
| CV | pl | 20 | 8 | 199 | 1,739 | 1,196 | 16,194 | 12,876 | 4.39 |
| CV | ru | 30 | 10 | 289 | 2,429 | 1,737 | 34,698 | 20,599 | 5.24 |

Dataset statistics per speaker (average):

| Dataset | Lang | # enroll utts | # trial utts | # target trials | # nontarget trials | # words |
|---|---|---|---|---|---|---|
| MLS | de | 16.6 | 104.5 | 96.8 | 1,471.2 | 3,553.0 |
| MLS | fr | 19.7 | 114.6 | 114.6 | 916.4 | 4,350.0 |
| MLS | it | 18.3 | 106.5 | 106.5 | 426.0 | 4,405.5 |
| MLS | es | 17.4 | 103.0 | 103.0 | 926.6 | 4,300.5 |
| MLS | pt | 11.9 | 70.7 | 70.7 | 282.8 | 2,084.0 |
| MLS | nl | 76.8 | 439.0 | 439.0 | 878.0 | 45,515.0 |
| CV | en | 9.3 | 59.1 | 56.4 | 1,096.6 | 627.0 |
| CV | de | 9.6 | 59.9 | 57.9 | 1,140.1 | 492.5 |
| CV | fr | 9.8 | 60.8 | 58.7 | 1,157.3 | 606.0 |
| CV | it | 9.5 | 60.5 | 57.4 | 1,153.1 | 594.5 |
| CV | es | 9.7 | 60.0 | 57.8 | 1,142.7 | 494.0 |
| CV | pt | 9.5 | 60.0 | 56.9 | 1,142.1 | 308.0 |
| CV | nl | 9.6 | 60.0 | 57.5 | 1,143.0 | 520.5 |
| CV | pl | 10.0 | 62.1 | 59.8 | 809.7 | 514.5 |
| CV | ru | 9.6 | 60.7 | 57.9 | 1,156.6 | 502.0 |

Test Data

Total dataset statistics:

| Dataset | Lang | # speakers | # add. trial speakers | # enroll utts | # trial utts | # target trials | # nontarget trials | # words | avg. utt length (s) |
|---|---|---|---|---|---|---|---|---|---|
| MLS | de | 30 | 0 | 329 | 3,065 | 1,906 | 28,744 | 110,202 | 15.18 |
| MLS | fr | 18 | 0 | 357 | 2,069 | 2,069 | 16,552 | 79,524 | 14.94 |
| MLS | it | 10 | 0 | 185 | 1,077 | 1,077 | 4,308 | 34,796 | 15.07 |
| MLS | es | 20 | 0 | 348 | 2,037 | 2,037 | 18,333 | 75,536 | 15.11 |
| MLS | pt | 10 | 0 | 125 | 746 | 746 | 2,984 | 26,769 | 15.47 |
| MLS | nl | 6 | 0 | 458 | 2,617 | 2,617 | 5,234 | 108,489 | 14.96 |
| CV | en | 30 | 9 | 289 | 2,344 | 1,733 | 33,427 | 22,560 | 5.09 |
| CV | de | 30 | 10 | 284 | 2,377 | 1,713 | 33,942 | 21,492 | 5.24 |
| CV | fr | 30 | 10 | 291 | 2,408 | 1,745 | 34,275 | 22,887 | 4.74 |
| CV | it | 30 | 10 | 298 | 2,444 | 1,786 | 34,874 | 24,091 | 5.37 |
| CV | es | 30 | 10 | 284 | 2,377 | 1,703 | 33,952 | 22,655 | 5.25 |
| CV | pt | 30 | 7 | 290 | 2,213 | 1,743 | 31,452 | 15,727 | 4.23 |
| CV | nl | 30 | 10 | 283 | 2,254 | 1,704 | 32,106 | 19,913 | 4.32 |
| CV | pl | 20 | 8 | 196 | 1,712 | 1,181 | 15,939 | 13,809 | 4.82 |
| CV | ru | 30 | 10 | 294 | 2,435 | 1,759 | 34,766 | 20,509 | 5.13 |

Dataset statistics per speaker (average):

| Dataset | Lang | # enroll utts | # trial utts | # target trials | # nontarget trials | # words |
|---|---|---|---|---|---|---|
| MLS | de | 16.4 | 102.2 | 95.3 | 1,437.2 | 4,047.0 |
| MLS | fr | 19.8 | 114.9 | 114.9 | 919.6 | 3,945.5 |
| MLS | it | 18.5 | 107.7 | 107.7 | 430.8 | 3,658.0 |
| MLS | es | 17.4 | 101.8 | 101.8 | 916.6 | 4,870.5 |
| MLS | pt | 12.5 | 74.6 | 74.6 | 298.4 | 3,141.5 |
| MLS | nl | 76.3 | 436.2 | 436.2 | 872.3 | 24,892.5 |
| CV | en | 9.6 | 60.1 | 57.8 | 1,114.2 | 551.5 |
| CV | de | 9.5 | 59.4 | 57.1 | 1,131.4 | 568.5 |
| CV | fr | 9.7 | 60.2 | 58.2 | 1,145.8 | 618.0 |
| CV | it | 9.9 | 61.1 | 59.5 | 1,162.5 | 612.0 |
| CV | es | 9.5 | 59.4 | 56.8 | 1,131.7 | 610.0 |
| CV | pt | 9.7 | 59.7 | 58.1 | 1,048.4 | 409.5 |
| CV | nl | 9.4 | 59.3 | 56.8 | 1,070.2 | 564.0 |
| CV | pl | 9.8 | 61.1 | 59.0 | 797.0 | 515.0 |
| CV | ru | 9.8 | 60.9 | 58.6 | 1,158.9 | 392.5 |

More Information

Paper

The paper in which this dataset is proposed will be published at Interspeech 2024. The preprint is available on arXiv:

Meyer, Sarina, Florian Lux, and Ngoc Thang Vu. "Probing the Feasibility of Multilingual Speaker Anonymization." arXiv preprint arXiv:2407.02937 (2024).
Code

All code related to this data, including the data and descriptions as well as preparation scripts to use this data for speaker anonymization, can be found in our GitHub repository: https://github.com/DigitalPhonetics/speaker-anonymization/tree/multilingual
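As a quick illustration of the file formats described under Structure, the sketch below reads a trial file and the utt2spk file and checks that every target/nontarget label agrees with the true speaker mapping. It is not taken from the repository. It assumes that each utt2spk line has the form "utterance ID, space, speaker ID" (the usual Kaldi convention, which the description does not state explicitly), and the function names and example IDs are hypothetical.

```python
import io


def read_trials(lines):
    """Parse lines of the form '<enrollment speaker> <trial utterance> <target|nontarget>'."""
    trials = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        enroll_spk, utt, label = line.split()
        if label not in ("target", "nontarget"):
            raise ValueError(f"unexpected label: {label!r}")
        trials.append((enroll_spk, utt, label))
    return trials


def read_utt2spk(lines):
    """Parse lines of the form '<utterance ID> <speaker ID>' (assumed Kaldi-style layout)."""
    mapping = {}
    for line in lines:
        if line.strip():
            utt, spk = line.split()
            mapping[utt] = spk
    return mapping


def mislabeled(trials, utt2spk):
    """Return the trials whose label disagrees with the true speaker in utt2spk."""
    bad = []
    for enroll_spk, utt, label in trials:
        expected = "target" if utt2spk[utt] == enroll_spk else "nontarget"
        if label != expected:
            bad.append((enroll_spk, utt, label))
    return bad


# Hypothetical in-memory stand-ins for files such as dev_trials_f and utt2spk.
trial_file = io.StringIO(
    "spk_a utt_1 target\n"
    "spk_b utt_1 nontarget\n"
    "spk_a utt_2 nontarget\n"
)
utt2spk_file = io.StringIO("utt_1 spk_a\nutt_2 spk_c\n")

trials = read_trials(trial_file)
utt2spk = read_utt2spk(utt2spk_file)
print(len(trials), mislabeled(trials, utt2spk))
```

With the real data, the `io.StringIO` objects would be replaced by file handles, for example `open(path, encoding="utf-8")` on a language subfolder's trial and utt2spk files.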

Created:

2024-07-23
