VOX-DUB

Name: VOX-DUB
Creator: maas
Published: 2025-12-05 16:50:35
License: No description available

ModelScope community · updated 2025-12-05 · indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/toloka/VOX-DUB


Resource description:

VOX-DUB is a human-based benchmark for evaluating AI dubbing systems. It includes:

- Audio fragments with original speech from real videos and their corresponding translated texts.
- Generated audio recordings produced by multiple dubbing/TTS systems.
- Human annotation results with pairwise A/B (+ SAME) evaluations across five aspects (pronunciation, naturalness, sound quality, emotion similarity, and voice similarity).
- Detailed annotation guidelines with examples for pairwise A/B comparisons.

Systems under evaluation are expected to generate the provided translations conditioned on the original speech audio. Their outputs are then compared against each other. The benchmark assesses dubbing quality across five key aspects: **pronunciation, naturalness, sound quality, emotion similarity, and voice similarity**.

# Dataset structure

This repository exposes three datasets:

- **`source_data`**: original utterances with translations and speaker context
- **`synthesized_data`**: audios generated by different TTS/dubbing systems
- **`annotations`**: human A/B (+ SAME) pairwise judgments across five aspects

---

## Loading

```python
from datasets import load_dataset

# Each configuration is loaded by name and read from its "train" split.
source_data = load_dataset("toloka/vox-dub", name="source_data")["train"]
synthesized_data = load_dataset("toloka/vox-dub", name="synthesized_data")["train"]
annotations = load_dataset("toloka/vox-dub", name="annotations")["train"]
```

## Source data

`source_data` contains original speech segments and metadata for generation/evaluation.

### Features

- **`utterance_id`** *(string)*: unique identifier of the utterance
- **`source_language`** *(string)*: original language code (ISO-639-1), e.g. `"de"`
- **`translation_en`** *(string)*: English translation of the utterance
- **`translation_es`** *(string)*: Spanish translation of the utterance
- **`original_utterance_audio`** *(Audio)*: audio clip of the original spoken utterance
- **`other_speaker_utterances`** *(Sequence[Audio])*: additional utterances by the **same** speaker/character (useful for speaker conditioning / voice cloning)

Audio columns are Hugging Face `Audio` features with on-the-fly decoding.

## Synthesized data

`synthesized_data` contains the system outputs for the target dubbing task.

### Features

- **`utterance_id`** *(string)*: links back to `source_data.utterance_id`
- **`system`** *(string)*: name/identifier of the synthesis provider/model
- **`language`** *(string)*: generation language code (ISO-639-1)
- **`audio`** *(Audio)*: synthesized audio for the target line

Multiple rows can exist per `utterance_id` (different systems and/or languages).

## Annotations

`annotations` contains human pairwise A/B (+ SAME) annotations across five aspects.

### Features

- **`utterance_id`** *(string)*: evaluated utterance (ties to `source_data`)
- **`language`** *(string)*: comparison language for this judgment
- **`system_A`** *(string)*: first system in the pair
- **`system_B`** *(string)*: second system in the pair
- **`user`** *(string)*: anonymized annotator ID
- **`pronunciation`** *(string ∈ {`"A"`, `"B"`, `"SAME"`})*: preference on pronunciation
- **`naturalness`** *(string ∈ {`"A"`, `"B"`, `"SAME"`})*: preference on naturalness
- **`sound_quality`** *(string ∈ {`"A"`, `"B"`, `"SAME"`})*: preference on audio quality
- **`emotion_similarity`** *(string ∈ {`"A"`, `"B"`, `"SAME"`})*: preference on emotion similarity
- **`voice_similarity`** *(string ∈ {`"A"`, `"B"`, `"SAME"`})*: preference on voice similarity

Each row is one A/B comparison for a single combination of `utterance_id`, `language`, `system_A`, and `system_B`, made by a single `user`.
You can aggregate labels from different annotators with majority vote, Dawid–Skene, etc.

---

# Guidelines for annotators

Each entry contains three short audio samples: an original (reference) audio and its translation, read aloud by two different speech synthesis systems (Audio A and Audio B).

Please listen to **Audio A** and **Audio B** and decide which one is better. There are five separate parameters for evaluation: **pronunciation**, **naturalness**, **sound quality**, **emotion similarity** and **voice similarity**.

To evaluate the audio samples on the first two parameters (pronunciation and naturalness), please listen **only** to Audio A and Audio B and compare them. Ignore the reference completely.

To evaluate sound quality, it is also **usually enough** to listen to Audio A and Audio B.

To evaluate emotion similarity and voice similarity, you need to compare both samples to the reference audio.

For each parameter, choose which system performs better, A or B, or choose SAME if the samples are equally good or equally bad.

**How to make a decision: a life hack that we find useful**

If both samples have faults and you are not sure which one is better, ask yourself: if I were doing a real dubbing project and had to choose one of these two samples to paste into my video, which one would I choose (naturalness-wise, pronunciation-wise, or based on whichever other parameter I am currently struggling with)?

In tricky cases, when the answer is not evident, **explain your decision** in the comments.

**Do not overthink it!** If you have listened to the samples 2-3 times and still can’t choose one of them, they can safely be considered equal.

## 💬 Pronunciation

Play the samples and compare them to the **text**. Are all the words in the text pronounced correctly?

Aspects to pay attention to:

- Was all the text pronounced, or is something missing?
- Correct pronunciation of words/syllables/sounds,
- Word stress,
- Non-native accent or accent of a different region.
- Phoneme reduction is not a mistake if widely used.
- If the phrase seems to be pronounced correctly, but is difficult to make out due to reverberation, distortions and “bad microphone” effect, it is a problem of Sound quality, not Pronunciation.

⚠️ **Challenging examples:**

`This happens every day.`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Accent_Bad.wav" type="audio/wav">
  </audio>
  <figcaption>Non-native accent should be penalized.</figcaption>
</figure>

`If you recommend them in the comments.`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/sound_is_lost.wav" type="audio/wav">
  </audio>
  <figcaption>“You” and plural form “s” were lost. This mistake should be penalized.</figcaption>
</figure>

`You cannot fool a father's heart.`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Muffled_Good.wav" type="audio/wav">
    </audio>
    <figcaption>The voice in Audio A is clearer.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Muffled_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>Audio B sounds muffled, but all the sounds seem to be in place. This is an issue of <b>Sound quality</b>, not Pronunciation</figcaption>
  </figure>
</div>

## 🌿 Naturalness

Does the **speech** in the samples sound natural?
Does it sound like real human speech, or could you suspect a robot?

‼️ **Ignore the reference audio** while assessing this parameter.

Aspects to pay attention to:

- Correct usage of affirmative and interrogative intonation,
- Logical stress in the sentence,
- Robotic, monotonous, unnatural intonation,
- Unusually slow or fast speech,
- Pauses within the sentence: they may be natural or not,
- Breathing, snorts or other human sounds: they may occur in natural places or not.

⚠️ **Challenging examples:**

`My king, sometimes people only want to hear good things from Paro.`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/unnatural_emphasis.wav" type="audio/wav">
  </audio>
  <figcaption>Emphasis on “my” instead of “king” sounds unnatural</figcaption>
</figure>

`He is hitting on my wife.`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Interrog_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>The intonation suggests a question or an unfinished sentence</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Interrog_Good.wav" type="audio/wav">
    </audio>
    <figcaption>Affirmative intonation matches better</figcaption>
  </figure>
</div>

`You will also return all his belongings.`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/robotic_intonation.wav" type="audio/wav">
  </audio>
  <figcaption>Inhalation at the start sounds quite natural, no penalty for that. However, the intonation is not dropping down enough and sounds a bit robotic.</figcaption>
</figure>

`Serve me, can't you see I'm hungry?`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Naturalness_12_A.wav" type="audio/wav">
    </audio>
    <figcaption>The intonation here is flat and unnatural.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Naturalness_12_B.wav" type="audio/wav">
    </audio>
    <figcaption>The intonation is much more natural and human-like. Audio B wins.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Naturalness_12_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>The reference is actually closer to Audio A, but we <b>ignore it</b> when judging Naturalness.</figcaption>
  </figure>
</div>

## 🎧 Sound quality

**Please use your headphones** to assess the sound quality!

To understand what we consider good sound quality, imagine that you are working on a real dubbing project. The version that you pick will be pasted into your translated video over all its background sounds.

The best possible option (in terms of sound quality) is clean, rich, studio sound. But sometimes we are ready to accept some non-studio sound features if they convey interesting properties of the original voice and make our dubbing more life-like. A voice sounds different when heard over the phone, when the speaker is at a distance, at the bottom of a well, in an empty room, or has a cold, or when the speaker is a ghost and talks with an eerie echo.
Basically, the speaker can even be an actual robot with a very mechanical voice. So, if the original voice has such interesting peculiarities, we:

- Don’t penalize for them in Sound quality (consider them as good as studio-quality voice)
- Slightly encourage them in Voice similarity.

This is our general logic, but for your convenience we also wrote a set of rules that may help if you prefer a formal approach.

1. **Clear studio quality sound is always good**, even if the reference audio has issues and peculiarities.
2. These issues are **always** considered **bad** and should be penalized, because they would always make our dubbing worse:
   - Background noise
   - Claps, clicks and other non-speech artefacts.
3. These and similar issues are **usually** considered bad:
   - Distorted, mechanical voice,
   - Flat sound, like over an old telephone, or from a distance,
   - Bad microphone effect: as if a person pronounced all the sounds correctly, but they were corrupted by a bad microphone,
   - Echo

   If you hear these issues, **please check**: do they mimic an interesting feature of the original reference? Do they make our dubbing better?

   * **YES,** they mimic the original speaker, and I would like to hear them in my dubbing ➡️ Don’t penalize for them.
   * **YES**, they kind of mimic the original speaker, but I still think they would make dubbing worse ➡️ Penalize for them.
   * **NO,** they occur only in translation ➡️ They are bugs, penalize for them.

⚠️ **Challenging examples:**

`But don’t tell Mom, ok? It’s our secret.`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/sound_clap.wav" type="audio/wav">
  </audio>
  <figcaption>Random clap sound at the start should be penalized. You should be able to hear it with your headphones.</figcaption>
</figure>

`That’s your doing!`

<figure style="margin:0; width:250px; text-align:center;">
  <audio controls style="width:100%;">
    <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Distorted.wav" type="audio/wav">
  </audio>
  <figcaption>This is an example of a slightly distorted voice.</figcaption>
</figure>

`Is it a piece of cake?`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Piece_of_cake_A.wav" type="audio/wav">
    </audio>
    <figcaption>This sample has a slightly clearer voice than the other.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/less_clear_sound.wav" type="audio/wav">
    </audio>
    <figcaption>Less clear sound</figcaption>
  </figure>
</div>

`And that I wouldn't be alone anymore.`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Background_music_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>Background hissing and music should be penalized</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Background_music_Good.wav" type="audio/wav">
    </audio>
    <figcaption>This sample wins…</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Background_music_Reference.wav" type="audio/wav">
    </audio>
    <figcaption>…even if the same background sounds are present in the original.</figcaption>
  </figure>
</div>

`Alright.`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Sound_quality_6_A.wav" type="audio/wav">
    </audio>
    <figcaption>The voice is croaky, and it may seem like a sound quality bug, but if we listen to the reference, we hear that it perfectly recreates the timbre of the original speaker.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Sound_quality_6_B.wav" type="audio/wav">
    </audio>
    <figcaption>The two samples are equal in terms of Sound quality, but B loses in Voice similarity.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Sound_quality_6_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>Original audio.</figcaption>
  </figure>
</div>

`This happens every day.`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Echo_A.wav" type="audio/wav">
    </audio>
    <figcaption>Audio A has the same reverberation as the reference. It would sound good in dubbing, don’t penalize for that.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Echo_B.wav" type="audio/wav">
    </audio>
    <figcaption>Audio B has reverberation as well, and we agreed not to penalize for that. However, there is a slight background hiss that should be penalized.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Echo_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>The reference has some natural reverberation.</figcaption>
  </figure>
</div>

## 🙀 Emotion similarity

Listen to the samples **and the reference audio.** How well do the samples reproduce the emotion of the reference?

Aspects to pay attention to:

- Emotion of the speaker (sad, angry, happy, calm, soft, loud),
- Expressiveness of both audios (narration vs. spontaneous speech),
- Interrogative intonation should be assessed by the “naturalness” parameter, not emotion similarity.

⚠️ **Challenging examples:**

`What are you looking for?`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>Reference.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_Good.wav" type="audio/wav">
    </audio>
    <figcaption>Rushed, agitated speech is a better match.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>Calm speech is a worse match.</figcaption>
  </figure>
</div>

`Am I a victim?`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_victim_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>Reference.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_victim_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>Calmer, colder speech matches worse.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Emotion_victim_Good.wav" type="audio/wav">
    </audio>
    <figcaption>The difference is slight, but this audio is more emotional, and matches better</figcaption>
  </figure>
</div>

## 👥 Voice similarity

Listen to the samples **and the reference audio.** How similar are the voices in Audio A and Audio B to the reference speaker’s voice?

Aspects to pay attention to:

- Timbre,
- Pitch,
- Estimated age and gender,
- Other voice characteristics (is it far or near? is it in the same room or over the phone? does it have reverberation?). These are less important than the other aspects listed here, but they can be viewed as a bonus.

⚠️ **Challenging examples:**

`Got a problem with that?`

<div style="display: flex; gap: 20px; flex-wrap: wrap;">
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Voice_similarity_old_Ref.wav" type="audio/wav">
    </audio>
    <figcaption>Reference. When we hear it, we imagine a senior woman.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Voice_similarity_old_Bad.wav" type="audio/wav">
    </audio>
    <figcaption>This voice sounds younger.</figcaption>
  </figure>
  <figure style="margin:0; width:250px; text-align:center;">
    <audio controls style="width:100%;">
      <source src="https://huggingface.co/datasets/toloka/VOX-DUB/resolve/main/instruction/Voice_similarity_old_Good.wav" type="audio/wav">
    </audio>
    <figcaption>This voice matches better.</figcaption>
  </figure>
</div>

# Reference

```
@misc{toloka2025vox-dub,
  title = {VOX-DUB: a new benchmark that puts AI dubbing to the test},
  author = {{Toloka team}},
  howpublished = {\url{https://toloka.ai/blog/ai-dubbing-benchmark/}},
  year = {2025},
  month = sep # "~9",
  note = {Accessed: 2025-09-10},
}
```
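
As a follow-up to the Annotations section, the majority-vote aggregation it mentions could be sketched roughly as below. This is only an illustration, not the benchmark authors' scoring method: the function names `majority_vote` and `win_rates`, the rule that a tied vote resolves to `"SAME"`, and the half-win credit for `"SAME"` are all assumptions. The code works on any iterable of dict-like rows that carry the `annotations` columns listed above.

```python
from collections import Counter, defaultdict

# Aspect columns as listed in the `annotations` feature description.
ASPECTS = ("pronunciation", "naturalness", "sound_quality",
           "emotion_similarity", "voice_similarity")


def majority_vote(rows, aspect):
    """Aggregate annotator labels into one label per comparison.

    Returns {(utterance_id, language, system_A, system_B): "A" | "B" | "SAME"}.
    A tie between the most frequent labels resolves to "SAME" (an assumption).
    """
    votes = defaultdict(Counter)
    for row in rows:
        key = (row["utterance_id"], row["language"],
               row["system_A"], row["system_B"])
        votes[key][row[aspect]] += 1

    aggregated = {}
    for key, counter in votes.items():
        ranked = counter.most_common()
        label, count = ranked[0]
        if len(ranked) > 1 and ranked[1][1] == count:
            label = "SAME"  # no strict majority among annotators
        aggregated[key] = label
    return aggregated


def win_rates(rows, aspect):
    """Per-system win rate over aggregated comparisons.

    A win counts as 1, a "SAME" verdict as 0.5 for both systems (an assumption).
    """
    wins, totals = Counter(), Counter()
    for (_, _, sys_a, sys_b), label in majority_vote(rows, aspect).items():
        totals[sys_a] += 1
        totals[sys_b] += 1
        if label == "A":
            wins[sys_a] += 1
        elif label == "B":
            wins[sys_b] += 1
        else:
            wins[sys_a] += 0.5
            wins[sys_b] += 0.5
    return {system: wins[system] / totals[system] for system in totals}
```

With the dataset loaded as in the Loading section, something like `win_rates(annotations, "naturalness")` would give one number per system. Majority vote ignores differences in annotator reliability, which is where a model such as Dawid–Skene may be the better choice.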

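Because `synthesized_data` links back to `source_data` through `utterance_id`, the two tables can be joined to recover the text each system was asked to speak. The sketch below is a hedged illustration: `attach_target_text` is a hypothetical helper, and it assumes that the `language` values are `"en"` or `"es"` so that they map onto the `translation_en` / `translation_es` columns. Rows are treated as plain dicts.

```python
def attach_target_text(synth_rows, source_rows):
    """Join synthesized rows with their source utterance by `utterance_id`.

    Assumes `language` is "en" or "es", matching the `translation_<lang>`
    columns of `source_data`. Rows without a matching source are skipped.
    """
    # Index the source table once so each lookup is O(1).
    source_by_id = {row["utterance_id"]: row for row in source_rows}

    joined = []
    for row in synth_rows:
        source = source_by_id.get(row["utterance_id"])
        if source is None:
            continue  # no source utterance for this ID
        joined.append({
            "utterance_id": row["utterance_id"],
            "system": row["system"],
            "language": row["language"],
            "source_language": source["source_language"],
            # None if there is no translation column for this language.
            "target_text": source.get(f"translation_{row['language']}"),
        })
    return joined
```

Iterating over the loaded datasets decodes the `Audio` columns on the fly, so for a text-only join it may be worth dropping those columns first (for example with `remove_columns`) before passing the rows in.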
Provider:

maas

Created:

2025-09-15
