Malaysian-Emilia-annotated
ModelScope community. Updated 2025-10-10, indexed 2025-10-11.
Download link: https://modelscope.cn/datasets/mesolitica/Malaysian-Emilia-annotated
# Malaysian Emilia Annotated
Annotations for [Malaysian-Emilia](https://huggingface.co/datasets/mesolitica/Malaysian-Emilia), produced with the [Data-Speech](https://github.com/huggingface/dataspeech) pipeline.
## Malaysian Youtube
1. Originally from [malaysia-ai/crawl-youtube](https://huggingface.co/datasets/malaysia-ai/crawl-youtube)
2. Total 3168.8 hours.
3. Gender prediction, [filtered-24k_processed_24k_gender.zip](filtered-24k_processed_24k_gender.zip)
4. Language prediction, [filtered-24k_processed_language.zip](filtered-24k_processed_language.zip)
5. Force alignment.
6. Post-cleaned audio at 24k and 44k sampling rates,
- 24k, [filtered-24k_processed_24k.zip](filtered-24k_processed_24k.zip)
- 44k, [filtered-24k_processed_44k.zip](filtered-24k_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-youtube.parquet](data/malaysian_emilia_youtube-00000-of-00001.parquet),
```python
{'transcription': "Hey guys, assalamualaikum. It's me, Nina and welcome back to Nina's Story. Setiap negara ada undang-undang yang tersendiri untuk menghukum orang yang melakukan kesalahan.",
'gender': 'female',
'country': 'malaysian',
'utterance_pitch_mean': 218.09979248046875,
'utterance_pitch_std': 44.81846237182617,
'snr': 58.60026931762695,
'c50': 59.760154724121094,
'speech_duration': 9.365625381469727,
'stoi': 0.9753543138504028,
'si-sdr': 13.493837356567383,
'pesq': 2.6889467239379883,
'pitch': 'slightly low pitch',
'speaking_rate': 'slightly slowly',
'reverberation': 'very confined sounding',
'speech_monotony': 'very monotone',
'sdr_noise': 'slightly noisy',
'audio_filename': 'filtered-24k_processed_24k/00463-21/00463-21_0.mp3'}
```
**Prompts are still being generated for this subset.**
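Because this subset's sample carries quality metrics but no `prompt` key yet, a consumer may want to screen records on both. The following is a minimal sketch using only the field names visible in the sample above; the threshold values and the helper names `is_clean` and `has_prompt` are illustrative assumptions, not settings from the dataset authors.

```python
# Illustrative thresholds only: assumptions, not values chosen by the dataset authors.
MIN_SNR = 30.0
MIN_PESQ = 2.5
MIN_STOI = 0.95


def is_clean(record, min_snr=MIN_SNR, min_pesq=MIN_PESQ, min_stoi=MIN_STOI):
    """Return True when snr, pesq and stoi all meet their thresholds."""
    return (
        record["snr"] >= min_snr
        and record["pesq"] >= min_pesq
        and record["stoi"] >= min_stoi
    )


def has_prompt(record):
    """Return True when the record carries a non-empty 'prompt' field."""
    return bool(record.get("prompt"))


# Subset of the fields from the sample record shown above.
sample = {
    "snr": 58.60026931762695,
    "pesq": 2.6889467239379883,
    "stoi": 0.9753543138504028,
    "audio_filename": "filtered-24k_processed_24k/00463-21/00463-21_0.mp3",
}

print(is_clean(sample), has_prompt(sample))
```

Using `record.get("prompt")` rather than `record["prompt"]` keeps the check safe for records where the key is absent.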
## Malaysian Podcast
1. Originally from [malaysia-ai/crawl-youtube-malaysian-podcast](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-malaysian-podcast)
2. Total 622.8 hours.
3. Gender prediction, [malaysian-podcast_processed_24k_gender.zip](malaysian-podcast_processed_24k_gender.zip)
4. Language prediction, [malaysian-podcast_processed_language.zip](malaysian-podcast_processed_language.zip)
5. Force alignment, [malaysian-podcast_processed_alignment.zip](malaysian-podcast_processed_alignment.zip)
6. Post-cleaned audio at 24k and 44k sampling rates,
- 24k, [malaysian-podcast_processed_24k.zip](malaysian-podcast_processed_24k.zip)
- 44k, [malaysian-podcast_processed_44k.zip](malaysian-podcast_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-podcast.parquet](data/malaysian_emilia_podcast-00000-of-00001.parquet),
```python
{'transcription': 'Cara nak apply, macam Puteri kan time internship. So, Puteri punya keluar dekat group internship, aa, dia keluar satu form.',
'gender': 'female',
'country': 'malaysian',
'utterance_pitch_mean': 259.931396484375,
'utterance_pitch_std': 46.01287841796875,
'snr': 41.81050491333008,
'c50': 59.3415641784668,
'speech_duration': 7.661250114440918,
'stoi': 0.9756626486778259,
'si-sdr': 20.618106842041016,
'pesq': 3.326802968978882,
'pitch': 'moderate pitch',
'speaking_rate': 'quite slowly',
'noise': 'moderate ambient sound',
'reverberation': 'very confined sounding',
'speech_monotony': 'very monotone',
'sdr_noise': 'almost no noise',
'audio_filename': 'malaysian-podcast_processed_44k/Cara Nak Apply Student Exchange [vFhLEniT9X8]/Cara Nak Apply Student Exchange [vFhLEniT9X8]_0.mp3',
'prompt': 'A Malaysian woman delivers a very monotone speech with a moderate pitch, speaking quite slowly in a very confined and almost noise-free environment.'}
```
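The `transcription` and `speech_duration` fields in a record like the one above allow a rough numeric check against the textual `speaking_rate` label. The sketch below computes whitespace-separated words per second; this is only a simple proxy, and the pipeline's own `speaking_rate` label may be derived differently. The helper name `words_per_second` is hypothetical.

```python
def words_per_second(transcription, speech_duration):
    """Rough speaking-rate proxy: whitespace-separated tokens per second."""
    if speech_duration <= 0:
        raise ValueError("speech_duration must be positive")
    return len(transcription.split()) / speech_duration


# Values copied from the sample record above.
transcription = (
    "Cara nak apply, macam Puteri kan time internship. "
    "So, Puteri punya keluar dekat group internship, aa, dia keluar satu form."
)
speech_duration = 7.661250114440918

rate = words_per_second(transcription, speech_duration)
print(f"{rate:.2f} words per second")
```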
## Singaporean Podcast
1. Originally from [malaysia-ai/crawl-youtube-singaporean-podcast](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-singaporean-podcast)
2. Total 175.9 hours.
3. Gender prediction, [sg-podcast_processed_24k_gender.zip](sg-podcast_processed_24k_gender.zip)
4. Language prediction, [sg-podcast_processed_language.zip](sg-podcast_processed_language.zip)
5. Force alignment, [malaysian-podcast_processed_alignment.zip](malaysian-podcast_processed_alignment.zip)
6. Post-cleaned audio at 24k and 44k sampling rates,
- 24k, [sg-podcast_processed_24k.zip](sg-podcast_processed_24k.zip)
- 44k, [sg-podcast_processed_44k.zip](sg-podcast_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-podcast.parquet](data/malaysian_emilia_podcast-00000-of-00001.parquet),
```python
{'transcription': "You just know, wherever you go in the world, the asshole is always in control. It's true.",
'gender': 'male',
'country': 'singaporean',
'utterance_pitch_mean': 124.18851470947266,
'utterance_pitch_std': 32.084354400634766,
'snr': 69.38728332519531,
'c50': 59.84521484375,
'speech_duration': 4.910624980926514,
'stoi': 0.9785327315330505,
'si-sdr': 16.752330780029297,
'pesq': 2.8572096824645996,
'pitch': 'very low pitch',
'speaking_rate': 'very slowly',
'noise': 'very clear',
'reverberation': 'very confined sounding',
'speech_monotony': 'very monotone',
'sdr_noise': 'slightly noisy',
'audio_filename': 'sg-podcast_processed_44k/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]_0.mp3',
'prompt': 'A Singaporean man speaks with a very monotone and very low-pitched voice, creating a very confined and slightly echo-y sound. The recording is slightly noisy but still allows for clear understanding.'}
```
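In the sample records, the first path component of `audio_filename` matches the stem of one of the audio archives listed for each subset (for example `filtered-24k_processed_24k` and `filtered-24k_processed_24k.zip`). A minimal sketch of splitting the field on that basis follows. It assumes the first component always names the archive, which the card does not state explicitly, and it makes no claim about how paths are laid out inside the zip files. The helper name `split_audio_filename` is hypothetical.

```python
def split_audio_filename(audio_filename):
    """Split 'top_dir/rest.mp3' into ('top_dir.zip', 'rest.mp3').

    Assumption: the first path component is the archive stem.
    """
    top_dir, sep, member = audio_filename.partition("/")
    if not sep or not member:
        raise ValueError(f"expected 'dir/file' layout, got {audio_filename!r}")
    return top_dir + ".zip", member


archive, member = split_audio_filename(
    "filtered-24k_processed_24k/00463-21/00463-21_0.mp3"
)
print(archive)
print(member)
```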
## Malaysia Parliament
1. Originally from [malaysia-ai/crawl-youtube-malaysia-parliament](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-malaysia-parliament)
2. Total 2317.9 hours.
3. Gender prediction.
4. Language prediction, [parlimen-24k-chunk_processed_language.zip](parlimen-24k-chunk_processed_language.zip)
5. Force alignment.
6. Post-cleaned audio at 24k and 44k sampling rates,
- 24k, [parlimen-24k-chunk_processed_24k.zip](parlimen-24k-chunk_processed_24k.zip)
- 44k, [parlimen-24k-chunk_processed_44k.zip](parlimen-24k-chunk_processed_44k.zip)
7. Synthetic description, **prompts are still being generated**.
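Taken together, the per-subset durations stated in the four sections above can be summed as a quick sanity check. The figures are copied from the card; the dictionary name `HOURS` is just an illustrative label.

```python
# Hours per subset, as stated in each section of this card.
HOURS = {
    "Malaysian Youtube": 3168.8,
    "Malaysian Podcast": 622.8,
    "Singaporean Podcast": 175.9,
    "Malaysia Parliament": 2317.9,
}

# Round to one decimal place to hide floating-point noise from the sum.
total_hours = round(sum(HOURS.values()), 1)
print(total_hours)  # 6285.4
```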
## Source code
All source code is available at https://github.com/mesolitica/malaysian-dataset/tree/master/text-to-speech/emilia-dataspeech
Provider: maas
Created: 2025-10-03



