
Malaysian-Emilia-annotated

Source: ModelScope community (魔搭社区). Updated 2025-10-10; indexed 2025-10-11.
Download link: https://modelscope.cn/datasets/mesolitica/Malaysian-Emilia-annotated
# Malaysian Emilia Annotated

Annotations for [Malaysian-Emilia](https://huggingface.co/datasets/mesolitica/Malaysian-Emilia), produced with the [Data-Speech](https://github.com/huggingface/dataspeech) pipeline.

## Malaysian Youtube

1. Originally from [malaysia-ai/crawl-youtube](https://huggingface.co/datasets/malaysia-ai/crawl-youtube)
2. Total 3168.8 hours.
3. Gender prediction, [filtered-24k_processed_24k_gender.zip](filtered-24k_processed_24k_gender.zip)
4. Language prediction, [filtered-24k_processed_language.zip](filtered-24k_processed_language.zip)
5. Force alignment.
6. Post-cleaned to 24k and 44k sampling rates,
   - 24k, [filtered-24k_processed_24k.zip](filtered-24k_processed_24k.zip)
   - 44k, [filtered-24k_processed_44k.zip](filtered-24k_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-youtube.parquet](data/malaysian_emilia_youtube-00000-of-00001.parquet),

```python
{
    'transcription': "Hey guys, assalamualaikum. It's me, Nina and welcome back to Nina's Story. Setiap negara ada undang-undang yang tersendiri untuk menghukum orang yang melakukan kesalahan.",
    'gender': 'female',
    'country': 'malaysian',
    'utterance_pitch_mean': 218.09979248046875,
    'utterance_pitch_std': 44.81846237182617,
    'snr': 58.60026931762695,
    'c50': 59.760154724121094,
    'speech_duration': 9.365625381469727,
    'stoi': 0.9753543138504028,
    'si-sdr': 13.493837356567383,
    'pesq': 2.6889467239379883,
    'pitch': 'slightly low pitch',
    'speaking_rate': 'slightly slowly',
    'reverberation': 'very confined sounding',
    'speech_monotony': 'very monotone',
    'sdr_noise': 'slightly noisy',
    'audio_filename': 'filtered-24k_processed_24k/00463-21/00463-21_0.mp3'
}
```

**Prompt generation is still in progress.**

## Malaysian Podcast

1. Originally from [malaysia-ai/crawl-youtube-malaysian-podcast](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-malaysian-podcast)
2. Total 622.8 hours.
3. Gender prediction, [malaysian-podcast_processed_24k_gender.zip](malaysian-podcast_processed_24k_gender.zip)
4. Language prediction, [malaysian-podcast_processed_language.zip](malaysian-podcast_processed_language.zip)
5. Force alignment, [malaysian-podcast_processed_alignment.zip](malaysian-podcast_processed_alignment.zip)
6. Post-cleaned to 24k and 44k sampling rates,
   - 24k, [malaysian-podcast_processed_24k.zip](malaysian-podcast_processed_24k.zip)
   - 44k, [malaysian-podcast_processed_44k.zip](malaysian-podcast_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-podcast.parquet](data/malaysian_emilia_podcast-00000-of-00001.parquet),

```python
{
    'transcription': 'Cara nak apply, macam Puteri kan time internship. So, Puteri punya keluar dekat group internship, aa, dia keluar satu form.',
    'gender': 'female',
    'country': 'malaysian',
    'utterance_pitch_mean': 259.931396484375,
    'utterance_pitch_std': 46.01287841796875,
    'snr': 41.81050491333008,
    'c50': 59.3415641784668,
    'speech_duration': 7.661250114440918,
    'stoi': 0.9756626486778259,
    'si-sdr': 20.618106842041016,
    'pesq': 3.326802968978882,
    'pitch': 'moderate pitch',
    'speaking_rate': 'quite slowly',
    'noise': 'moderate ambient sound',
    'reverberation': 'very confined sounding',
    'speech_monotony': 'very monotone',
    'sdr_noise': 'almost no noise',
    'audio_filename': 'malaysian-podcast_processed_44k/Cara Nak Apply Student Exchange [vFhLEniT9X8]/Cara Nak Apply Student Exchange [vFhLEniT9X8]_0.mp3',
    'prompt': 'A Malaysian woman delivers a very monotone speech with a moderate pitch, speaking quite slowly in a very confined and almost noise-free environment.'
}
```

## Singaporean Podcast

1. Originally from [malaysia-ai/crawl-youtube-singaporean-podcast](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-singaporean-podcast)
2. Total 175.9 hours.
3. Gender prediction, [sg-podcast_processed_24k_gender.zip](sg-podcast_processed_24k_gender.zip)
4. Language prediction, [sg-podcast_processed_language.zip](sg-podcast_processed_language.zip)
5. Force alignment, [malaysian-podcast_processed_alignment.zip](malaysian-podcast_processed_alignment.zip)
6. Post-cleaned to 24k and 44k sampling rates,
   - 24k, [sg-podcast_processed_24k.zip](sg-podcast_processed_24k.zip)
   - 44k, [sg-podcast_processed_44k.zip](sg-podcast_processed_44k.zip)
7. Synthetic description, [malaysian-emilia-podcast.parquet](data/malaysian_emilia_podcast-00000-of-00001.parquet),

```python
{
    'transcription': "You just know, wherever you go in the world, the asshole is always in control. It's true.",
    'gender': 'male',
    'country': 'singaporean',
    'utterance_pitch_mean': 124.18851470947266,
    'utterance_pitch_std': 32.084354400634766,
    'snr': 69.38728332519531,
    'c50': 59.84521484375,
    'speech_duration': 4.910624980926514,
    'stoi': 0.9785327315330505,
    'si-sdr': 16.752330780029297,
    'pesq': 2.8572096824645996,
    'pitch': 'very low pitch',
    'speaking_rate': 'very slowly',
    'noise': 'very clear',
    'reverberation': 'very confined sounding',
    'speech_monotony': 'very monotone',
    'sdr_noise': 'slightly noisy',
    'audio_filename': 'sg-podcast_processed_44k/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]_0.mp3',
    'prompt': 'A Singaporean man speaks with a very monotone and very low-pitched voice, creating a very confined and slightly echo-y sound. The recording is slightly noisy but still allows for clear understanding.'
}
```

## Malaysia Parliament

1. Originally from [malaysia-ai/crawl-youtube-malaysia-parliament](https://huggingface.co/datasets/malaysia-ai/crawl-youtube-malaysia-parliament)
2. Total 2317.9 hours.
3. Gender prediction.
4. Language prediction, [parlimen-24k-chunk_processed_language.zip](parlimen-24k-chunk_processed_language.zip)
5. Force alignment.
6. Post-cleaned to 24k and 44k sampling rates,
   - 24k, [parlimen-24k-chunk_processed_24k.zip](parlimen-24k-chunk_processed_24k.zip)
   - 44k, [parlimen-24k-chunk_processed_44k.zip](parlimen-24k-chunk_processed_44k.zip)
7. Synthetic description, **prompt generation is still in progress**.

## Source code

All source code is at https://github.com/mesolitica/malaysian-dataset/tree/master/text-to-speech/emilia-dataspeech
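
## Example: working with the annotation records

The following is a minimal sketch, not part of the official source code, of how the annotation records shown above might be filtered before use. It relies only on the Python standard library and on abridged copies of the three example records. The helper names (`total_hours`, `select_clean`, `archive_name`) and the SNR/PESQ thresholds are hypothetical. The mapping from the first component of `audio_filename` to a zip archive is an assumption inferred from the file names listed above.

```python
# Sketch only: records are abridged copies of the examples in this card.
# Helper names and thresholds are hypothetical, not part of the dataset tooling.

SUBSET_HOURS = {
    "Malaysian Youtube": 3168.8,
    "Malaysian Podcast": 622.8,
    "Singaporean Podcast": 175.9,
    "Malaysia Parliament": 2317.9,
}

RECORDS = [
    {   # Malaysian Youtube example (no 'prompt' yet)
        "gender": "female", "country": "malaysian",
        "snr": 58.60026931762695, "pesq": 2.6889467239379883,
        "speech_duration": 9.365625381469727,
        "audio_filename": "filtered-24k_processed_24k/00463-21/00463-21_0.mp3",
    },
    {   # Malaysian Podcast example
        "gender": "female", "country": "malaysian",
        "snr": 41.81050491333008, "pesq": 3.326802968978882,
        "speech_duration": 7.661250114440918,
        "audio_filename": "malaysian-podcast_processed_44k/Cara Nak Apply Student Exchange [vFhLEniT9X8]/Cara Nak Apply Student Exchange [vFhLEniT9X8]_0.mp3",
        "prompt": "A Malaysian woman delivers a very monotone speech with a moderate pitch, speaking quite slowly in a very confined and almost noise-free environment.",
    },
    {   # Singaporean Podcast example
        "gender": "male", "country": "singaporean",
        "snr": 69.38728332519531, "pesq": 2.8572096824645996,
        "speech_duration": 4.910624980926514,
        "audio_filename": "sg-podcast_processed_44k/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]/Have you heard about the 🧠& 🍑👌? #shorts [DiQFH3xhSoo]_0.mp3",
        "prompt": "A Singaporean man speaks with a very monotone and very low-pitched voice, creating a very confined and slightly echo-y sound. The recording is slightly noisy but still allows for clear understanding.",
    },
]


def total_hours(hours_by_subset):
    """Sum the per-subset durations, rounded to one decimal place."""
    return round(sum(hours_by_subset.values()), 1)


def select_clean(records, min_snr=40.0, min_pesq=2.8, require_prompt=True):
    """Keep records that meet the (hypothetical) quality thresholds."""
    selected = []
    for record in records:
        if record.get("snr", 0.0) < min_snr:
            continue
        if record.get("pesq", 0.0) < min_pesq:
            continue
        if require_prompt and not record.get("prompt"):
            continue
        selected.append(record)
    return selected


def archive_name(record):
    """Guess the zip archive from the first path component (an assumption)."""
    return record["audio_filename"].split("/", 1)[0] + ".zip"


if __name__ == "__main__":
    print("total hours:", total_hours(SUBSET_HOURS))
    for record in select_clean(RECORDS):
        print(record["country"], round(record["speech_duration"], 2), archive_name(record))
```

Records without a `prompt` field, such as the Malaysian Youtube example, are skipped by default; pass `require_prompt=False` to keep them while prompt generation for that subset is still in progress.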

Provider: maas
Created: 2025-10-03