fsicoli/common_voice_19_0
Source: Hugging Face. Updated: 2024-09-19. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/fsicoli/common_voice_19_0
Description:
This dataset is an unofficial version of the Mozilla Common Voice Corpus 19.0. It contains multilingual speech data intended for automatic speech recognition (ASR).

It covers more than 100 languages, including but not limited to: Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba.

The dataset can be loaded and preprocessed with the Hugging Face datasets library, and PyTorch dataloaders can be built on top of it.

Each record holds the path to an audio file and its corresponding sentence, along with the additional fields accent, age, client_id, up_votes, down_votes, gender, locale, and segment.

The data is released into the public domain under the CC0 license.
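The loading step and the record layout described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's own recipe. The configuration name "pt" and the split "train" are assumptions, and `load_split`, `missing_fields`, `keep_well_rated`, and `collate` are hypothetical helper names introduced here. Filtering on the vote margin is also just one possible preprocessing step. Depending on the installed datasets version, repositories that rely on a loading script may need extra arguments or may not load at all.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Field names as listed in the dataset description.
EXPECTED_FIELDS = (
    "path", "sentence", "accent", "age", "client_id",
    "up_votes", "down_votes", "gender", "locale", "segment",
)


def load_split(config: str = "pt", split: str = "train", streaming: bool = True):
    """Load one language configuration from the Hub (needs network access).

    "pt" and "train" are assumed values for illustration only.
    """
    # Imported lazily so the helpers below work without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "fsicoli/common_voice_19_0", config, split=split, streaming=streaming
    )


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def keep_well_rated(
    records: Iterable[Dict[str, Any]], min_margin: int = 1
) -> Iterator[Dict[str, Any]]:
    """Yield records whose up_votes exceed down_votes by at least min_margin."""
    for record in records:
        if int(record["up_votes"]) - int(record["down_votes"]) >= min_margin:
            yield record


def collate(batch: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group audio paths and sentences of a batch into parallel lists."""
    return {
        "path": [record["path"] for record in batch],
        "sentence": [record["sentence"] for record in batch],
    }
```

With PyTorch installed, `collate` could be passed as `collate_fn` to `torch.utils.data.DataLoader`. It returns plain lists rather than tensors, so audio decoding and padding would still have to be added for actual training.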
Provided by:
fsicoli



