common_voice_21_0
Common Voice Corpus 21.0 Dataset Overview
Basic Information
- License: CC0-1.0
- Task category: Automatic speech recognition
- Dataset name: Common Voice Corpus 21.0
- Size category: 100B < n < 1T
- Tags: mozilla, foundation
Language Support
The following languages are supported: Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba
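Usage
Loading the dataset
```python
from datasets import load_dataset

# "pt" selects the Portuguese configuration; split="train" loads the training split.
cv_21 = load_dataset("fsicoli/common_voice_21_0", "pt", split="train")
```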
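Streaming
```python
from datasets import load_dataset

# streaming=True yields examples on the fly instead of downloading the full split first.
cv_21 = load_dataset("fsicoli/common_voice_21_0", "pt", split="train", streaming=True)
print(next(iter(cv_21)))
```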
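A streamed split behaves like an iterable, so a handful of examples can be inspected without materializing the rest. The sketch below uses `itertools.islice` for that. `take` is a hypothetical helper, `fake_stream` is a hypothetical stand-in for the streamed `cv_21` object so that the snippet runs offline, and the `sentence` key is an assumed field name.

```python
from itertools import islice


def take(stream, n):
    """Return up to the first n items of any iterable."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical stand-in for a streamed split; yields dict-like examples."""
    i = 0
    while True:
        yield {"sentence": f"example sentence {i}"}
        i += 1


first_three = take(fake_stream(), 3)
for example in first_three:
    print(example["sentence"])
```

With the real dataset, the same helper would be called as `take(cv_21, 3)` on a split loaded with `streaming=True`.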
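Creating a PyTorch DataLoader
Local mode
```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_21 = load_dataset("fsicoli/common_voice_21_0", "pt", split="train")
# Draw indices in random order and group them into batches of 32,
# keeping the final, possibly smaller, batch.
batch_sampler = BatchSampler(RandomSampler(cv_21), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_21, batch_sampler=batch_sampler)
```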
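Streaming mode
```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset, so no sampler is used here.
cv_21 = load_dataset("fsicoli/common_voice_21_0", "pt", split="train", streaming=True)
dataloader = DataLoader(cv_21, batch_size=32)
```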
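Audio clips differ in length, so batching with a batch size above 1 generally needs a custom `collate_fn` that pads each waveform to a common length. The following is a minimal sketch in plain Python, not something taken from the dataset card. It assumes each example exposes the waveform as `example["audio"]["array"]` and the transcript as `example["sentence"]`; `pad_collate` is a hypothetical helper name. In practice the padded lists would usually be converted to tensors before being returned.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad variable-length waveforms in a batch to the longest one.

    Assumes each example looks like
    {"audio": {"array": [...]}, "sentence": "..."} (an assumption).
    """
    waveforms = [list(example["audio"]["array"]) for example in batch]
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths)
    padded = [w + [pad_value] * (max_len - len(w)) for w in waveforms]
    return {
        "waveforms": padded,  # every row now has length max_len
        "lengths": lengths,   # original lengths, useful for masking
        "sentences": [example["sentence"] for example in batch],
    }


# Toy batch standing in for real examples (values are invented).
toy_batch = [
    {"audio": {"array": [0.1, 0.2, 0.3]}, "sentence": "olá"},
    {"audio": {"array": [0.5]}, "sentence": "bom dia"},
]
collated = pad_collate(toy_batch)
print(collated["lengths"])
```

Such a function could be passed to the loader through its `collate_fn` argument, for example `DataLoader(cv_21, batch_size=32, collate_fn=pad_collate)`.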
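Data Structure
- Data instances: each instance includes the audio file path, sentence, accent, age, client ID, up-votes, down-votes, gender, locale, and segment.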
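To make the fields concrete, the sketch below builds a hypothetical record and filters records by vote counts. The key names (`client_id`, `path`, `sentence`, `up_votes`, `down_votes`, `age`, `gender`, `accent`, `locale`, `segment`) are assumed from the field list above, every value is invented for illustration, and `is_well_rated` is a hypothetical helper with an arbitrary threshold.

```python
# Hypothetical record; key names are assumed, values are invented.
record = {
    "client_id": "abc123",
    "path": "clips/example.mp3",
    "sentence": "Bom dia.",
    "up_votes": 2,
    "down_votes": 0,
    "age": "twenties",
    "gender": "",
    "accent": "",
    "locale": "pt",
    "segment": "",
}


def is_well_rated(example, min_margin=2):
    """True when up-votes exceed down-votes by at least min_margin."""
    return example["up_votes"] - example["down_votes"] >= min_margin


# Second record is a copy with tied votes, so it should be filtered out.
records = [record, {**record, "up_votes": 1, "down_votes": 1}]
kept = [r for r in records if is_well_rated(r)]
print(len(kept))
```

On a loaded split, the same predicate could plausibly be applied with `cv_21.filter(is_well_rated)`, provided the real column names match the assumed ones.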
Licensing Information
- License type: Public Domain, CC-0
Citation Information
```bibtex
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```




