fsicoli/common_voice_15_0
Dataset Overview
Basic Information
- Dataset name: Common Voice Corpus 15.0
- License: CC (CC-0, public domain)
- Task category: automatic speech recognition
- Dataset size: 100B<n<1T
- Tags: mozilla, foundation
Languages
- Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba
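Usage
- Download and load: the `load_dataset` function downloads the dataset and prepares it on your local drive. For example, to download the Portuguese configuration:

      from datasets import load_dataset

      # "pt" selects the Portuguese configuration
      cv_15 = load_dataset("fsicoli/common_voice_15_0", "pt", split="train")

- Streaming: adding the `streaming=True` argument loads examples on the fly instead of downloading the whole dataset to disk first:

      from datasets import load_dataset

      cv_15 = load_dataset("fsicoli/common_voice_15_0", "pt", split="train", streaming=True)

      # Fetch and print the first example
      print(next(iter(cv_15)))

- Creating a PyTorch DataLoader:
  - Local:

        from datasets import load_dataset
        from torch.utils.data import DataLoader
        from torch.utils.data.sampler import BatchSampler, RandomSampler

        cv_15 = load_dataset("fsicoli/common_voice_15_0", "pt", split="train")

        # Draw random batches of 32 indices, keeping the final partial batch
        batch_sampler = BatchSampler(RandomSampler(cv_15), batch_size=32, drop_last=False)
        dataloader = DataLoader(cv_15, batch_sampler=batch_sampler)

  - Streaming:

        from datasets import load_dataset
        from torch.utils.data import DataLoader

        # streaming=True returns an iterable dataset, so no sampler is used
        cv_15 = load_dataset("fsicoli/common_voice_15_0", "pt", split="train", streaming=True)
        dataloader = DataLoader(cv_15, batch_size=32)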
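Speech clips differ in length, so batching raw audio with a `DataLoader` usually needs a custom `collate_fn` that pads each batch to a common length. The following is a minimal sketch in plain Python, not part of the dataset card. It assumes each example stores its waveform under `example["audio"]["array"]` and its transcript under `example["sentence"]`; check one real example before relying on that layout. The name `pad_collate` and the toy batch are hypothetical.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad every waveform in the batch to the length of the longest one."""
    # Assumed layout: example["audio"]["array"] holds the samples of one clip.
    waveforms = [list(example["audio"]["array"]) for example in batch]
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths)
    padded = [w + [pad_value] * (max_len - len(w)) for w in waveforms]
    return {
        "waveforms": padded,  # rectangular: len(batch) rows of max_len samples
        "lengths": lengths,   # original lengths, useful for masking the padding
        "sentences": [example["sentence"] for example in batch],
    }


# Toy batch standing in for two decoded examples (all values are made up).
toy_batch = [
    {"audio": {"array": [0.1, 0.2, 0.3], "sampling_rate": 48000}, "sentence": "bom dia"},
    {"audio": {"array": [0.5], "sampling_rate": 48000}, "sentence": "obrigado"},
]
collated = pad_collate(toy_batch)
print(collated["lengths"])
```

Under these assumptions the function would be passed as `DataLoader(cv_15, batch_size=32, collate_fn=pad_collate)`. If tensors are needed, the padded lists can be converted with `torch.tensor` inside the function.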
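Data Structure
- Data instances: each data point contains the `path` to the audio file and its `sentence` (the transcript). Additional fields are `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale`, and `segment`.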
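Because each instance records how many reviewers approved or rejected the clip, one common preprocessing step is to keep only clips whose votes clearly favor them. The sketch below assumes the vote counts are integers stored under `up_votes` and `down_votes`. The function name `is_well_validated`, the margin of 2, and the sample records are hypothetical illustrations, not values taken from the dataset.

```python
def is_well_validated(example, min_margin=2):
    """Return True when up-votes exceed down-votes by at least `min_margin`."""
    return example["up_votes"] - example["down_votes"] >= min_margin


# Made-up records that mimic the fields of a data instance.
records = [
    {"path": "clip_a.mp3", "sentence": "bom dia", "up_votes": 4, "down_votes": 0},
    {"path": "clip_b.mp3", "sentence": "obrigado", "up_votes": 2, "down_votes": 1},
]
kept = [record["path"] for record in records if is_well_validated(record)]
print(kept)
```

With a loaded dataset, the same predicate can be applied through the `filter` method, for example `cv_15.filter(is_well_validated)`.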
License Information
- Public domain: CC-0
Citation Information

    @inproceedings{commonvoice:2020,
      author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
      title = {Common Voice: A Massively-Multilingual Speech Corpus},
      booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
      pages = {4211--4215},
      year = 2020
    }




