common_voice_19_0
Dataset Card: Common Voice Corpus 19.0
Overview
- Dataset name: Common Voice Corpus 19.0
- Dataset type: Speech dataset
- Task category: Automatic speech recognition
- Dataset size: 100B < n < 1T
- Tags: mozilla, foundation
- License: CC0-1.0 (public domain)
Languages
- Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba
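Usage
Download and Loading
- The `load_dataset` function from the `datasets` library downloads and preprocesses the dataset.
- Example: download the Portuguese configuration

```python
from datasets import load_dataset

# "pt" selects the Portuguese configuration; split="train" loads only the training split.
cv_19 = load_dataset("fsicoli/common_voice_19_0", "pt", split="train")
```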
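Streaming
- Adding the `streaming=True` argument loads the dataset in streaming mode, so examples are fetched as they are iterated over.

```python
from datasets import load_dataset

cv_19 = load_dataset("fsicoli/common_voice_19_0", "pt", split="train", streaming=True)

# Fetch and print the first example from the stream.
print(next(iter(cv_19)))
```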
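Because a streamed dataset is consumed as an iterator, one convenient way to inspect a handful of examples is `itertools.islice`. The sketch below is only an illustration: `peek` is a hypothetical helper, `fake_stream` is a stand-in generator so the snippet runs offline, and the `sentence` key is assumed from the field description in this card. In practice the `cv_19` object created with `streaming=True` would take the place of `fake_stream()`.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(stream, n))


def fake_stream():
    """Stand-in for a streamed dataset: yields dict-like examples one at a time."""
    for i in range(1000):
        yield {"sentence": f"example sentence {i}"}  # assumed field name


first = peek(fake_stream(), n=2)
print([ex["sentence"] for ex in first])
```

Only the requested number of examples is pulled from the iterator, so the rest of the stream is never fetched.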
PyTorch DataLoader
- Local loading

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_19 = load_dataset("fsicoli/common_voice_19_0", "pt", split="train")

# Draw random indices and group them into batches of 32, keeping the last partial batch.
batch_sampler = BatchSampler(RandomSampler(cv_19), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_19, batch_sampler=batch_sampler)
```
- Streaming

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset, so no sampler is needed.
cv_19 = load_dataset("fsicoli/common_voice_19_0", "pt", split="train", streaming=True)
dataloader = DataLoader(cv_19, batch_size=32)
```
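Audio clips differ in length, so default batching can fail when raw waveforms are stacked. A custom collate function that pads each clip to the longest one in the batch is one common workaround. The following is a minimal sketch in plain Python so that it runs without PyTorch installed: `pad_collate` is a hypothetical helper, and the `audio`, `array`, and `sentence` keys are assumptions about the example layout that this card does not confirm.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad each example's waveform to the longest one in the batch.

    Assumes each example looks like
    {"audio": {"array": [...]}, "sentence": "..."}; these keys are assumptions.
    """
    lengths = [len(ex["audio"]["array"]) for ex in batch]
    max_len = max(lengths)
    padded = [
        list(ex["audio"]["array"]) + [pad_value] * (max_len - n)
        for ex, n in zip(batch, lengths)
    ]
    return {
        "waveforms": padded,  # equal-length lists, ready to convert to a tensor
        "lengths": lengths,   # original lengths, useful for masking the padding
        "sentences": [ex["sentence"] for ex in batch],
    }


# Synthetic two-example batch, for illustration only.
batch = [
    {"audio": {"array": [0.1, 0.2, 0.3]}, "sentence": "olá"},
    {"audio": {"array": [0.5]}, "sentence": "bom dia"},
]
out = pad_collate(batch)
print(out["lengths"])
```

A function like this could be passed to the loader as `DataLoader(cv_19, batch_size=32, collate_fn=pad_collate)`, with the padded lists converted to tensors inside the function if needed.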
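Data Structure
- Data instances: Each data point contains the path to an audio file and the corresponding sentence. Other fields include accent, age, client ID, upvote count, downvote count, gender, locale, and segment.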
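The vote counts can serve as a rough quality signal when selecting clips. The sketch below shows one possible filter on synthetic records. The key names `path`, `sentence`, `up_votes`, and `down_votes` are assumptions based on the field description above, and `is_well_rated` is a hypothetical helper, not part of the dataset or the `datasets` library.

```python
def is_well_rated(example, min_margin=1):
    """True when upvotes exceed downvotes by at least min_margin."""
    return example["up_votes"] - example["down_votes"] >= min_margin


# Synthetic records, for illustration only; not taken from the corpus.
records = [
    {"path": "clip_a.mp3", "sentence": "Bom dia.", "up_votes": 2, "down_votes": 0},
    {"path": "clip_b.mp3", "sentence": "Até logo.", "up_votes": 1, "down_votes": 1},
]
kept = [r for r in records if is_well_rated(r)]
print([r["path"] for r in kept])
```

With a loaded dataset, the same predicate could be supplied to the dataset's `filter` method instead of a list comprehension.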
License Information
- License: CC0-1.0 (public domain)
Citation Information
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}




