
fsicoli/common_voice_16_0

Source: Hugging Face. Updated: 2023-12-22. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/fsicoli/common_voice_16_0
Resource description:
---
license: cc0-1.0
language:
- ab
- af
- am
- ar
- as
- ast
- az
- ba
- bas
- be
- bg
- bn
- br
- ca
- ckb
- cnh
- cs
- cv
- cy
- da
- de
- dv
- dyu
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- gl
- gn
- ha
- he
- hi
- hsb
- hu
- ia
- id
- ig
- is
- it
- ja
- ka
- kab
- kk
- kmr
- ko
- ky
- lg
- lo
- lt
- lv
- mdf
- mhr
- mk
- ml
- mn
- mr
- mrj
- mt
- myv
- nl
- oc
- or
- pl
- ps
- pt
- quy
- ro
- ru
- rw
- sah
- sat
- sc
- sk
- skr
- sl
- sq
- sr
- sw
- ta
- th
- ti
- tig
- tk
- tok
- tr
- tt
- tw
- ug
- uk
- ur
- uz
- vi
- vot
- yue
- zgh
- zh
- yo
task_categories:
- automatic-speech-recognition
pretty_name: Common Voice Corpus 16.0
size_categories:
- 100B<n<1T
tags:
- mozilla
- foundation
---

# Dataset Card for Common Voice Corpus 16.0

<!-- Provide a quick summary of the dataset. -->

This dataset is an unofficial version of the Mozilla Common Voice Corpus 16. It was downloaded and converted from the project's website https://commonvoice.mozilla.org/.

## Languages

```
Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba
```

## How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared on your local drive in one call to the `load_dataset` function.

For example, to download the Portuguese config, specify the corresponding language config name (i.e., "pt" for Portuguese):

```
from datasets import load_dataset

cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train")
```

Using the `datasets` library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Streaming mode loads individual samples one at a time, rather than downloading the entire dataset to disk.

```
from datasets import load_dataset

cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train", streaming=True)

print(next(iter(cv_16)))
```

Bonus: create a PyTorch dataloader directly with your own datasets (local/streamed).

### Local

```
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train")
# Draw random batches of 32 indices; keep the final, smaller batch.
batch_sampler = BatchSampler(RandomSampler(cv_16), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_16, batch_sampler=batch_sampler)
```

### Streaming

```
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset instead of downloading everything first.
cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train", streaming=True)
dataloader = DataLoader(cv_16, batch_size=32)
```

To find out more about loading and preparing audio datasets, head over to hf.co/blog/audio-datasets.

### Dataset Structure

#### Data Instances

A typical data point comprises the path to the audio file and its sentence. Additional fields include accent, age, client_id, up_votes, down_votes, gender, locale and segment.
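
As a rough illustration of how the fields listed above might be used, the sketch below keeps only clips whose up-votes exceed their down-votes by a margin. The helper name `is_well_validated` and the `min_margin` threshold are hypothetical, and the code assumes that `up_votes` and `down_votes` can be converted with `int()`, which this card does not confirm.

```python
def is_well_validated(example, min_margin=2):
    """Return True when up-votes exceed down-votes by at least `min_margin`.

    `example` is a dict shaped like one data instance. The vote fields are
    assumed (not confirmed by the card) to be integers or numeric strings.
    """
    up = int(example.get("up_votes") or 0)
    down = int(example.get("down_votes") or 0)
    return up - down >= min_margin


# Hypothetical records that mimic the documented fields; real values will differ.
samples = [
    {"path": "clip_a.mp3", "sentence": "Bom dia.", "up_votes": 4, "down_votes": 0},
    {"path": "clip_b.mp3", "sentence": "Boa noite.", "up_votes": "2", "down_votes": "1"},
]
kept = [s["path"] for s in samples if is_well_validated(s)]
print(kept)  # ['clip_a.mp3']
```

With a loaded dataset, the same predicate could be passed to the `filter` method, for example `cv_16.filter(is_well_validated)`.
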
### Licensing Information

Public Domain, CC-0

### Citation Information

```
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```

---
Provider:
fsicoli
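Summary of original information

Dataset card: Common Voice Corpus 16.0

Overview

Common Voice Corpus 16.0 is an unofficial version of the Mozilla Common Voice corpus. The dataset was downloaded and converted from the project website.

Languages

The dataset covers the following languages:

  • Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Pashto, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Upper Sorbian, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamazight, Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba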

Usage

The datasets library loads and pre-processes the dataset in pure Python. The load_dataset function downloads the dataset and prepares it on the local drive.

For example, to download the Portuguese config:

```python
from datasets import load_dataset

cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train")
```

Adding the streaming=True argument streams the dataset instead of downloading it in full:

```python
from datasets import load_dataset

cv_16 = load_dataset("fsicoli/common_voice_16_0", "pt", split="train", streaming=True)
print(next(iter(cv_16)))
```
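
Because a streamed dataset is consumed as an iterator, one common way to peek at a handful of samples is `itertools.islice`. The sketch below uses a stand-in generator (`fake_stream`, hypothetical) so that it runs without network access; the streamed cv_16 object from the example above would take its place.

```python
from itertools import islice


def take(stream, n):
    """Return the first `n` items of any iterable as a list."""
    return list(islice(stream, n))


def fake_stream():
    """Stand-in for a streamed dataset; yields hypothetical records lazily."""
    i = 0
    while True:
        yield {"path": f"clip_{i}.mp3", "sentence": f"sentence {i}"}
        i += 1


first_three = take(fake_stream(), 3)
print([ex["path"] for ex in first_three])  # ['clip_0.mp3', 'clip_1.mp3', 'clip_2.mp3']
```

Only the requested number of items is pulled from the source, which is the point of streaming: nothing beyond those samples needs to be fetched.
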

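Dataset structure

Each data point contains the path to the audio file and its corresponding sentence. Other fields include accent, age, client_id, up_votes, down_votes, gender, locale and segment.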
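
To make the metadata fields concrete, here is a small sketch that tallies one field across a collection of data points. The function name `tally_field` and the sample records are hypothetical, and treating empty strings and `None` as missing values is an assumption rather than something the card specifies.

```python
from collections import Counter


def tally_field(examples, field, missing_label="unspecified"):
    """Count how often each value of `field` occurs across `examples`.

    Empty strings and None are grouped under `missing_label` (an assumption
    about how absent metadata is encoded).
    """
    counts = Counter()
    for ex in examples:
        value = ex.get(field)
        counts[value if value not in ("", None) else missing_label] += 1
    return counts


# Hypothetical records using the documented field names.
records = [
    {"gender": "female", "age": "twenties", "locale": "pt"},
    {"gender": "", "age": "thirties", "locale": "pt"},
    {"gender": "female", "age": "", "locale": "pt"},
]
print(tally_field(records, "gender"))
```

The same function accepts any iterable of dict-like examples, so it could be fed a loaded split or a bounded slice of a stream.
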

Licensing information

Public domain, CC-0 license.

Citation information

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
