ssahir/common_voice_13_0_dv_preprocessed

Name: ssahir/common_voice_13_0_dv_preprocessed
Creator: ssahir
Published: 2023-09-27 14:47:43
License: No description available

Source: Hugging Face. Updated 2023-09-27; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ssahir/common_voice_13_0_dv_preprocessed


Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
  ab:
  - 10K<n<100K
  ar:
  - 100K<n<1M
  as:
  - 1K<n<10K
  ast:
  - 1K<n<10K
  az:
  - n<1K
  ba:
  - 100K<n<1M
  bas:
  - 1K<n<10K
  be:
  - 1M<n<10M
  bg:
  - 10K<n<100K
  bn:
  - 1M<n<10M
  br:
  - 10K<n<100K
  ca:
  - 1M<n<10M
  ckb:
  - 100K<n<1M
  cnh:
  - 1K<n<10K
  cs:
  - 100K<n<1M
  cv:
  - 10K<n<100K
  cy:
  - 100K<n<1M
  da:
  - 10K<n<100K
  de:
  - 100K<n<1M
  dv:
  - 10K<n<100K
  dyu:
  - n<1K
  el:
  - 10K<n<100K
  en:
  - 1M<n<10M
  eo:
  - 1M<n<10M
  es:
  - 1M<n<10M
  et:
  - 10K<n<100K
  eu:
  - 100K<n<1M
  fa:
  - 100K<n<1M
  fi:
  - 10K<n<100K
  fr:
  - 100K<n<1M
  fy-NL:
  - 100K<n<1M
  ga-IE:
  - 10K<n<100K
  gl:
  - 10K<n<100K
  gn:
  - 1K<n<10K
  ha:
  - 10K<n<100K
  hi:
  - 10K<n<100K
  hsb:
  - 1K<n<10K
  hu:
  - 10K<n<100K
  hy-AM:
  - 1K<n<10K
  ia:
  - 10K<n<100K
  id:
  - 10K<n<100K
  ig:
  - 1K<n<10K
  is:
  - n<1K
  it:
  - 100K<n<1M
  ja:
  - 100K<n<1M
  ka:
  - 10K<n<100K
  kab:
  - 100K<n<1M
  kk:
  - 1K<n<10K
  kmr:
  - 10K<n<100K
  ko:
  - 1K<n<10K
  ky:
  - 10K<n<100K
  lg:
  - 100K<n<1M
  lo:
  - n<1K
  lt:
  - 10K<n<100K
  lv:
  - 10K<n<100K
  mdf:
  - n<1K
  mhr:
  - 100K<n<1M
  mk:
  - n<1K
  ml:
  - 1K<n<10K
  mn:
  - 10K<n<100K
  mr:
  - 10K<n<100K
  mrj:
  - 10K<n<100K
  mt:
  - 10K<n<100K
  myv:
  - 1K<n<10K
  nan-tw:
  - 10K<n<100K
  ne-NP:
  - n<1K
  nl:
  - 10K<n<100K
  nn-NO:
  - n<1K
  oc:
  - 1K<n<10K
  or:
  - 1K<n<10K
  pa-IN:
  - 1K<n<10K
  pl:
  - 100K<n<1M
  pt:
  - 100K<n<1M
  quy:
  - n<1K
  rm-sursilv:
  - 1K<n<10K
  rm-vallader:
  - 1K<n<10K
  ro:
  - 10K<n<100K
  ru:
  - 100K<n<1M
  rw:
  - 1M<n<10M
  sah:
  - 1K<n<10K
  sat:
  - n<1K
  sc:
  - 1K<n<10K
  sk:
  - 10K<n<100K
  skr:
  - 1K<n<10K
  sl:
  - 10K<n<100K
  sr:
  - 1K<n<10K
  sv-SE:
  - 10K<n<100K
  sw:
  - 100K<n<1M
  ta:
  - 100K<n<1M
  th:
  - 100K<n<1M
  ti:
  - n<1K
  tig:
  - n<1K
  tk:
  - 1K<n<10K
  tok:
  - 10K<n<100K
  tr:
  - 10K<n<100K
  tt:
  - 10K<n<100K
  tw:
  - n<1K
  ug:
  - 10K<n<100K
  uk:
  - 10K<n<100K
  ur:
  - 100K<n<1M
  uz:
  - 100K<n<1M
  vi:
  - 10K<n<100K
  vot:
  - n<1K
  yo:
  - 1K<n<10K
  yue:
  - 10K<n<100K
  zh-CN:
  - 100K<n<1M
  zh-HK:
  - 100K<n<1M
  zh-TW:
  - 100K<n<1M
source_datasets:
- extended|common_voice
task_categories:
- automatic-speech-recognition
paperswithcode_id: common-voice
pretty_name: Common Voice Corpus 13.0
language_bcp47:
- ab
- ar
- as
- ast
- az
- ba
- bas
- be
- bg
- bn
- br
- ca
- ckb
- cnh
- cs
- cv
- cy
- da
- de
- dv
- dyu
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy-NL
- ga-IE
- gl
- gn
- ha
- hi
- hsb
- hu
- hy-AM
- ia
- id
- ig
- is
- it
- ja
- ka
- kab
- kk
- kmr
- ko
- ky
- lg
- lo
- lt
- lv
- mdf
- mhr
- mk
- ml
- mn
- mr
- mrj
- mt
- myv
- nan-tw
- ne-NP
- nl
- nn-NO
- oc
- or
- pa-IN
- pl
- pt
- quy
- rm-sursilv
- rm-vallader
- ro
- ru
- rw
- sah
- sat
- sc
- sk
- skr
- sl
- sr
- sv-SE
- sw
- ta
- th
- ti
- tig
- tk
- tok
- tr
- tt
- tw
- ug
- uk
- ur
- uz
- vi
- vot
- yo
- yue
- zh-CN
- zh-HK
- zh-TW
extra_gated_prompt: By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset.
---

# Dataset Card for Common Voice Corpus 13.0

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
  - [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://commonvoice.mozilla.org/en/datasets
- **Repository:** https://github.com/common-voice/common-voice
- **Paper:** https://arxiv.org/abs/1912.06670
- **Leaderboard:** https://paperswithcode.com/dataset/common-voice
- **Point of Contact:** [Vaibhav Srivastav](mailto:vaibhav@huggingface.co)

### Dataset Summary

Each entry in the Common Voice dataset consists of a unique MP3 file and a corresponding text file. Many of the 27141 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines.

The dataset currently consists of 17689 validated hours in 108 languages, and more voices and languages are continually being added. Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.

### Supported Tasks and Leaderboards

The results for models trained on the Common Voice datasets are available via the [🤗 Autoevaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=mozilla-foundation%2Fcommon_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=ar&split=test&metric=wer).

### Languages

```
Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba
```

## How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function.

For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi" for Hindi):

```python
from datasets import load_dataset

cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train")
```

Using the `datasets` library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.

```python
from datasets import load_dataset

cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train", streaming=True)

print(next(iter(cv_13)))
```

*Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed).

### Local

```python
from datasets import load_dataset
from torch.utils.data import DataLoader  # missing in the original snippet
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train")
# Draw random batches of 32 indices from the locally downloaded split
batch_sampler = BatchSampler(RandomSampler(cv_13), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_13, batch_sampler=batch_sampler)
```

### Streaming

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset instead of downloading the split
cv_13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train", streaming=True)
dataloader = DataLoader(cv_13, batch_size=32)
```

To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets).

### Example scripts

Train your own CTC or Seq2Seq Automatic Speech Recognition models on Common Voice 13 with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).

## Dataset Structure

### Data Instances

A typical data point comprises the `path` to the audio file and its `sentence`. Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.

```python
{
  'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
  'path': 'et/clips/common_voice_et_18318995.mp3',
  'audio': {
    'path': 'et/clips/common_voice_et_18318995.mp3',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
    'sampling_rate': 48000
  },
  'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
  'up_votes': 2,
  'down_votes': 0,
  'age': 'twenties',
  'gender': 'male',
  'accent': '',
  'locale': 'et',
  'segment': ''
}
```

### Data Fields

`client_id` (`string`): An id for which client (voice) made the recording

`path` (`string`): The path to the audio file

`audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.

`sentence` (`string`): The sentence the user was prompted to speak

`up_votes` (`int64`): How many upvotes the audio file has received from reviewers

`down_votes` (`int64`): How many downvotes the audio file has received from reviewers

`age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)

`gender` (`string`): The gender of the speaker

`accent` (`string`): Accent of the speaker

`locale` (`string`): The locale of the speaker

`segment` (`string`): Usually an empty field

### Data Splits

The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.
The validated data is data that has been validated by reviewers and received upvotes indicating that the data is of high quality.
The invalidated data is data that has been invalidated by reviewers and received downvotes indicating that the data is of low quality.
The reported data is data that has been reported, for different reasons.
The other data is data that has not yet been reviewed.
The dev, test and train portions are all data that has been reviewed, deemed of high quality and split into dev, test and train.

## Data Preprocessing Recommended by Hugging Face

The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them into practice.

Many examples in this dataset have trailing quotation marks, e.g. _“the cat sat on the mat.“_. These trailing quotation marks do not change the actual meaning of the sentence, and it is nearly impossible to infer from audio data alone whether a sentence is a quotation. In these cases, it is advised to strip the quotation marks, leaving: _the cat sat on the mat_.

In addition, the majority of training sentences end in punctuation ( . or ? or ! ), whereas just a small proportion do not. In the dev set, **almost all** sentences end in punctuation. Thus, it is recommended to append a full-stop ( . ) to the end of the small number of training examples that do not end in punctuation.

```python
from datasets import load_dataset

ds = load_dataset("mozilla-foundation/common_voice_13_0", "en", use_auth_token=True)

def prepare_dataset(batch):
    """Function to preprocess the dataset with the .map method"""
    transcription = batch["sentence"]

    if transcription.startswith('"') and transcription.endswith('"'):
        # we can remove trailing quotation marks as they do not affect the transcription
        transcription = transcription[1:-1]

    if transcription[-1] not in [".", "?", "!"]:
        # append a full-stop to sentences that do not end in punctuation
        transcription = transcription + "."

    batch["sentence"] = transcription

    return batch

ds = ds.map(prepare_dataset, desc="preprocess dataset")
```

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)

### Citation Information

```
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```

Provider:

ssahir

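Summary of original information

Dataset card for Common Voice Corpus 13.0

Dataset description

Dataset summary

The Common Voice dataset consists of unique MP3 files and their corresponding text files. Many of the 27141 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines.

The dataset currently contains 17689 validated hours in 108 languages, and new voices and languages are continually being added. See the Languages page to request a language or start contributing.

Supported tasks and leaderboards

Results for models trained on the Common Voice datasets are available via the 🤗 Autoevaluate Leaderboard.

Languages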

Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dioula, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Korean, Kurmanji Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Odia, Persian, Polish, Portuguese, Punjabi, Quechua Chanka, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Turkmen, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh, Yoruba

Dataset structure

Data instances

A typical data point comprises the path to the audio file and its sentence. Additional fields include accent, age, client_id, up_votes, down_votes, gender, locale and segment.

    {
        'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
        'path': 'et/clips/common_voice_et_18318995.mp3',
        'audio': {
            'path': 'et/clips/common_voice_et_18318995.mp3',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
            'sampling_rate': 48000
        },
        'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
        'up_votes': 2,
        'down_votes': 0,
        'age': 'twenties',
        'gender': 'male',
        'accent': '',
        'locale': 'et',
        'segment': ''
    }

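Data fields

client_id (string): A unique identifier for the client (voice) that made the recording.
path (string): The path to the audio file.
audio (dict): A dictionary containing the path to the audio file, the decoded audio array, and the sampling rate.
sentence (string): The sentence the user was prompted to speak.
up_votes (int64): The number of upvotes the audio file has received.
down_votes (int64): The number of downvotes the audio file has received.
age (string): The age of the speaker.
gender (string): The gender of the speaker.
accent (string): The accent of the speaker.
locale (string): The locale of the speaker.
segment (string): Usually an empty field.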
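
As a small illustration of how the audio field fits together, the sketch below derives a clip's duration from the decoded array and the sampling rate. It assumes only the nested layout shown in the data instance above; the helper name `clip_duration_seconds`, the synthetic sample, and its file path are hypothetical stand-ins, not part of the dataset or of any library.

```python
def clip_duration_seconds(sample: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for one dataset row (hypothetical values, silent audio).
fake_sample = {
    "path": "et/clips/hypothetical_clip.mp3",
    "audio": {
        "path": "et/clips/hypothetical_clip.mp3",
        "array": [0.0] * 96000,   # 96000 samples
        "sampling_rate": 48000,   # 48 kHz, as in the example instance
    },
    "sentence": "placeholder sentence",
}

print(clip_duration_seconds(fake_sample))  # 2.0
```

With a real row, the same function would be called on `dataset[0]`, which decodes only that one audio file.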

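Data splits

The speech data has been subdivided into dev, train, test, validated, invalidated, reported and other portions.

validated: data that reviewers have validated and upvoted as high quality.
invalidated: data that reviewers have invalidated and downvoted as low quality.
reported: data that has been reported, for various reasons.
other: data that has not yet been reviewed.
dev, test, train: data that has been reviewed, deemed high quality, and split into dev, test and train.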
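
The `up_votes` and `down_votes` fields can also be used for an extra, user-side quality filter. The following is a rough sketch under stated assumptions: the helper names and the `min_net_votes` threshold are hypothetical and chosen only for illustration, and this is not the rule Common Voice itself uses to assign clips to the validated or invalidated portions.

```python
def net_votes(sample: dict) -> int:
    """Upvotes minus downvotes for one row."""
    return sample["up_votes"] - sample["down_votes"]


def keep_well_reviewed(samples: list, min_net_votes: int = 2) -> list:
    """Keep rows whose net vote count reaches the (hypothetical) threshold."""
    return [s for s in samples if net_votes(s) >= min_net_votes]


# Synthetic rows carrying only the fields this sketch needs.
rows = [
    {"sentence": "a", "up_votes": 2, "down_votes": 0},
    {"sentence": "b", "up_votes": 1, "down_votes": 2},
    {"sentence": "c", "up_votes": 0, "down_votes": 0},
]

kept = keep_well_reviewed(rows)
print([r["sentence"] for r in kept])  # ['a']
```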

Recommended data preprocessing

The following are data preprocessing steps recommended by the Hugging Face team, accompanied by an example code snippet.

Many example sentences have trailing quotation marks, e.g. “the cat sat on the mat.“. These quotation marks do not change the actual meaning of the sentence, so it is advised to strip them, leaving: the cat sat on the mat.

In addition, the majority of training sentences end in punctuation (. or ? or !), whereas only a small proportion do not. In the dev set, almost all sentences end in punctuation. It is therefore recommended to append a full stop (.) to the small number of training examples that do not end in punctuation.

    from datasets import load_dataset

    ds = load_dataset("mozilla-foundation/common_voice_13_0", "en", use_auth_token=True)

    def prepare_dataset(batch):
        """Function to preprocess the dataset with the .map method"""
        transcription = batch["sentence"]

        if transcription.startswith('"') and transcription.endswith('"'):
            # we can remove trailing quotation marks as they do not affect the transcription
            transcription = transcription[1:-1]

        if transcription[-1] not in [".", "?", "!"]:
            # append a full-stop to sentences that do not end in punctuation
            transcription = transcription + "."

        batch["sentence"] = transcription

        return batch

    ds = ds.map(prepare_dataset, desc="preprocess dataset")
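
The two recommendations above can be tried on plain strings without downloading anything. The sketch below is a hedged variant of the snippet, not the Hugging Face team's code: the function name `normalize_transcription` is hypothetical, and it additionally assumes that curly quotation marks (as in the “the cat sat on the mat.“ example) and empty strings should be handled, which the original snippet does not do.

```python
# Assumption: treat straight and curly double quotes alike.
QUOTE_CHARS = ('"', "“", "”")
END_PUNCTUATION = (".", "?", "!")


def normalize_transcription(text: str) -> str:
    """Strip enclosing quotation marks and ensure terminal punctuation."""
    text = text.strip()
    # Remove quotes only when the sentence both starts and ends with one.
    if len(text) >= 2 and text[0] in QUOTE_CHARS and text[-1] in QUOTE_CHARS:
        text = text[1:-1].strip()
    # Append a full stop when a non-empty sentence lacks end punctuation.
    if text and text[-1] not in END_PUNCTUATION:
        text += "."
    return text


samples = [
    '"the cat sat on the mat"',
    "“the cat sat on the mat.“",
    "Is it raining?",
    "no punctuation here",
]
for s in samples:
    print(repr(s), "->", repr(normalize_transcription(s)))
```

To apply it to a loaded dataset, the function could be wrapped in a `.map` call that returns `{"sentence": normalize_transcription(batch["sentence"])}` for each row.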

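Dataset creation

Curation rationale

[Needs More Information]

Source data

Initial data collection and normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and sensitive information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for using the data

Social impact of dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]

Licensing information

Public Domain, CC-0

Citation information

    @inproceedings{commonvoice:2020,
      author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
      title = {Common Voice: A Massively-Multilingual Speech Corpus},
      booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
      pages = {4211--4215},
      year = 2020
    }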
