openslr/openslr

Name: openslr/openslr
Creator: openslr
Published: 2024-08-14 14:12:45
License: No description available

Hugging Face | Updated: 2024-08-14 | Indexed: 2024-05-25

Download link:

https://hf-mirror.com/datasets/openslr/openslr

Resource description:

---
pretty_name: OpenSLR
annotations_creators:
- found
language_creators:
- found
language:
- af
- bn
- ca
- en
- es
- eu
- gl
- gu
- jv
- km
- kn
- ml
- mr
- my
- ne
- si
- st
- su
- ta
- te
- tn
- ve
- xh
- yo
language_bcp47:
- en-GB
- en-IE
- en-NG
- es-CL
- es-CO
- es-PE
- es-PR
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
task_ids: []
paperswithcode_id: null
dataset_info:
- config_name: SLR41
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2423902
    num_examples: 5822
  download_size: 1890792360
  dataset_size: 2423902
- config_name: SLR42
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1427984
    num_examples: 2906
  download_size: 866086951
  dataset_size: 1427984
- config_name: SLR43
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1074005
    num_examples: 2064
  download_size: 800375645
  dataset_size: 1074005
- config_name: SLR44
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1776827
    num_examples: 4213
  download_size: 1472252752
  dataset_size: 1776827
- config_name: SLR63
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2016587
    num_examples: 4126
  download_size: 1345876299
  dataset_size: 2016587
- config_name: SLR64
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 810375
    num_examples: 1569
  download_size: 712155683
  dataset_size: 810375
- config_name: SLR65
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2136447
    num_examples: 4284
  download_size: 1373304655
  dataset_size: 2136447
- config_name: SLR66
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1898335
    num_examples: 4448
  download_size: 1035127870
  dataset_size: 1898335
- config_name: SLR69
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1647263
    num_examples: 4240
  download_size: 1848659543
  dataset_size: 1647263
- config_name: SLR35
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 73565374
    num_examples: 185076
  download_size: 18900105726
  dataset_size: 73565374
- config_name: SLR36
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 88942337
    num_examples: 219156
  download_size: 22996553929
  dataset_size: 88942337
- config_name: SLR70
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1339608
    num_examples: 3359
  download_size: 1213955196
  dataset_size: 1339608
- config_name: SLR71
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1676273
    num_examples: 4374
  download_size: 1445365903
  dataset_size: 1676273
- config_name: SLR72
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1876301
    num_examples: 4903
  download_size: 1612030532
  dataset_size: 1876301
- config_name: SLR73
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2084052
    num_examples: 5447
  download_size: 1940306814
  dataset_size: 2084052
- config_name: SLR74
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 237395
    num_examples: 617
  download_size: 214181314
  dataset_size: 237395
- config_name: SLR75
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1286937
    num_examples: 3357
  download_size: 1043317004
  dataset_size: 1286937
- config_name: SLR76
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2756507
    num_examples: 7136
  download_size: 3041125513
  dataset_size: 2756507
- config_name: SLR77
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2217652
    num_examples: 5587
  download_size: 2207991775
  dataset_size: 2217652
- config_name: SLR78
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2121986
    num_examples: 4272
  download_size: 1743222102
  dataset_size: 2121986
- config_name: SLR79
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 2176539
    num_examples: 4400
  download_size: 1820919115
  dataset_size: 2176539
- config_name: SLR80
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1308651
    num_examples: 2530
  download_size: 948181015
  dataset_size: 1308651
- config_name: SLR86
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1378801
    num_examples: 3583
  download_size: 907065562
  dataset_size: 1378801
- config_name: SLR32
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 4544052380
    num_examples: 9821
  download_size: 3312884763
  dataset_size: 4544052380
- config_name: SLR52
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 77369899
    num_examples: 185293
  download_size: 14676484074
  dataset_size: 77369899
- config_name: SLR53
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 88073248
    num_examples: 218703
  download_size: 14630810921
  dataset_size: 88073248
- config_name: SLR54
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 62735822
    num_examples: 157905
  download_size: 9328247362
  dataset_size: 62735822
- config_name: SLR83
  features:
  - name: path
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 7098985
    num_examples: 17877
  download_size: 7229890819
  dataset_size: 7098985
config_names:
- SLR32
- SLR35
- SLR36
- SLR41
- SLR42
- SLR43
- SLR44
- SLR52
- SLR53
- SLR54
- SLR63
- SLR64
- SLR65
- SLR66
- SLR69
- SLR70
- SLR71
- SLR72
- SLR73
- SLR74
- SLR75
- SLR76
- SLR77
- SLR78
- SLR79
- SLR80
- SLR83
- SLR86
---

# Dataset Card for openslr

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.openslr.org/
- **Repository:** [Needs More Information]
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

### Dataset Summary

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. Currently, the following resources are available:

#### SLR32: High quality TTS data for four South African languages (af, st, tn, xh).
This data set contains multi-speaker high quality transcribed audio data for four languages of South Africa. The data set consists of wave files, and a TSV file transcribing the audio. In each folder, the file line_index.tsv contains a FileID, which in turn contains the UserID, and the transcription of the audio in the file. The data set has had some quality checks, but there might still be errors.

This data set was collected as a collaboration between North West University and Google. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See https://github.com/google/language-resources#license for license information.
Copyright 2017 Google, Inc.
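As a minimal sketch of how such an index could be read, the snippet below parses tab-separated `line_index.tsv` rows into (FileID, transcription) pairs and groups the transcriptions by the user ID embedded in the FileID. It assumes the FileID has a `<prefix>_<user>_<utterance>` shape, like the `suf_00297_00037352660` file name in the Data Instances sample; that shape, the helper names and the sample rows are illustrative assumptions rather than something the SLR32 description specifies.

```python
import csv
import io
from collections import defaultdict


def read_line_index(handle):
    """Yield (file_id, transcription) pairs from a tab-separated index."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop empty cells so rows containing doubled tabs still parse.
        cells = [cell.strip() for cell in row if cell.strip()]
        if len(cells) != 2:
            continue  # skip blank or malformed rows
        yield cells[0], cells[1]


def user_id_from_file_id(file_id):
    """Return the middle field of an assumed <prefix>_<user>_<utterance> ID."""
    parts = file_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected FileID shape: {file_id!r}")
    return parts[1]


def group_by_user(pairs):
    """Map each user ID to the list of transcriptions that user recorded."""
    grouped = defaultdict(list)
    for file_id, text in pairs:
        grouped[user_id_from_file_id(file_id)].append(text)
    return dict(grouped)


# Hypothetical rows; a real index file sits inside each language folder.
SAMPLE = (
    "abc_0001_0000000001\tfirst sentence\n"
    "abc_0001_0000000002\tsecond sentence\n"
    "abc_0002_0000000003\tthird sentence\n"
)

by_user = group_by_user(read_line_index(io.StringIO(SAMPLE)))
print(by_user)
```

To read a real file, `open(path, encoding="utf-8", newline="")` can be passed in place of the `io.StringIO` object; if the IDs in a given archive do not split into three underscore-separated fields, `user_id_from_file_id` raises instead of guessing.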
#### SLR35: Large Javanese ASR training data set.
This data set contains transcribed audio data for Javanese (~185K utterances). The data set consists of wave files, and a TSV file. The file utt_spk_text.tsv contains a FileID, UserID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in collaboration with Reykjavik University and Universitas Gadjah Mada in Indonesia. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/35/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017 Google, Inc.

#### SLR36: Large Sundanese ASR training data set.
This data set contains transcribed audio data for Sundanese (~220K utterances). The data set consists of wave files, and a TSV file. The file utt_spk_text.tsv contains a FileID, UserID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in Indonesia. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/36/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017 Google, Inc.

#### SLR41: High quality TTS data for Javanese.
This data set contains high-quality transcribed audio data for Javanese. The data set consists of wave files, and a TSV file. The file line_index.tsv contains a filename and the transcription of audio in the file. Each filename is prepended with a speaker identification number. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in collaboration with Gadjah Mada University in Indonesia.
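The two index layouts described so far are the three-column `utt_spk_text.tsv` (FileID, UserID, transcription) and the two-column `line_index.tsv` (filename, transcription). The sketch below reads either one into a common record. The column order is inferred from the descriptions above. The `<FileID>.wav` naming, the record type and the sample rows are assumptions made for illustration, and are not taken from the corpus documentation.

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class Utterance(NamedTuple):
    file_id: str
    speaker: Optional[str]  # None when the index has no speaker column
    text: str


def read_index(handle: TextIO) -> Iterator[Utterance]:
    """Read a 3-column utt_spk_text.tsv or a 2-column line_index.tsv."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) == 3:
            file_id, speaker, text = row
            yield Utterance(file_id, speaker, text)
        elif len(row) == 2:
            file_id, text = row
            yield Utterance(file_id, None, text)
        # Rows of any other width are treated as malformed and skipped.


def wav_name(utt: Utterance) -> str:
    """Assumed naming: the audio for an entry is stored as <FileID>.wav."""
    return f"{utt.file_id}.wav"


# Hypothetical rows in each layout, for illustration only.
three_col = io.StringIO("0001abc\tspk01\thello world\n")
two_col = io.StringIO("xyz_00001_0000000001\tgood morning\n")

records = list(read_index(three_col)) + list(read_index(two_col))
for rec in records:
    print(wav_name(rec), rec.speaker, rec.text)
```

Keeping `speaker` as `None` for the two-column layout avoids guessing how the speaker number is encoded in the filename; a caller that knows the naming scheme of a particular archive can derive it separately.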
The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/41/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google LLC

#### SLR42: High quality TTS data for Khmer.
This data set contains high-quality transcribed audio data for Khmer. The data set consists of wave files, and a TSV file. The file line_index.tsv contains a filename and the transcription of audio in the file. Each filename is prepended with a speaker identification number. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/42/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google LLC

#### SLR43: High quality TTS data for Nepali.
This data set contains high-quality transcribed audio data for Nepali. The data set consists of wave files, and a TSV file. The file line_index.tsv contains a filename and the transcription of audio in the file. Each filename is prepended with a speaker identification number. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in Nepal. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/43/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google LLC

#### SLR44: High quality TTS data for Sundanese.
This data set contains high-quality transcribed audio data for Sundanese. The data set consists of wave files, and a TSV file.
The file line_index.tsv contains a filename and the transcription of audio in the file. Each filename is prepended with a speaker identification number. The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in collaboration with Universitas Pendidikan Indonesia. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/44/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google LLC

#### SLR52: Large Sinhala ASR training data set.
This data set contains transcribed audio data for Sinhala (~185K utterances). The data set consists of wave files, and a TSV file. The file utt_spk_text.tsv contains a FileID, UserID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/52/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google, Inc.

#### SLR53: Large Bengali ASR training data set.
This data set contains transcribed audio data for Bengali (~196K utterances). The data set consists of wave files, and a TSV file. The file utt_spk_text.tsv contains a FileID, UserID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/53/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google, Inc.

#### SLR54: Large Nepali ASR training data set.
This data set contains transcribed audio data for Nepali (~157K utterances). The data set consists of wave files, and a TSV file. The file utt_spk_text.tsv contains a FileID, UserID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/54/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2016, 2017, 2018 Google, Inc.

#### SLR63: Crowdsourced high-quality Malayalam multi-speaker speech data set
This data set contains transcribed high-quality audio of Malayalam sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/63/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR64: Crowdsourced high-quality Marathi multi-speaker speech data set
This data set contains transcribed high-quality audio of Marathi sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub.
https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/64/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR65: Crowdsourced high-quality Tamil multi-speaker speech data set
This data set contains transcribed high-quality audio of Tamil sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/65/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR66: Crowdsourced high-quality Telugu multi-speaker speech data set
This data set contains transcribed high-quality audio of Telugu sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/66/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR69: Crowdsourced high-quality Catalan multi-speaker speech data set
This data set contains transcribed high-quality audio of Catalan sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/69/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR70: Crowdsourced high-quality Nigerian English speech data set
This data set contains transcribed high-quality audio of Nigerian English sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/70/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR71: Crowdsourced high-quality Chilean Spanish speech data set
This data set contains transcribed high-quality audio of Chilean Spanish sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv).
The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/71/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR72: Crowdsourced high-quality Colombian Spanish speech data set
This data set contains transcribed high-quality audio of Colombian Spanish sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/72/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR73: Crowdsourced high-quality Peruvian Spanish speech data set
This data set contains transcribed high-quality audio of Peruvian Spanish sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub.
https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/73/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR74: Crowdsourced high-quality Puerto Rico Spanish speech data set
This data set contains transcribed high-quality audio of Puerto Rico Spanish sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/74/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR75: Crowdsourced high-quality Venezuelan Spanish speech data set
This data set contains transcribed high-quality audio of Venezuelan Spanish sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/75/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR76: Crowdsourced high-quality Basque speech data set
This data set contains transcribed high-quality audio of Basque sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/76/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR77: Crowdsourced high-quality Galician speech data set
This data set contains transcribed high-quality audio of Galician sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/77/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR78: Crowdsourced high-quality Gujarati multi-speaker speech data set
This data set contains transcribed high-quality audio of Gujarati sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv).
The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/78/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR79: Crowdsourced high-quality Kannada multi-speaker speech data set
This data set contains transcribed high-quality audio of Kannada sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/79/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR80: Crowdsourced high-quality Burmese speech data set
This data set contains transcribed high-quality audio of Burmese sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub.
https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/80/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR83: Crowdsourced high-quality UK and Ireland English Dialect speech data set
This data set contains transcribed high-quality audio of English sentences recorded by volunteers speaking different dialects of the language. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.csv contains a line id, an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors.

The recordings from the Welsh English speakers were collected in collaboration with Cardiff University. The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License. See [LICENSE](https://www.openslr.org/resources/83/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019 Google, Inc.

#### SLR86: Crowdsourced high-quality multi-speaker speech data set
This data set contains transcribed high-quality audio of sentences recorded by volunteers. The data set consists of wave files, and a TSV file (line_index.tsv). The file line_index.tsv contains an anonymized FileID and the transcription of audio in the file. The data set has been manually quality checked, but there might still be errors. Please report any issues in the following issue tracker on GitHub. https://github.com/googlei18n/language-resources/issues

The dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License.
See [LICENSE](https://www.openslr.org/resources/86/LICENSE) file and https://github.com/google/language-resources#license for license information.
Copyright 2018, 2019, 2020 Google, Inc.

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Javanese, Khmer, Nepali, Sundanese, Malayalam, Marathi, Tamil, Telugu, Catalan, Nigerian English, Chilean Spanish, Colombian Spanish, Peruvian Spanish, Puerto Rico Spanish, Venezuelan Spanish, Basque, Galician, Gujarati, Kannada, Afrikaans, Sesotho, Setswana and isiXhosa.

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, called `path`, and its transcription, called `sentence`.

#### SLR32, SLR35, SLR36, SLR41, SLR42, SLR43, SLR44, SLR52, SLR53, SLR54, SLR63, SLR64, SLR65, SLR66, SLR69, SLR70, SLR71, SLR72, SLR73, SLR74, SLR75, SLR76, SLR77, SLR78, SLR79, SLR80, SLR86
```
{
  'path': '/home/cahya/.cache/huggingface/datasets/downloads/extracted/4d9cf915efc21110199074da4d492566dee6097068b07a680f670fcec9176e62/su_id_female/wavs/suf_00297_00037352660.wav',
  'audio': {'path': '/home/cahya/.cache/huggingface/datasets/downloads/extracted/4d9cf915efc21110199074da4d492566dee6097068b07a680f670fcec9176e62/su_id_female/wavs/suf_00297_00037352660.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
            'sampling_rate': 16000},
  'sentence': 'Panonton ting haruleng ningali Kelly Clarkson keur nyanyi di tipi',
}
```

### Data Fields

- `path`: The path to the audio file.
- `audio`: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time.
  Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- `sentence`: The sentence the user was prompted to speak.

### Data Splits

There is only one "train" split for all configurations. The number of examples per configuration is:

| Configuration | Number of examples |
|:------|---------------------:|
| SLR41 | 5822 |
| SLR42 | 2906 |
| SLR43 | 2064 |
| SLR44 | 4213 |
| SLR63 | 4126 |
| SLR64 | 1569 |
| SLR65 | 4284 |
| SLR66 | 4448 |
| SLR69 | 4240 |
| SLR35 | 185076 |
| SLR36 | 219156 |
| SLR70 | 3359 |
| SLR71 | 4374 |
| SLR72 | 4903 |
| SLR73 | 5447 |
| SLR74 | 617 |
| SLR75 | 3357 |
| SLR76 | 7136 |
| SLR77 | 5587 |
| SLR78 | 4272 |
| SLR79 | 4400 |
| SLR80 | 2530 |
| SLR86 | 3583 |
| SLR32 | 9821 |
| SLR52 | 185293 |
| SLR53 | 218703 |
| SLR54 | 157905 |
| SLR83 | 17877 |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Each dataset is distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License ([CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)).
See https://github.com/google/language-resources#license or the resource page on [OpenSLR](https://openslr.org/resources.php) for more information.

### Citation Information

#### SLR32

```
@inproceedings{van-niekerk-etal-2017,
  title = {{Rapid development of TTS corpora for four South African languages}},
  author = {Daniel van Niekerk and Charl van Heerden and Marelie Davel and Neil Kleynhans and Oddur Kjartansson and Martin Jansche and Linne Ha},
  booktitle = {Proc. Interspeech 2017},
  pages = {2178--2182},
  address = {Stockholm, Sweden},
  month = aug,
  year = {2017},
  URL = {https://dx.doi.org/10.21437/Interspeech.2017-1139}
}
```

#### SLR35, SLR36, SLR52, SLR53, SLR54

```
@inproceedings{kjartansson-etal-sltu2018,
  title = {{Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali}},
  author = {Oddur Kjartansson and Supheakmungkol Sarin and Knot Pipatsrisawat and Martin Jansche and Linne Ha},
  booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
  year = {2018},
  address = {Gurugram, India},
  month = aug,
  pages = {52--55},
  URL = {https://dx.doi.org/10.21437/SLTU.2018-11},
}
```

#### SLR41, SLR42, SLR43, SLR44

```
@inproceedings{kjartansson-etal-tts-sltu2018,
  title = {{A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese}},
  author = {Keshan Sodimana and Knot Pipatsrisawat and Linne Ha and Martin Jansche and Oddur Kjartansson and Pasindu De Silva and Supheakmungkol Sarin},
  booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
  year = {2018},
  address = {Gurugram, India},
  month = aug,
  pages = {66--70},
  URL = {https://dx.doi.org/10.21437/SLTU.2018-14}
}
```

#### SLR63, SLR64, SLR65, SLR66, SLR78, SLR79

```
@inproceedings{he-etal-2020-open,
  title = {{Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems}},
  author = {He, Fei and Chu, Shan-Hui Cathy and Kjartansson, Oddur and Rivera, Clara and Katanova, Anna and Gutkin, Alexander and Demirsahin, Isin and Johny, Cibu and Jansche, Martin and Sarin, Supheakmungkol and Pipatsrisawat, Knot},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
  month = may,
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  pages = {6494--6503},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.800},
  ISBN = {979-10-95546-34-4},
}
```

#### SLR69, SLR76, SLR77

```
@inproceedings{kjartansson-etal-2020-open,
  title = {{Open-Source High Quality Speech Datasets for Basque, Catalan and Galician}},
  author = {Kjartansson, Oddur and Gutkin, Alexander and Butryna, Alena and Demirsahin, Isin and Rivera, Clara},
  booktitle = {Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)},
  year = {2020},
  pages = {21--27},
  month = may,
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  url = {https://www.aclweb.org/anthology/2020.sltu-1.3},
  ISBN = {979-10-95546-35-1},
}
```

#### SLR70, SLR71, SLR72, SLR73, SLR74, SLR75

```
@inproceedings{guevara-rukoz-etal-2020-crowdsourcing,
  title = {{Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech}},
  author = {Guevara-Rukoz, Adriana and Demirsahin, Isin and He, Fei and Chu, Shan-Hui Cathy and Sarin, Supheakmungkol and Pipatsrisawat, Knot and Gutkin, Alexander and Butryna, Alena and Kjartansson, Oddur},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
  year = {2020},
  month = may,
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.801},
  pages = {6504--6513},
  ISBN = {979-10-95546-34-4},
}
```

#### SLR80

```
@inproceedings{oo-etal-2020-burmese,
  title = {{Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech}},
  author = {Oo, Yin May and Wattanavekin, Theeraphol and Li, Chenfang and De Silva, Pasindu and Sarin, Supheakmungkol and Pipatsrisawat, Knot and Jansche, Martin and Kjartansson, Oddur and Gutkin, Alexander},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
  month = may,
  year = {2020},
  pages = {6328--6339},
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.777},
  ISBN = {979-10-95546-34-4},
}
```

#### SLR86

```
@inproceedings{gutkin-et-al-yoruba2020,
  title = {{Developing an Open-Source Corpus of Yoruba Speech}},
  author = {Alexander Gutkin and I{\c{s}}{\i}n Demir{\c{s}}ahin and Oddur Kjartansson and Clara Rivera and K\d{\'o}lá Túb\d{\`o}sún},
  booktitle = {Proceedings of Interspeech 2020},
  pages = {404--408},
  month = {October},
  year = {2020},
  address = {Shanghai, China},
  publisher = {International Speech and Communication Association (ISCA)},
  doi = {10.21437/Interspeech.2020-1096},
  url = {https://dx.doi.org/10.21437/Interspeech.2020-1096},
}
```

### Contributions

Thanks to [@cahya-wirawan](https://github.com/cahya-wirawan) for adding this dataset.
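
The access-order advice under Data Fields (prefer `dataset[0]["audio"]` over `dataset["audio"][0]`) can be illustrated without downloading anything. The sketch below is only a toy model: `LazyAudioTable`, its methods, and the generated file names are hypothetical and are not part of the `datasets` library. It simply counts how many files would be decoded under each access order, assuming decoding happens lazily when the audio value is read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LazyAudioTable:
    """Hypothetical stand-in for a dataset whose audio column decodes on access."""

    paths: List[str]
    decode_calls: int = 0

    def _decode(self, path: str) -> dict:
        # Placeholder for real decoding and resampling, which is the costly step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def row(self, index: int) -> dict:
        # Row-first access: only the requested file is decoded.
        path = self.paths[index]
        return {"path": path, "audio": self._decode(path)}

    def column(self, name: str) -> list:
        # Column-first access: every file is decoded before indexing happens.
        if name == "audio":
            return [self._decode(p) for p in self.paths]
        return list(self.paths)


# Hypothetical file names, used only to give the table some rows.
paths = [f"wavs/clip_{i:05d}.wav" for i in range(1000)]

row_first = LazyAudioTable(paths)
_ = row_first.row(0)["audio"]

column_first = LazyAudioTable(paths)
_ = column_first.column("audio")[0]

print(row_first.decode_calls, column_first.decode_calls)
```

In this toy model the row-first order decodes a single file while the column-first order decodes all of them, which is the cost difference the card warns about.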

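## OpenSLR Dataset Card

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

### Dataset Metadata

- Dataset name: OpenSLR
- Annotation creators: found
- Language creators: found
- Languages: Afrikaans (af), Bengali (bn), Catalan (ca), English (en), Spanish (es), Basque (eu), Galician (gl), Gujarati (gu), Javanese (jv), Khmer (km), Kannada (kn), Malayalam (ml), Marathi (mr), Burmese (my), Nepali (ne), Sinhala (si), Sesotho (st), Sundanese (su), Tamil (ta), Telugu (te), Setswana (tn), Venda (ve), isiXhosa (xh), Yoruba (yo)
- BCP47 language tags: en-GB (UK English), en-IE (Irish English), en-NG (Nigerian English), es-CL (Chilean Spanish), es-CO (Colombian Spanish), es-PE (Peruvian Spanish), es-PR (Puerto Rico Spanish)
- License: Creative Commons Attribution-ShareAlike 4.0 International Public License (CC-BY-SA-4.0)
- Multilinguality: multilingual
- Size category: 1,000 < n < 10,000
- Source datasets: original
- Task categories: automatic speech recognition
- Task IDs: none
- PapersWithCode ID: none

### Dataset Details

Details of each configuration are as follows.

#### SLR41

Features:

- `path`: string, the path to the audio file
- `audio`: audio, with a sampling rate of 48000 Hz
- `sentence`: string, the transcription of the audio

Split: train, 5822 examples, 2423902 bytes

Download size: 1890792360; dataset size: 2423902

(The remaining SLR configurations follow the same format and are omitted here.)

Available configurations: SLR32, SLR35, SLR36, SLR41, SLR42, SLR43, SLR44, SLR52, SLR53, SLR54, SLR63, SLR64, SLR65, SLR66, SLR69, SLR70, SLR71, SLR72, SLR73, SLR74, SLR75, SLR76, SLR77, SLR78, SLR79, SLR80, SLR83, SLR86

## Dataset Description

- **Homepage**: https://www.openslr.org/
- **Repository**: [Needs More Information]
- **Paper**: [Needs More Information]
- **Leaderboard**: [Needs More Information]
- **Point of Contact**: [Needs More Information]

### Dataset Summary

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition and software related to speech recognition. The following resources are currently available:

#### SLR32: High quality TTS data for four South African languages

This data set contains multi-speaker, high-quality transcribed audio data for four languages of South Africa. The data set consists of wave files and a TSV (tab-separated values) file with the transcriptions. In each folder, the file `line_index.tsv` contains a file ID, which also carries the speaker's user ID, together with the transcription of the corresponding audio.

The data set has had some basic quality checks, but there might still be errors.

The data set was collected in collaboration between North West University and Google.

The dataset is distributed under the CC-BY-SA-4.0 license. See https://github.com/google/language-resources#license for details.

Copyright 2017 Google, Inc.

#### SLR35: Large Javanese ASR training data set

This data set contains about 185,000 transcribed audio clips in Javanese and consists of wave files and a TSV file. The file `utt_spk_text.tsv` contains a file ID, a user ID and the transcription of the audio.

The data set has been manually quality checked, but there might still be errors.

The data set was collected by Google in collaboration with Reykjavik University (Iceland) and Universitas Gadjah Mada (Indonesia).

The dataset is distributed under the CC-BY-SA-4.0 license. See [LICENSE](https://www.openslr.org/resources/35/LICENSE) and https://github.com/google/language-resources#license.

Copyright 2016, 2017 Google, Inc.

(Descriptions of the remaining SLR resources follow the format of the original card and are omitted here.)

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

Javanese, Khmer, Nepali, Sundanese, Malayalam, Marathi, Tamil, Telugu, Catalan, Nigerian English, Chilean Spanish, Colombian Spanish, Peruvian Spanish, Puerto Rico Spanish, Venezuelan Spanish, Basque, Galician, Gujarati, Kannada, Afrikaans, Sesotho, Setswana and isiXhosa.

## Dataset Structure

### Data Instances

A typical data point comprises the audio file path `path` and the corresponding transcription `sentence`. An example:

```json
{
  "path": "/home/cahya/.cache/huggingface/datasets/downloads/extracted/4d9cf915efc21110199074da4d492566dee6097068b07a680f670fcec9176e62/su_id_female/wavs/suf_00297_00037352660.wav",
  "audio": {
    "path": "/home/cahya/.cache/huggingface/datasets/downloads/extracted/4d9cf915efc21110199074da4d492566dee6097068b07a680f670fcec9176e62/su_id_female/wavs/suf_00297_00037352660.wav",
    "array": [-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449],
    "sampling_rate": 16000
  },
  "sentence": "Panonton ting haruleng ningali Kelly Clarkson keur nyanyi di tipi"
}
```

### Data Fields

- `path`: The path to the audio file.
- `audio`: A dictionary containing the path to the audio file, the decoded floating-point array, and the sampling rate. Note that accessing `dataset[0]["audio"]` automatically decodes the audio file and resamples it to the rate given by `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files takes a long time, so query the sample index before the audio column: always use `dataset[0]["audio"]` rather than `dataset["audio"][0]`.
- `sentence`: The text the user was prompted to read, that is, the transcription of the audio.

### Data Splits

Every configuration contains a single `train` split. The number of examples per configuration is shown below:

| Configuration | Number of examples |
|----------|----------|
| SLR41 | 5822 |
| SLR42 | 2906 |
| SLR43 | 2064 |
| SLR44 | 4213 |
| SLR63 | 4126 |
| SLR64 | 1569 |
| SLR65 | 4284 |
| SLR66 | 4448 |
| SLR69 | 4240 |
| SLR35 | 185076 |
| SLR36 | 219156 |
| SLR70 | 3359 |
| SLR71 | 4374 |
| SLR72 | 4903 |
| SLR73 | 5447 |
| SLR74 | 617 |
| SLR75 | 3357 |
| SLR76 | 7136 |
| SLR77 | 5587 |
| SLR78 | 4272 |
| SLR79 | 4400 |
| SLR80 | 2530 |
| SLR86 | 3583 |
| SLR32 | 9821 |
| SLR52 | 185293 |
| SLR53 | 218703 |
| SLR54 | 157905 |
| SLR83 | 17877 |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

All datasets are distributed under Creative Commons Attribution-ShareAlike 4.0 International Public License (CC-BY-SA-4.0). See https://github.com/google/language-resources#license or the OpenSLR resource page at https://openslr.org/resources.php for more information.

### Citation Information

(The citation entries are omitted here; see the original card for the full list.)

### Contributions

Thanks to [@cahya-wirawan](https://github.com/cahya-wirawan) for adding this dataset.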

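Provider:

openslr

Summary of Original Information

Dataset Overview

Dataset name: OpenSLR

Languages:

Multiple languages are supported, including but not limited to: af, bn, ca, en, es, eu, gl, gu, jv, km, kn, ml, mr, my, ne, si, st, su, ta, te, tn, ve, xh, yo
Example BCP47 language codes: en-GB, en-IE, en-NG, es-CL, es-CO, es-PE, es-PR

License: cc-by-sa-4.0

Multilinguality: multilingual

Size category: 1K<n<10K

Source datasets: original

Task category: automatic speech recognition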

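Dataset Structure

Data Instances

Features:
- path: the path to the data, of type string.
- audio: the audio data, with a sampling rate of 48000, of type audio.
- sentence: the sentence text, of type string.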
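
As a minimal sketch of how the three fields fit together, the snippet below checks a single record against the field names and types listed above. The `validate_record` helper is hypothetical, the keys expected inside `audio` (`path`, `array`, `sampling_rate`) are taken from the data instance shown in the card, and the example path is shortened for illustration.

```python
from typing import Any, Dict, List

# Field names and types as listed in the card; the helper itself is hypothetical.
EXPECTED_FIELDS = {"path": str, "audio": dict, "sentence": str}
AUDIO_KEYS = ("path", "array", "sampling_rate")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    audio = record.get("audio")
    if isinstance(audio, dict):
        for key in AUDIO_KEYS:
            if key not in audio:
                problems.append(f"audio is missing key: {key}")
    return problems


# Shortened, illustrative record; the array is truncated to three samples.
example = {
    "path": "wavs/suf_00297_00037352660.wav",
    "audio": {
        "path": "wavs/suf_00297_00037352660.wav",
        "array": [-0.00048828, -0.00018311, -0.00137329],
        "sampling_rate": 48000,
    },
    "sentence": "Panonton ting haruleng ningali Kelly Clarkson keur nyanyi di tipi",
}

print(validate_record(example))
print(validate_record({"path": "a.wav"}))
```

A check like this is cheap because it never touches the decoded samples beyond confirming that the `array` key is present.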

Data Splits

Train split:
- SLR41: 5822 examples, 2423902 bytes.
- SLR42: 2906 examples, 1427984 bytes.
- SLR43: 2064 examples, 1074005 bytes.
- SLR44: 4213 examples, 1776827 bytes.
- SLR63: 4126 examples, 2016587 bytes.
- SLR64: 1569 examples, 810375 bytes.
- SLR65: 4284 examples, 2136447 bytes.
- SLR66: 4448 examples, 1898335 bytes.
- SLR69: 4240 examples, 1647263 bytes.
- SLR35: 185076 examples, 73565374 bytes.
- SLR36: 219156 examples, 88942337 bytes.
- SLR70: 3359 examples, 1339608 bytes.
- SLR71: 4374 examples, 1676273 bytes.
- SLR72: 4903 examples, 1876301 bytes.
- SLR73: 5447 examples, 2084052 bytes.
- SLR74: 617 examples, 237395 bytes.
- SLR75: 3357 examples, 1286937 bytes.
- SLR76: 7136 examples, 2756507 bytes.
- SLR77: 5587 examples, 2217652 bytes.
- SLR78: 4272 examples, 2121986 bytes.
- SLR79: 4400 examples, 2176539 bytes.
- SLR80: 2530 examples, 1308651 bytes.
- SLR86: 3583 examples, 1378801 bytes.
- SLR32: 9821 examples, 4544052380 bytes.
- SLR52: 185293 examples, 77369899 bytes.
- SLR53: 218703 examples, 88073248 bytes.
- SLR54: 157905 examples, 62735822 bytes.
- SLR83: 17877 examples, 7098985 bytes.
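
The per-configuration counts in the split list above can be tallied and compared with the declared size category `1K<n<10K`. The snippet below is a small sketch that copies the example counts from that list; the helper name `in_declared_bucket` is hypothetical, and reading the size category as a strict bound on each configuration is an assumption made only for this check.

```python
# Example counts copied from the split list above (train split of each configuration).
NUM_EXAMPLES = {
    "SLR41": 5822, "SLR42": 2906, "SLR43": 2064, "SLR44": 4213,
    "SLR63": 4126, "SLR64": 1569, "SLR65": 4284, "SLR66": 4448,
    "SLR69": 4240, "SLR35": 185076, "SLR36": 219156, "SLR70": 3359,
    "SLR71": 4374, "SLR72": 4903, "SLR73": 5447, "SLR74": 617,
    "SLR75": 3357, "SLR76": 7136, "SLR77": 5587, "SLR78": 4272,
    "SLR79": 4400, "SLR80": 2530, "SLR86": 3583, "SLR32": 9821,
    "SLR52": 185293, "SLR53": 218703, "SLR54": 157905, "SLR83": 17877,
}


def in_declared_bucket(n: int, low: int = 1_000, high: int = 10_000) -> bool:
    """True if n falls inside the declared 1K<n<10K size category."""
    return low < n < high


total = sum(NUM_EXAMPLES.values())
outside = sorted(name for name, n in NUM_EXAMPLES.items() if not in_declared_bucket(n))

print(f"{len(NUM_EXAMPLES)} configurations, {total} examples in total")
print("outside 1K<n<10K:", outside)
```

Under that reading, several configurations fall outside the declared bucket, so the size category is best treated as a rough label rather than a bound on every configuration.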

Dataset configuration names

SLR32, SLR35, SLR36, SLR41, SLR42, SLR43, SLR44, SLR52, SLR53, SLR54, SLR63, SLR64, SLR65, SLR66, SLR69, SLR70, SLR71, SLR72, SLR73, SLR74, SLR75, SLR76, SLR77, SLR78, SLR79, SLR80, SLR83, SLR86

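Collected Summary

Dataset Introduction

Construction

The OpenSLR dataset is built from multilingual speech resources, ranging from high-quality text-to-speech data to speech recognition training data. Collection involved several partner institutions, such as Google, North West University and Gadjah Mada University. Each sub-dataset (for example SLR32 or SLR35) contains waveform files and the corresponding transcription files. These files have been quality checked, although some errors may remain. The dataset is intended to support tasks such as speech recognition and text-to-speech, and to serve as training material for researchers and developers.

Characteristics

The main characteristic of the OpenSLR dataset is its language coverage, which includes South African languages, Javanese, Khmer, Malayalam and others. Its transcribed audio files and accompanying metadata make it a useful resource for research on speech recognition and text-to-speech. The diversity is not limited to languages: it also covers different types of speech data, such as high-quality multi-speaker recordings and large-scale training data for automatic speech recognition.

Usage

To use the OpenSLR dataset, first choose the sub-dataset that suits the task, for example SLR32 or SLR35. Each sub-dataset contains waveform audio files and the corresponding transcription files, which can be used to train speech recognition or text-to-speech models. The dataset can be accessed and downloaded through platforms such as HuggingFace, and preprocessed and used for model training with a programming language such as Python. The license allows the data to be used and redistributed freely, provided the terms of the Creative Commons Attribution-ShareAlike 4.0 International Public License are followed.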
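
As a rough sketch of the preprocessing step mentioned above, the snippet below turns a two-column transcription index (FileID, transcription), as the card describes for `line_index.tsv` in SLR86, into `path`/`sentence` records. The function name `read_line_index`, the `wavs` directory, the `.wav` suffix and the second sample row are assumptions made for illustration; indexes with extra columns (such as the line id in SLR83 or the user ID in `utt_spk_text.tsv`) would need different column positions.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def read_line_index(lines: Iterable[str], wav_dir: str = "wavs") -> List[Dict[str, str]]:
    """Turn rows of a two-column index (FileID, transcription) into records."""
    records = []
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip blank rows and rows with an unexpected layout
        file_id, sentence = row[0].strip(), row[1].strip()
        records.append({
            # Assumed layout: <wav_dir>/<FileID>.wav
            "path": str(PurePosixPath(wav_dir) / f"{file_id}.wav"),
            "sentence": sentence,
        })
    return records


# In-memory sample; the second data row is hypothetical.
sample = io.StringIO(
    "suf_00297_00037352660\tPanonton ting haruleng ningali Kelly Clarkson keur nyanyi di tipi\n"
    "\n"
    "suf_00297_00000000001\tHypothetical second sentence\n"
)
records = read_line_index(sample)
print(len(records), records[0]["path"])
```

In practice the same function could be given an open file handle instead of the in-memory sample, since it only needs an iterable of lines.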

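Background and Challenges

Background Overview

OpenSLR is a platform dedicated to hosting speech and language resources, providing corpora and related software for speech recognition training. The dataset was created by Google in collaboration with other institutions and contains high-quality transcribed audio in many languages, including South African languages, Javanese, Khmer and Malayalam. Its central research question is how large-scale multilingual speech data can improve automatic speech recognition (ASR) and text-to-speech (TTS) systems. OpenSLR has become a widely used resource in the speech recognition field and has supported research on and applications of multilingual speech technology.

Current Challenges

Building the OpenSLR dataset involves several challenges. First, collecting and transcribing multilingual data spans different cultural and linguistic backgrounds, which makes it hard to keep quality consistent. Second, the scale and diversity of the data require efficient storage and processing to manage and analyze a very large number of audio files. Third, protecting privacy and security matters, especially because the data consists of individuals' voices. Finally, the dataset needs to be updated and extended over time to reflect developments in languages and speech technology.

Common Scenarios

Typical Use Cases

The OpenSLR dataset is widely used in speech recognition and text-to-speech (TTS). A typical use case is training multilingual speech recognition models: the audio and its transcriptions help a model recognize different languages and dialects. The dataset can also be used to build TTS systems, where multi-speaker audio helps make synthesized speech more natural and more varied.

Practical Applications

In practice, the OpenSLR dataset is used to develop voice assistants, speech translation systems and speech recognition software. For example, a company can train models on the dataset to build a speech recognition system that supports multiple languages. The dataset can also be applied in areas such as education, healthcare and customer service, where speech technology can improve the efficiency and quality of services.

Derived and Related Work

Researchers have built on the OpenSLR dataset in work such as optimizing multilingual speech recognition models and studying cross-lingual speech synthesis. Such work has improved the performance of speech recognition and TTS. For example, some studies have used the dataset to develop speech recognition systems that adapt to different languages and dialects, with the aim of improving robustness and applicability.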

The content above was collected and summarized by 遇见数据集.
