ylacombe/google-tamil

Name: ylacombe/google-tamil
Creator: ylacombe
Published: 2023-11-27 11:37:22
License: CC BY-SA 4.0

Hugging Face · Updated 2023-11-27 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ylacombe/google-tamil

Resource overview:

---
dataset_info:
- config_name: female
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 1364555763.88
    num_examples: 2335
  download_size: 1006094564
  dataset_size: 1364555763.88
- config_name: male
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: speaker_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 1064641765.528
    num_examples: 1956
  download_size: 781072069
  dataset_size: 1064641765.528
configs:
- config_name: female
  data_files:
  - split: train
    path: female/train-*
- config_name: male
  data_files:
  - split: train
    path: male/train-*
license: cc-by-sa-4.0
task_categories:
- text-to-speech
- text-to-audio
language:
- ta
pretty_name: Tamil Speech
---

# Dataset Card for Tamil Speech

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Statistics](#data-statistics)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Crowdsourced high-quality Tamil multi-speaker speech data set.](https://www.openslr.org/65/)
- **Repository:** [Google Language Resources and Tools](https://github.com/google/language-resources)
- **Paper:** [Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems](https://aclanthology.org/2020.lrec-1.804/)

### Dataset Summary

This dataset consists of 7 hours of transcribed high-quality audio of Tamil sentences recorded by 50 volunteers. The dataset is intended for speech technologies.

The data archives were restructured from the original ones from [OpenSLR](http://www.openslr.org/65/) to make them easier to stream.

### Supported Tasks

- `text-to-speech`, `text-to-audio`: The dataset can be used to train a model for Text-To-Speech (TTS).
- `automatic-speech-recognition`, `speaker-identification`: The dataset can also be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).

### How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared on your local drive in one call to the `load_dataset` function.

For example, to download the female config, specify the corresponding config name (i.e., "female" for female speakers):

```python
from datasets import load_dataset

dataset = load_dataset("ylacombe/google-tamil", "female", split="train")
```

Using the `datasets` library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Streaming mode loads individual samples one at a time, rather than downloading the entire dataset to disk.
```python
from datasets import load_dataset

dataset = load_dataset("ylacombe/google-tamil", "female", split="train", streaming=True)

# Fetch and print the first streamed sample
print(next(iter(dataset)))
```

#### *Bonus*

You can create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed).

**Local:**

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

dataset = load_dataset("ylacombe/google-tamil", "female", split="train")

# Draw random batches of 32 indices, keeping the final partial batch
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False)
dataloader = DataLoader(dataset, batch_sampler=batch_sampler)
```

**Streaming:**

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("ylacombe/google-tamil", "female", split="train", streaming=True)
dataloader = DataLoader(dataset, batch_size=32)
```

To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets).

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, stored under `audio`, and its transcription, stored under `text`. The identifier of the speaker is provided as `speaker_id`.

```
{'audio': {'path': 'taf_02345_00348037167.wav',
           'array': array([-9.15527344e-05, -9.15527344e-05, -1.22070312e-04, ...,
                           -3.05175781e-05,  0.00000000e+00,  3.05175781e-05]),
           'sampling_rate': 48000},
 'text': 'ஆஸ்த்ரேலியப் பெண்ணுக்கு முப்பத்தி மூன்று ஆண்டுகளுக்குப் பின்னர் இந்தியா இழப்பீடு வழங்கியது',
 'speaker_id': 2345}
```

### Data Fields

- audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when accessing the audio column with `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`.
Decoding and resampling a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
- text: the transcription of the audio file.
- speaker_id: unique id of the speaker. The same speaker id can be found for multiple data samples.

### Data Statistics

|        | Total duration (h) | Average duration (s) | # speakers | # sentences | # total words | # unique words | # total syllables | # unique syllables | # total phonemes | # unique phonemes |
|--------|--------------------|----------------------|------------|-------------|---------------|----------------|-------------------|--------------------|------------------|-------------------|
| Female | 4.01               | 6.18                 | 25         | 2,335       | 15,880        | 6,620          | 56,607            | 1,696              | 126,659          | 37                |
| Male   | 3.07               | 5.66                 | 25         | 1,956       | 13,545        | 6,159          | 48,049            | 1,642              | 107,570          | 37                |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

License: ([CC BY-SA 4.0 DEED](https://creativecommons.org/licenses/by-sa/4.0/deed.en))

### Citation Information

```
@inproceedings{he-etal-2020-open,
  title = {{Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems}},
  author = {He, Fei and Chu, Shan-Hui Cathy and Kjartansson, Oddur and Rivera, Clara and Katanova, Anna and Gutkin, Alexander and Demirsahin, Isin and Johny, Cibu and Jansche, Martin and Sarin, Supheakmungkol and Pipatsrisawat, Knot},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
  month = may,
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  pages = {6494--6503},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.800},
  ISBN = {979-10-95546-34-4},
}
```

### Contributions

Thanks to [@ylacombe](https://github.com/ylacombe) for adding this dataset.

This dataset consists of 7 hours of transcribed high-quality audio of Tamil sentences recorded by 50 volunteers, intended for speech technologies such as Text-To-Speech (TTS) and Automatic Speech Recognition (ASR). It is divided into female and male configurations, each containing audio, text, and speaker_id features. The data is sourced from OpenSLR and restructured for easier streaming. The dataset supports tasks like text-to-speech, text-to-audio, automatic-speech-recognition, and speaker-identification.
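
The dataset card names word error rate (WER) as the most common metric for the ASR task. As a rough illustration, WER can be computed as the word-level edit distance between a reference transcription and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this sketch; they are not part of the dataset or of any particular library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption and may not suit all text.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder strings, not taken from the dataset:
# one substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

In practice the reference would be the `text` field of a sample and the hypothesis the output of an ASR model; any text normalization applied before scoring is a separate choice that this sketch does not make.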

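Provided by:

ylacombe

Summary of original information

Dataset description

Dataset overview

The dataset contains 7 hours of high-quality Tamil audio recorded by 50 volunteers and is intended for speech technologies. It is split into two configurations: female and male.

Supported tasks

text-to-speech
text-to-audio

Dataset structure

Data instances

Each data point contains the audio file path (audio), the transcription of the audio (text), and the speaker ID (speaker_id).

Data fields

audio: contains the audio filename, the decoded audio array, and the sampling rate.
text: the transcription of the audio file.
speaker_id: the unique ID of the speaker.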
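
As a minimal sketch of how these three fields might be used together, the snippet below computes a clip's duration from the decoded array and the sampling rate, and counts utterances per speaker. The records are synthetic stand-ins that only mirror the documented structure (silent audio, placeholder text, made-up filenames and speaker IDs); the helper names `clip_duration_seconds` and `utterances_per_speaker` are hypothetical.

```python
from collections import Counter


def clip_duration_seconds(sample: dict) -> float:
    """Duration = number of decoded audio samples / sampling rate (Hz)."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def utterances_per_speaker(samples) -> Counter:
    """Count how many data points share each speaker_id."""
    return Counter(sample["speaker_id"] for sample in samples)


# Synthetic records shaped like the documented fields (not real data).
samples = [
    {"audio": {"path": "clip_a.wav", "array": [0.0] * 96000, "sampling_rate": 48000},
     "text": "placeholder one", "speaker_id": 1},
    {"audio": {"path": "clip_b.wav", "array": [0.0] * 48000, "sampling_rate": 48000},
     "text": "placeholder two", "speaker_id": 1},
    {"audio": {"path": "clip_c.wav", "array": [0.0] * 24000, "sampling_rate": 48000},
     "text": "placeholder three", "speaker_id": 2},
]

for sample in samples:
    print(sample["audio"]["path"], clip_duration_seconds(sample), "s")

print(utterances_per_speaker(samples))
```

With the real dataset, `array` is a decoded NumPy array, and `len` works on it the same way as on the plain lists used here.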

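Data statistics

|        | Total duration (h) | Average duration (s) | # speakers | # sentences | # total words | # unique words | # total syllables | # unique syllables | # total phonemes | # unique phonemes |
|--------|--------------------|----------------------|------------|-------------|---------------|----------------|-------------------|--------------------|------------------|-------------------|
| Female | 4.01               | 6.18                 | 25         | 2,335       | 15,880        | 6,620          | 56,607            | 1,696              | 126,659          | 37                |
| Male   | 3.07               | 5.66                 | 25         | 1,956       | 13,545        | 6,159          | 48,049            | 1,642              | 107,570          | 37                |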
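
The figures in this table can be cross-checked with simple arithmetic: the number of sentences multiplied by the average duration, divided by 3600, should approximate the reported total duration. The sketch below does this using only the values from the table; the dictionary layout and the name `estimated_hours` are illustrative choices, and small differences are expected because the averages are rounded to two decimals.

```python
# Values copied from the statistics table above.
stats = {
    "female": {"sentences": 2335, "avg_seconds": 6.18, "reported_hours": 4.01, "speakers": 25},
    "male": {"sentences": 1956, "avg_seconds": 5.66, "reported_hours": 3.07, "speakers": 25},
}


def estimated_hours(sentences: int, avg_seconds: float) -> float:
    """Total duration in hours implied by sentence count and mean clip length."""
    return sentences * avg_seconds / 3600


for name, row in stats.items():
    estimate = estimated_hours(row["sentences"], row["avg_seconds"])
    print(f"{name}: estimated {estimate:.3f} h, reported {row['reported_hours']:.2f} h")

total_hours = sum(row["reported_hours"] for row in stats.values())
total_speakers = sum(row["speakers"] for row in stats.values())
total_sentences = sum(row["sentences"] for row in stats.values())
print(f"total: {total_hours:.2f} h, {total_speakers} speakers, {total_sentences} sentences")
```

Both estimates land within about 0.01 h of the reported totals, and the combined totals are consistent with the summary's "7 hours" and "50 volunteers".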

Dataset creation

Personal and sensitive information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

Considerations for using the data

Licensing information

License: CC BY-SA 4.0 DEED

Citation information

@inproceedings{he-etal-2020-open,
  title = {{Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems}},
  author = {He, Fei and Chu, Shan-Hui Cathy and Kjartansson, Oddur and Rivera, Clara and Katanova, Anna and Gutkin, Alexander and Demirsahin, Isin and Johny, Cibu and Jansche, Martin and Sarin, Supheakmungkol and Pipatsrisawat, Knot},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
  month = may,
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  pages = {6494--6503},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.800},
  ISBN = {979-10-95546-34-4},
}

Contributions

Thanks to @ylacombe for adding this dataset.
