
MLRS/masri_synthetic

Source: Hugging Face. Updated 2024-08-03. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/MLRS/masri_synthetic
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- mt
license: cc-by-nc-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
task_ids: []
pretty_name: 'MASRI-SYNTHETIC: Synthetized Speech with Transcriptions in Maltese.'
tags:
- masri
- maltese
- masri-project
- malta
- synthetic speech
- tts
dataset_info:
  config_name: masri_synthetic
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: gender
    dtype: string
  - name: duration
    dtype: float32
  - name: speech_rate
    dtype: string
  - name: pitch
    dtype: string
  - name: normalized_text
    dtype: string
  splits:
  - name: train
    num_bytes: 6538380561.5
    num_examples: 52500
  download_size: 6535598074
  dataset_size: 6538380561.5
configs:
- config_name: masri_synthetic
  data_files:
  - split: train
    path: masri_synthetic/train-*
  default: true
---

# Dataset Card for masri_synthetic

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [MASRI Project](https://www.um.edu.mt/projects/masri/)
- **Repository:** [MASRI Data Repo](https://github.com/UMSpeech/)
- **Repository:** [LDC](https://catalog.ldc.upenn.edu/LDC2022S08)
- **Paper:** [Data Augmentation for Speech Recognition in Maltese: A Low-Resource Perspective](https://www.um.edu.mt/library/oar/bitstream/123456789/92466/1/Data_Augmentation_for_Speech_Recognition_in_Maltese_A_Low_Resource_Perspective%282021%29.pdf)
- **Paper:** [Analysis of Data Augmentation Methods for Low-Resource Maltese ASR](https://arxiv.org/pdf/2111.07793.pdf)

### Dataset Summary

MASRI-SYNTHETIC is a corpus of synthesized speech in Maltese. The text-to-speech (TTS) system used to produce the utterances was developed by the Research & Development Department of Crimsonwing p.l.c. The sentences used to create the corpus were extracted from the [MLRS Corpus](https://mlrs.research.um.edu.mt/index.php?page=corpora), a corpus of written or transcribed Maltese divided into genres such as culture, news, academic, religion, and sports.

[MASRI](https://www.um.edu.mt/projects/masri/) stands for "Maltese Automatic Speech Recognition I". It is a project at the [University of Malta](https://www.um.edu.mt/), funded by the University of Malta Research Fund Award Scheme.

### Example Usage

MASRI-SYNTHETIC contains the train split only:

```python
from datasets import load_dataset

masri_synthetic = load_dataset("MLRS/masri_synthetic")
```

The split can also be requested explicitly:

```python
from datasets import load_dataset

masri_synthetic = load_dataset("MLRS/masri_synthetic", split="train")
```

### Supported Tasks

`automatic-speech-recognition`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text.
The most common evaluation metric is the word error rate (WER).

### Languages

The language of the corpus is Maltese.

## Dataset Structure

### Data Instances

```python
{
  'audio_id': 'MSRSY_F_0042_RN01PP10_0143',
  'audio': {
    'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/17d8c60020489a5a43ba0cf322ed7c121375915c671b57fbdb03950befbd1a9c/female/F_0042_RN01PP10/MSRSY_F_0042_RN01PP10_0143.flac',
    'array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': 'F_0042',
  'gender': 'female',
  'duration': 9.0,
  'speech_rate': '-01',
  'pitch': '+10',
  'normalized_text': "il-poplu b' pakkett ta' negozjati f' id-direttur ġenerali tal-uffiċċju tal-pubblikazzjonijiet uffiċjali għall-komunitajiet ewropej"
}
```

### Data Fields

* `audio_id` (string) - ID of the audio segment.
* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (the default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of the audio file inside its archive (as files are not downloaded and extracted locally).
* `speaker_id` (string) - ID of the synthetic voice.
* `gender` (string) - gender of the synthetic voice (male or female).
* `duration` (float32) - duration of the audio file in seconds.
* `speech_rate` (string) - speech rate setting, from -2 to +2.
* `pitch` (string) - pitch setting, from -10 to +10.
* `normalized_text` (string) - normalized transcription of the audio segment.

### Data Splits

The corpus consists only of the train split, which has a total of 52500 speech files from 105 male and 105 female voices, with a total duration of 99 hours and 18 minutes.

## Dataset Creation

### Curation Rationale

The MASRI-SYNTHETIC CORPUS (MSYC) has the following characteristics:

* The MSYC has an exact duration of 99 hours and 18 minutes. It has 52500 audio files.
* The MSYC has recordings from 210 different voices: 105 male and 105 female.
* Voices were produced by varying the TTS settings across 21 values of pitch (-10 to +10) and 5 values of speech rate (-2 to +2).
* Data in the MSYC is organized by voice: all the utterances belonging to a single voice are stored in a single directory.
* Data is also organized by the gender (male/female) of the voice.
* Each voice is assigned 250 utterances of 13 words each.
* Every audio file in the MSYC has a duration between 2 and 10 seconds approximately.
* Audio files in the MSYC are distributed in a 16 kHz, 16-bit mono format.
* Transcriptions in the MSYC are in lowercase. No punctuation marks are permitted except for dashes (-) and apostrophes (') due to their importance in Maltese orthography.
* Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.

### Source Data

#### Initial Data Collection and Normalization

The MASRI-SYNTHETIC CORPUS was made possible by the text-to-speech (TTS) system developed by the Research & Development Department of Crimsonwing p.l.c. The sentences used to create the corpus were extracted from the [MLRS Corpus](https://mlrs.research.um.edu.mt/index.php?page=corpora).

### Annotations

#### Annotation process

Sentences from the [MLRS Corpus](https://mlrs.research.um.edu.mt/index.php?page=corpora) were selected and used to create synthetic utterances. MASRI-SYNTHETIC consists of synthetic utterances only.

#### Who are the annotators?

The authors selected the sentences to be synthesized.

### Personal and Sensitive Information

The corpus consists of synthetic speech utterances from a TTS system. No personal or sensitive information is shared.

## Considerations for Using the Data

### Social Impact of Dataset

The MASRI-SYNTHETIC CORPUS is currently the only Maltese corpus of synthetic speech, and it is publicly available under a CC-BY-NC-SA-4.0 license.
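The figures quoted under Curation Rationale can be cross-checked with a few lines of arithmetic. The sketch below assumes that each voice corresponds to one pitch and speech-rate combination per gender; the card implies this (21 × 5 = 105 voices per gender) but does not state it outright. All constant and variable names are illustrative.

```python
# Figures quoted in the Curation Rationale list.
PITCH_VALUES = range(-10, 11)        # 21 pitch settings, -10..+10
RATE_VALUES = range(-2, 3)           # 5 speech-rate settings, -2..+2
GENDERS = ("female", "male")
UTTERANCES_PER_VOICE = 250
TOTAL_SECONDS = 99 * 3600 + 18 * 60  # 99 hours 18 minutes

# Assumption: one voice per (pitch, rate) pair and gender.
voices_per_gender = len(PITCH_VALUES) * len(RATE_VALUES)
total_voices = voices_per_gender * len(GENDERS)
total_files = total_voices * UTTERANCES_PER_VOICE
mean_duration = TOTAL_SECONDS / total_files

print(f"voices per gender: {voices_per_gender}")
print(f"total voices:      {total_voices}")
print(f"total files:       {total_files}")
print(f"mean duration:     {mean_duration:.2f} s")
```

Under that assumption the counts reproduce the 210 voices and 52500 files reported in the card, and the mean duration of roughly 6.8 seconds falls inside the stated range of 2 to 10 seconds per file.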
### Discussion of Biases

* Sentences from [MLRS](https://mlrs.research.um.edu.mt/index.php?page=corpora) are put in a single plain text file. The text includes punctuation marks.
* To facilitate text processing, sentences are split to fit into lines of 30 words.
* Punctuation marks and sentences containing non-UTF-8 characters are removed.
* Sentences with foreign words and proper names are removed.
* As the letters "c" and "y" do not really belong to the Maltese alphabet, sentences including words with either of those letters are removed. This ensures that only Maltese words are included in each sentence.
* Using Python, the resulting sentences are put into a single flat list in which each element is a word.
* The words of the list are taken one by one to produce text lines of exactly 13 words. This process generated only 27714 of the 52500 sentences that constitute the whole corpus.
* To produce the remaining sentences, the words of the list were shuffled and the previous step was repeated until the 52500 sentences needed by the corpus were obtained.
* Finally, the sentences were converted into utterances using the TTS system.

### Other Known Limitations

The MASRI team does not guarantee the accuracy of this corpus, nor its suitability for any specific purpose. In fact, we expect a number of errors, omissions and inconsistencies to remain in the corpus.

## Additional Information

### Dataset Curators

The speech sentences were selected and synthesized by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena) at the Msida Campus of the [University of Malta](https://www.um.edu.mt/) in June 2020.
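The sentence-generation steps listed under Discussion of Biases can be sketched roughly as follows. This is a minimal illustration and not the authors' script: the function names (`clean`, `has_non_maltese_letters`, `chunk`, `build_lines`) and the `seed` parameter are hypothetical, and the 30-word line splitting, the non-UTF-8 check, and the foreign-word and proper-name filter are left out because the card does not say how they were done.

```python
import random
import re

WORDS_PER_LINE = 13  # every synthesized sentence has exactly 13 words


def clean(sentence: str) -> str:
    """Lowercase and drop punctuation, keeping dashes and apostrophes."""
    kept = re.sub(r"[^\w\s'-]", " ", sentence.lower())
    return " ".join(kept.split())


def has_non_maltese_letters(sentence: str) -> bool:
    """True if the sentence contains a plain 'c' or 'y' (ċ is a different letter)."""
    return any(ch in "cy" for ch in sentence)


def chunk(words, size=WORDS_PER_LINE):
    """Yield consecutive lines of exactly `size` words; a short tail is dropped."""
    for start in range(0, len(words) - size + 1, size):
        yield " ".join(words[start:start + size])


def build_lines(sentences, target, seed=0):
    """Build `target` lines: one pass in original order, then shuffled passes."""
    rng = random.Random(seed)  # seed is an assumption, added for reproducibility
    words = [
        word
        for sentence in map(clean, sentences)
        if sentence and not has_non_maltese_letters(sentence)
        for word in sentence.split()
    ]
    if len(words) < WORDS_PER_LINE:
        raise ValueError("not enough words to build a single line")
    lines = list(chunk(words))
    while len(lines) < target:
        rng.shuffle(words)          # reorder the same vocabulary
        lines.extend(chunk(words))  # and cut it into 13-word lines again
    return lines[:target]


# Toy input: the example transcription from the card plus one rejected sentence.
sample = ("il-poplu b' pakkett ta' negozjati f' id-direttur ġenerali "
          "tal-uffiċċju tal-pubblikazzjonijiet uffiċjali "
          "għall-komunitajiet ewropej")
for line in build_lines([sample, "Hello, city!"], target=3):
    print(line)
```

Because the shuffled passes reuse the same word list, the later lines are random word sequences rather than natural sentences, which is consistent with the example transcription shown under Data Instances.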
### Licensing Information

[CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

### Citation Information

```
@misc{carlosmenamasrisynthetic2020,
  title={MASRI-SYNTHETIC: Synthetized Speech with Transcriptions in Maltese.},
  author={Hernandez Mena, Carlos Daniel and Gatt, Albert and DeMarco, Andrea and Borg, Claudia and van der Plas, Lonneke},
  journal={MASRI Project, Malta},
  year={2020},
  url={https://huggingface.co/datasets/MLRS/masri_synthetic},
}
```

MASRI-SYNTHETIC was also published at [LDC](https://catalog.ldc.upenn.edu/LDC2022S08) in 2022.

### Contributions

The authors would like to thank KPMG Microsoft Business Solutions (formerly CrimsonWing) for providing the TTS system used in our experiments. For more information about the CrimsonWing TTS system, see [this presentation](https://pdfs.semanticscholar.org/5e5a/25e34b3c351ba0e58211a5192535e9ddea06.pdf).

We also thank the University of Malta Research Fund Award Scheme for making this project possible.
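Provider:
MLRS
Summary of original information

Dataset overview

Dataset name

  • Name: MASRI-SYNTHETIC
  • Alias: MASRI-SYNTHETIC: Synthetized Speech with Transcriptions in Maltese.

Dataset description

  • Summary: MASRI-SYNTHETIC is a corpus of synthesized Maltese speech generated with a text-to-speech (TTS) system. The corpus is intended for automatic speech recognition (ASR) tasks, where the main evaluation metric is the word error rate (WER).
  • Language: Maltese
  • License: CC-BY-NC-SA-4.0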
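
The summary names the word error rate as the main metric. As a minimal sketch, WER can be computed as the word-level edit distance between a reference transcription and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the hypothesis string below is invented for the example; published evaluations typically rely on an established library rather than a hand-written routine.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "il-poplu b' pakkett ta' negozjati"
hypothesis = "il-poplu pakkett ta negozjati"  # invented model output
print(word_error_rate(reference, hypothesis))  # one deletion + one substitution over 5 words -> 0.4
```

Note that the apostrophes and dashes kept in the normalized transcriptions are part of each word, so a hypothesis that drops them counts as a substitution.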

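Dataset structure

  • Data instances: each instance contains an audio ID, audio file path, audio array, sampling rate, speaker ID, gender, duration, speech rate, pitch, and normalized text transcription.
  • Data fields: audio ID, audio information, speaker ID, gender, duration, speech rate, pitch, and normalized text.
  • Data splits: train split only, with 52500 speech files and a total duration of 99 hours and 18 minutes.

Dataset creation

  • Source data: synthetic speech generated with the TTS system developed by Crimsonwing p.l.c. from sentences extracted from the MLRS Corpus.
  • Annotations: the authors selected the sentences from the MLRS Corpus to be synthesized.
  • Personal and sensitive information: the dataset consists of synthetic speech and contains no personal or sensitive information.

Considerations for use

  • Social impact: currently the only publicly available corpus of synthetic Maltese speech.
  • Discussion of biases: certain characters and non-Maltese words were removed during processing, which may affect how representative the corpus is.
  • Other known limitations: the dataset may contain errors, omissions, and inconsistencies; its accuracy and suitability are not guaranteed.

Additional information

  • Dataset curator: Carlos Daniel Hernández Mena, University of Malta
  • License information: CC-BY-NC-SA-4.0
  • Citation information: see the citation information provided.
  • Contributions: thanks to KPMG Microsoft Business Solutions and the University of Malta Research Fund Award Scheme for their support.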