thennal/GMaSC

Name: thennal/GMaSC
Creator: thennal
Published: 2023-05-01 21:18:33
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2023-05-01; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/thennal/GMaSC

Resource description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: speaker
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  splits:
  - name: train
    num_bytes: 717976082.0
    num_examples: 2000
  download_size: 797772747
  dataset_size: 717976082.0
annotations_creators:
- expert-generated
language:
- ml
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: GEC Barton Hill Malayalam Speech Corpus
size_categories:
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-to-speech
- automatic-speech-recognition
task_ids: []
---

# GMaSC: GEC Barton Hill Malayalam Speech Corpus

**GMaSC** is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence has at least one English word common in Malayalam speech.

## Dataset Structure

The dataset consists of 2,000 instances with fields `text`, `speaker`, and `audio`. The audio is mono, sampled at 48 kHz. The transcription is normalized and only includes Malayalam characters and common punctuation.

The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:

| Speaker | Gender | Age | Time (HH:MM:SS) | Sentences |
| --- | --- | --- | --- | --- |
| Sonia | Female | 43 | 01:02:17 | 1,000 |
| Anil | Male | 48 | 01:17:23 | 1,000 |
| **Total** | | | **02:19:40** | **2,000** |

### Data Instances

An example instance is given below:

```python
{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
 'speaker': 'Sonia',
 'audio': {'path': None,
  'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604,
         0.00091553, 0.00094604]),
  'sampling_rate': 48000}}
```

### Data Fields

- **text** (str): Transcription of the audio file
- **speaker** (str): The name of the speaker
- **audio** (dict): Audio object including the loaded audio array, the sampling rate, and the path to the audio (always None)

### Data Splits

We provide all the data in a single `train` split. The loaded dataset object thus looks like this:

```python
DatasetDict({
    train: Dataset({
        features: ['text', 'speaker', 'audio'],
        num_rows: 2000
    })
})
```

## Additional Information

### Licensing

The corpus is made available under the [Creative Commons license (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

Provided by:

thennal

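Summary of original information

GEC Barton Hill Malayalam Speech Corpus: dataset overview

Basic dataset information

Name: GEC Barton Hill Malayalam Speech Corpus
Short name: GMaSC
Language: Malayalam
Multilinguality: monolingual
Dataset size: 1,000 < n < 10,000 instances
Annotation creators: expert-generated
License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)

Dataset contents

Number of instances: 2,000
Features:
- text (string): transcription of the audio file
- speaker (string): name of the speaker
- audio (dict): contains the audio array, the sampling rate, and the audio path (always None)
Audio properties:
- Sampling rate: 48,000 Hz
Speaker information:
- Sonia (female, age 43): 1,000 sentences, duration 1 h 2 min 17 s
- Anil (male, age 48): 1,000 sentences, duration 1 h 17 min 23 s
- Total: 2,000 sentences, duration 2 h 19 min 40 s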
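
The speaker totals above can be cross-checked with a little arithmetic. The following is a minimal sketch that sums the two per-speaker durations copied from the listing; the names `SPEAKER_DURATIONS`, `parse_hms`, and `format_hms` are illustrative and not part of the dataset or any library.

```python
from datetime import timedelta

# Per-speaker durations as listed above (HH:MM:SS).
SPEAKER_DURATIONS = {"Sonia": "01:02:17", "Anil": "01:17:23"}
NUM_SENTENCES = 2000


def parse_hms(value: str) -> timedelta:
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta: timedelta) -> str:
    """Format a timedelta back into 'HH:MM:SS'."""
    total = int(delta.total_seconds())
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


total = sum((parse_hms(v) for v in SPEAKER_DURATIONS.values()), timedelta())
total_hms = format_hms(total)
total_minutes = total.total_seconds() / 60
mean_seconds_per_sentence = total.total_seconds() / NUM_SENTENCES

print(total_hms)                      # 02:19:40
print(round(total_minutes, 1))        # 139.7
print(mean_seconds_per_sentence)      # 4.19
```

The sum matches the listed total of 2 h 19 min 40 s, which is consistent with the "approximately 139 minutes" quoted in the dataset card, and works out to roughly 4.2 seconds of audio per sentence.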

Dataset structure

Data splits: a single training split (train)
Dataset object structure: `DatasetDict({ train: Dataset({ features: ['text', 'speaker', 'audio'], num_rows: 2000 }) })`
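
For orientation, here is a hedged sketch of how the single `train` split might be loaded and how a clip's duration follows from the `audio` field. It assumes the Hugging Face `datasets` package is installed, that the repository id `thennal/GMaSC` is reachable, and that `audio` decodes to the dict shown in the dataset card; the helper names and `fake_instance` are hypothetical and exist only for illustration.

```python
def load_gmasc_train():
    """Load the single train split (needs the `datasets` package and network access)."""
    # Imported lazily so the helper below can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("thennal/GMaSC", split="train")


def clip_duration_seconds(instance: dict) -> float:
    """Duration of one instance: number of samples divided by the sampling rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in instance: one second of silence at 48 kHz,
# shaped like the example instance in the dataset card.
fake_instance = {
    "text": "",
    "speaker": "Sonia",
    "audio": {"path": None, "array": [0.0] * 48000, "sampling_rate": 48000},
}

print(clip_duration_seconds(fake_instance))  # 1.0
```

With the real data, `clip_duration_seconds(load_gmasc_train()[0])` would give the length of the first clip; note that the download is roughly 800 MB according to the card's `download_size`.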

Task categories

Text-to-speech
Automatic speech recognition
