thennal/GMaSC
Hugging Face · Updated 2023-05-01 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/thennal/GMaSC
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: speaker
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  splits:
  - name: train
    num_bytes: 717976082.0
    num_examples: 2000
  download_size: 797772747
  dataset_size: 717976082.0
annotations_creators:
- expert-generated
language:
- ml
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: GEC Barton Hill Malayalam Speech Corpus
size_categories:
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-to-speech
- automatic-speech-recognition
task_ids: []
---
# GMaSC: GEC Barton Hill Malayalam Speech Corpus
**GMaSC** is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence contains at least one English word common in Malayalam speech.
## Dataset Structure
The dataset consists of 2,000 instances with the fields `text`, `speaker`, and `audio`. The audio is mono, sampled at 48 kHz. The transcription is normalized and includes only Malayalam characters and common punctuation. The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:
| Speaker | Gender | Age | Time (HH:MM:SS) | Sentences |
| --- | --- | --- | --- | --- |
| Sonia | Female | 43 | 01:02:17 | 1,000 |
| Anil | Male | 48 | 01:17:23 | 1,000 |
| **Total** | | | **02:19:40** | **2,000** |
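The totals in the table can be checked with a little arithmetic. The following is a minimal sketch that uses only the figures from the table above. The helper names `to_seconds` and `to_hms` are illustrative and not part of the dataset or any library.

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an HH:MM:SS string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Per-speaker recording time and sentence count, copied from the table.
speakers = {
    "Sonia": ("01:02:17", 1000),
    "Anil": ("01:17:23", 1000),
}

total_seconds = sum(to_seconds(time) for time, _ in speakers.values())
total_sentences = sum(count for _, count in speakers.values())

print("Total time:", to_hms(total_seconds))
print("Total sentences:", total_sentences)
print(f"Total minutes: {total_seconds / 60:.1f}")

# Mean sentence duration per speaker, derived from the table only.
for name, (time, count) in speakers.items():
    print(f"{name}: {to_seconds(time) / count:.2f} s per sentence")
```

The per-speaker means are derived values, not measurements of individual clips. Actual clip lengths would have to be read from the audio arrays.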
### Data Instances
An example instance is given below:
```python
{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
 'speaker': 'Sonia',
 'audio': {'path': None,
           'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604, 0.00091553,
                           0.00094604]),
           'sampling_rate': 48000}}
```
### Data Fields
- **text** (str): Transcription of the audio file
- **speaker** (str): The name of the speaker
- **audio** (dict): Audio object containing the decoded audio array, the sampling rate, and the path to the audio file (always `None`)
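As a rough illustration of how these fields might be used, the sketch below loads the corpus with the Hugging Face `datasets` library and computes a clip's duration from its `audio` field. It assumes that `datasets` is installed and that audio is decoded to a dict with `array` and `sampling_rate` keys, as in the instance shown above. Newer library versions may return a different audio object. The helper names `load_gmasc`, `clip_seconds`, and `has_expected_fields` are hypothetical, and the demo record is a silent stand-in rather than real data.

```python
def load_gmasc(sampling_rate=None):
    """Load the train split, optionally resampling audio on access.

    Requires the `datasets` package and network access, so the import
    is deferred until the function is actually called.
    """
    from datasets import Audio, load_dataset

    dataset = load_dataset("thennal/GMaSC", split="train")
    if sampling_rate is not None:
        # For example, 16000 for models that expect 16 kHz input.
        dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
    return dataset


def clip_seconds(example):
    """Duration of one instance: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def has_expected_fields(example):
    """Check that an instance has the three documented fields."""
    return (
        isinstance(example.get("text"), str)
        and isinstance(example.get("speaker"), str)
        and isinstance(example.get("audio"), dict)
        and {"array", "sampling_rate"} <= example["audio"].keys()
    )


# Stand-in record (assumption): two seconds of silence at 48 kHz.
demo = {
    "text": "placeholder transcription",
    "speaker": "Sonia",
    "audio": {"path": None, "array": [0.0] * 96000, "sampling_rate": 48000},
}

print(has_expected_fields(demo))
print(clip_seconds(demo))
```

With the real corpus, `clip_seconds(load_gmasc()[0])` would give the duration of the first instance. Passing `sampling_rate=16000` to `load_gmasc` is one way to resample for speech recognition models that expect 16 kHz input.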
### Data Splits
We provide all the data in a single `train` split. The loaded dataset object thus looks like this:
```text
DatasetDict({
    train: Dataset({
        features: ['text', 'speaker', 'audio'],
        num_rows: 2000
    })
})
```
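Because only a `train` split is provided, anyone who needs a held-out set has to create one. The sketch below shows one possible approach, not a split recommended by the corpus authors: it holds out the same fraction of sentences from each speaker. The function name `split_by_speaker`, the 10% fraction, and the seed are all assumptions, and the demo speaker column is a stand-in whose ordering may not match the real dataset.

```python
import random
from collections import defaultdict


def split_by_speaker(speakers, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), stratified by speaker.

    `speakers` is a sequence of speaker names, one per instance.
    """
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(speakers):
        by_speaker[speaker].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_indices, test_indices = [], []
    for speaker in sorted(by_speaker):
        indices = by_speaker[speaker]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Stand-in speaker column (assumption): 1,000 sentences per speaker.
speaker_column = ["Sonia"] * 1000 + ["Anil"] * 1000
train_indices, test_indices = split_by_speaker(speaker_column)

print(len(train_indices), len(test_indices))
```

With a dataset loaded through the `datasets` library, the index lists could be passed to `Dataset.select` to build the two subsets.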
## Additional Information
### Licensing
The corpus is made available under the [Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
Provider:
thennal
Summary of Original Information
GEC Barton Hill Malayalam Speech Corpus: Dataset Overview
Basic Dataset Information
- Name: GEC Barton Hill Malayalam Speech Corpus
- Abbreviation: GMaSC
- Language: Malayalam
- Multilinguality: monolingual
- Dataset size: 1,000 < n < 10,000 instances
- Annotation creators: expert-generated
- License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)
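Dataset Contents
- Number of instances: 2,000
- Features:
  - text (string): transcription of the audio file
  - speaker (string): name of the speaker
  - audio (dict): contains the audio array, sampling rate, and audio path (always None)
- Audio properties:
  - Sampling rate: 48,000 Hz
- Speaker information:
  - Sonia (female, 43): 1,000 sentences, 1 hour 2 minutes 17 seconds
  - Anil (male, 48): 1,000 sentences, 1 hour 17 minutes 23 seconds
  - Total: 2,000 sentences, 2 hours 19 minutes 40 seconds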
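Dataset Structure
- Data splits: a single training split (`train`)
- Dataset object structure: `DatasetDict({ train: Dataset({ features: ['text', 'speaker', 'audio'], num_rows: 2000 }) })`
Task Categories
- Text-to-speech
- Automatic speech recognition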



