
thennal/GMaSC

Source: Hugging Face · Updated: 2023-05-01 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/thennal/GMaSC
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: speaker
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  splits:
  - name: train
    num_bytes: 717976082.0
    num_examples: 2000
  download_size: 797772747
  dataset_size: 717976082.0
annotations_creators:
- expert-generated
language:
- ml
language_creators:
- found
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: GEC Barton Hill Malayalam Speech Corpus
size_categories:
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-to-speech
- automatic-speech-recognition
task_ids: []
---

# GMaSC: GEC Barton Hill Malayalam Speech Corpus

**GMaSC** is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence has at least one English word common in Malayalam speech.

## Dataset Structure

The dataset consists of 2,000 instances with fields `text`, `speaker`, and `audio`. The audio is mono, sampled at 48 kHz. The transcription is normalized and only includes Malayalam characters and common punctuation.

The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:

| Speaker | Gender | Age | Time (HH:MM:SS) | Sentences |
| --- | --- | --- | --- | --- |
| Sonia | Female | 43 | 01:02:17 | 1,000 |
| Anil | Male | 48 | 01:17:23 | 1,000 |
| **Total** | | | **02:19:40** | **2,000** |

### Data Instances

An example instance is given below:

```python
{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
 'speaker': 'Sonia',
 'audio': {'path': None,
           'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604, 0.00091553, 0.00094604]),
           'sampling_rate': 48000}}
```

### Data Fields

- **text** (str): Transcription of the audio file
- **speaker** (str): The name of the speaker
- **audio** (dict): Audio object containing the loaded audio array, the sampling rate, and the path to the audio file (always `None`)

### Data Splits

We provide all the data in a single `train` split. The loaded dataset object thus looks like this:

```python
DatasetDict({
    train: Dataset({
        features: ['text', 'speaker', 'audio'],
        num_rows: 2000
    })
})
```

## Additional Information

### Licensing

The corpus is made available under the [Creative Commons license (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
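The instance layout described under Data Instances can be explored with a small helper. The following is a minimal sketch, not part of the original card: `clip_duration_seconds` and `load_gmasc` are hypothetical names introduced here, and the `example` record is a synthetic stand-in that only mimics the documented keys. `load_gmasc` assumes the Hugging Face `datasets` package is installed and that the repository is reachable under the ID `thennal/GMaSC`.

```python
from typing import Any, Dict, Sequence


def clip_duration_seconds(audio: Dict[str, Any]) -> float:
    """Return the clip length in seconds from a decoded `audio` field."""
    samples: Sequence[float] = audio["array"]
    return len(samples) / audio["sampling_rate"]


def load_gmasc():
    """Load the single `train` split (needs the `datasets` package and network access)."""
    # Imported lazily so the helper above works without any third-party dependency.
    from datasets import load_dataset

    return load_dataset("thennal/GMaSC", split="train")


# Synthetic stand-in with the same keys as a real instance:
# 72,000 zero samples at 48 kHz, i.e. 1.5 s of silence.
example = {
    "text": "placeholder transcription",
    "speaker": "Sonia",
    "audio": {"path": None, "array": [0.0] * 72000, "sampling_rate": 48000},
}

print(clip_duration_seconds(example["audio"]))  # 1.5
```

With the real data, the same helper could be applied to `load_gmasc()[0]["audio"]`, since the card states that each `audio` field carries the decoded array and its sampling rate.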
Provider:
thennal
Summary of original information

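GEC Barton Hill Malayalam Speech Corpus: Dataset Overview

Basic Dataset Information

  • Name: GEC Barton Hill Malayalam Speech Corpus
  • Abbreviation: GMaSC
  • Language: Malayalam
  • Multilinguality: monolingual
  • Dataset size: 1,000 < n < 10,000 instances
  • Annotation creators: expert-generated
  • License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)

Dataset Contents

  • Number of instances: 2,000
  • Features:
    • text (string): transcription of the audio file
    • speaker (string): name of the speaker
    • audio (dict): contains the audio array, the sampling rate, and the audio path (always None)
  • Audio properties:
    • Sampling rate: 48,000 Hz
  • Speaker information:
    • Sonia (female, age 43): 1,000 sentences, duration 1 h 2 min 17 s
    • Anil (male, age 48): 1,000 sentences, duration 1 h 17 min 23 s
    • Total: 2,000 sentences, duration 2 h 19 min 40 s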
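The speaker durations listed above can be cross-checked with a few lines of arithmetic. This is an illustrative sketch only: the helper names `to_seconds` and `to_hhmmss` are introduced here, and the numbers are copied from the speaker table.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hhmmss(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Values copied from the speaker table.
durations = {"Sonia": "01:02:17", "Anil": "01:17:23"}
sentences = {"Sonia": 1000, "Anil": 1000}

total_seconds = sum(to_seconds(value) for value in durations.values())

print(to_hhmmss(total_seconds))                            # 02:19:40
print(round(total_seconds / 60, 1))                        # 139.7 (minutes)
print(round(total_seconds / sum(sentences.values()), 2))   # 4.19 (seconds per sentence)
```

The total matches the 02:19:40 reported in the table, which is consistent with the "approximately 139 minutes" quoted in the corpus description.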

Dataset Structure

  • Data splits: a single training split (train)
  • Dataset object structure: DatasetDict({ train: Dataset({ features: ['text', 'speaker', 'audio'], num_rows: 2000 }) })
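Because only a `train` split is provided, users who need a held-out set have to create one themselves. The sketch below shows one possible approach, not something prescribed by the dataset authors: it holds out the same fraction of rows per speaker using only the `speaker` column. The function name `split_by_speaker`, the 10% fraction, and the seed are all assumptions made for illustration, and the `speakers` list is a synthetic stand-in for the real column.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_speaker(
    speakers: List[str], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), holding out the same fraction per speaker."""
    by_speaker: Dict[str, List[int]] = defaultdict(list)
    for index, speaker in enumerate(speakers):
        by_speaker[speaker].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train: List[int] = []
    test: List[int] = []
    for indices in by_speaker.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Stand-in for the `speaker` column: 1,000 rows per speaker, as in the corpus.
# The real row order is not stated in the card, so this ordering is an assumption.
speakers = ["Sonia"] * 1000 + ["Anil"] * 1000

train_idx, test_idx = split_by_speaker(speakers)
print(len(train_idx), len(test_idx))  # 1800 200
```

With the Hugging Face `datasets` library, the resulting index lists could then be passed to `Dataset.select` to materialize the two subsets.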

Task Categories

  • Text-to-speech
  • Automatic speech recognition