
dash8x/dv-presidential-speech

Source: Hugging Face. Updated 2023-07-19. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dash8x/dv-presidential-speech
Resource description:
---
license: apache-2.0
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- dv
tags:
- audio
- dhivehi
- yag
- speech
- president
- political
size_categories:
- 1K<n<10K
---

# Dataset Card for Dhivehi Presidential Speech 1.0

### Dataset Summary

Dhivehi Presidential Speech is a Dhivehi speech dataset created from data extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).

The dataset contains around 2.5 hrs (1 GB) of speech collected from Maldives President's Office consisting of 7 speeches given by President Yaameen Abdhul Gayyoom.

### Supported Tasks and Leaderboards

- Automatic Speech Recognition
- Text-to-Speech

### Languages

Dhivehi

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file and its sentence.

```json
{
  'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
  'sentence': 'އަދި އަޅުގަނޑުމެންގެ ސަރަޙައްދުގައިވެސް މިކަހަލަ ބޭބޭފުޅުން',
  'audio': {
    'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
    'sampling_rate': 16000
  },
}
```

### Data Fields

- path (string): The path to the audio file.
- sentence (string): The transcription for the audio file.
- audio (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, i.e. `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.
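The note on the audio column can be illustrated with a minimal sketch. It assumes the Hugging Face `datasets` library is installed, that the repository id matches the page title (`dash8x/dv-presidential-speech`), and that a decoded row has the dict layout shown in the data instance above. The helper names `first_sample_audio` and `load_split` are hypothetical and exist only for this example.

```python
def first_sample_audio(dataset, index=0):
    """Return (array, sampling_rate) for one row, decoding only that row."""
    # Index the row first, then the "audio" column, so only this file is
    # decoded. dataset["audio"][index] would decode the whole column first.
    audio = dataset[index]["audio"]
    return audio["array"], audio["sampling_rate"]


def load_split(split="train"):
    """Load one split from the Hub (assumed repository id, needs network)."""
    # Imported lazily so the helper above stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("dash8x/dv-presidential-speech", split=split)


# Example usage (downloads data, so it is left commented out):
# ds = load_split("train")
# array, rate = first_sample_audio(ds)
# print(len(array) / rate, "seconds at", rate, "Hz")
```

Keeping the row lookup inside one small function makes it harder to fall back to the slower column-first access pattern by accident.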
### Data Splits

The speech material has been subdivided into portions for train, test and validation. The test clips were generated from a speech not in the train split. For the validation split, there is a slight overlap of 1 speech in the train set.

|            | Train    | Validation | Test  |
| ---------- | -------- | ---------- | ----- |
| Speakers   | 1        | 1          | 1     |
| Utterances | 1612     | 200        | 200   |
| Duration   | 02:14:59 | 17:02      | 13:30 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
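The figures in the split table can be cross-checked with a little arithmetic. The sketch below uses only the utterance counts and durations listed in the table. The names `to_seconds`, `SPLITS`, and `summarize` are hypothetical helpers for this example and are not part of the dataset.

```python
def to_seconds(duration):
    """Convert 'HH:MM:SS' or 'MM:SS' into a number of seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Values copied from the Data Splits table.
SPLITS = {
    "train": {"utterances": 1612, "duration": "02:14:59"},
    "validation": {"utterances": 200, "duration": "17:02"},
    "test": {"utterances": 200, "duration": "13:30"},
}


def summarize(splits):
    """Return total utterances, total seconds, and mean clip length per split."""
    total_utterances = sum(s["utterances"] for s in splits.values())
    total_seconds = sum(to_seconds(s["duration"]) for s in splits.values())
    mean_clip_seconds = {
        name: to_seconds(s["duration"]) / s["utterances"]
        for name, s in splits.items()
    }
    return total_utterances, total_seconds, mean_clip_seconds


total_utterances, total_seconds, mean_clip_seconds = summarize(SPLITS)
print(total_utterances, total_seconds, mean_clip_seconds)
```

With the table's values this gives 2,012 utterances and 9,931 seconds of audio across the three splits, with a mean clip length of roughly 5 seconds in the train split.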
Provider:
dash8x
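Summary of original information

Dataset overview

Dataset name

Dhivehi Presidential Speech 1.0

Dataset summary

Dhivehi Presidential Speech is a Dhivehi speech dataset created from data extracted and processed by Sofwath. It contains 7 speeches by President Yaameen Abdhul Gayyoom, collected from the Maldives President's Office, for a total of about 2.5 hours (1 GB) of audio.

Supported tasks

  • Automatic speech recognition
  • Text-to-speech

Language

Dhivehi

Dataset structure

Data instance

Each data point contains the path to the audio file and its corresponding sentence.

    {
      'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
      'sentence': 'އަދި އަޅުގަނޑުމެންގެ ސަރަޙައްދުގައިވެސް މިކަހަލަ ބޭބޭފުޅުން',
      'audio': {
        'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
        'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
        'sampling_rate': 16000
      },
    }

Data fields

  • path (string): The path to the audio file.
  • sentence (string): The transcription of the audio file.
  • audio (dict): Contains the path to the downloaded audio file, the decoded audio array, and the sampling rate.

Data splits

The dataset is divided into training, validation, and test sets. The test set comes from a speech that is not included in the training set; the validation set overlaps slightly with the training set.

|            | Train    | Validation | Test  |
| ---------- | -------- | ---------- | ----- |
| Speakers   | 1        | 1          | 1     |
| Utterances | 1612     | 200        | 200   |
| Duration   | 02:14:59 | 17:02      | 13:30 |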
Collected summary
Dataset introduction
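Background and challenges
Background overview
Dhivehi Presidential Speech is a Dhivehi-language speech dataset containing about 2.5 hours (1 GB) of speech audio from the Maldives President's Office, made up of 7 speeches by President Yaameen Abdhul Gayyoom. It supports automatic speech recognition and text-to-speech tasks. The data is split into training, validation, and test sets with 2,012 utterances in total, and it is suited to research on low-resource language processing.
The content above was collected and summarized by 遇见数据集.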