dash8x/dv-presidential-speech

Name: dash8x/dv-presidential-speech
Creator: dash8x
Published: 2023-07-19 01:24:44
License: apache-2.0 (as declared in the dataset card metadata; the listing itself gives no description)

Source: Hugging Face. Updated 2023-07-19; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/dash8x/dv-presidential-speech

Resource description:

---
license: apache-2.0
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- dv
tags:
- audio
- dhivehi
- yag
- speech
- president
- political
size_categories:
- 1K<n<10K
---

# Dataset Card for Dhivehi Presidential Speech 1.0

### Dataset Summary

Dhivehi Presidential Speech is a Dhivehi speech dataset created from data extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).

The dataset contains around 2.5 hrs (1 GB) of speech collected from Maldives President's Office consisting of 7 speeches given by President Yaameen Abdhul Gayyoom.

### Supported Tasks and Leaderboards

- Automatic Speech Recognition
- Text-to-Speech

### Languages

Dhivehi

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file and its sentence.

```json
{
  'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
  'sentence': 'އަދި އަޅުގަނޑުމެންގެ ސަރަޙައްދުގައިވެސް މިކަހަލަ ބޭބޭފުޅުން',
  'audio': {
    'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
    'sampling_rate': 16000
  },
}
```

### Data Fields

- `path` (string): The path to the audio file.
- `sentence` (string): The transcription for the audio file.
- `audio` (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, i.e. `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]`.
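The access-order advice in the `audio` field note can be illustrated with a toy model. The sketch below is not the `datasets` library's implementation; `ToyAudioDataset`, its decode counter, and the file names are hypothetical and only mimic the behavior the note describes: row-first access decodes one clip, column-first access decodes every clip before indexing.

```python
# Toy model only: mimics the decode-on-access behavior described in the note.
# It is NOT the real `datasets` internals; all names here are hypothetical.
class ToyAudioDataset:
    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many clips have been "decoded" so far

    def _decode(self, path):
        # Stand-in for reading and resampling an audio file.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested clip is decoded.
            path = self.paths[key]
            return {"path": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every clip is decoded before the caller indexes.
            return [self._decode(p) for p in self.paths]
        if key == "path":
            return list(self.paths)
        raise KeyError(key)


# Hypothetical file names; 1612 matches the size of the train split.
paths = [f"waves/clip_{i}.wav" for i in range(1612)]

row_first = ToyAudioDataset(paths)
_ = row_first[0]["audio"]        # preferred order: index, then column
row_first_calls = row_first.decode_calls

column_first = ToyAudioDataset(paths)
_ = column_first["audio"][0]     # discouraged order: column, then index
column_first_calls = column_first.decode_calls

print(row_first_calls, column_first_calls)
```

In this toy model the row-first order triggers a single decode, while the column-first order triggers one decode per clip in the split, which is the cost the note warns about.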
### Data Splits

The speech material has been subdivided into portions for train, test and validation. The test clips were generated from a speech not in the train split. For the validation split, there is a slight overlap of 1 speech in the train set.

|            | Train    | Validation | Test  |
| ---------- | -------- | ---------- | ----- |
| Speakers   | 1        | 1          | 1     |
| Utterances | 1612     | 200        | 200   |
| Duration   | 02:14:59 | 17:02      | 13:30 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]

Provider:

dash8x

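Original Information Summary

Dataset Overview

Dataset Name

Dhivehi Presidential Speech 1.0

Dataset Summary

Dhivehi Presidential Speech is a Dhivehi speech dataset built from data extracted and processed by Sofwath. It contains 7 speeches by President Yaameen Abdhul Gayyoom collected from the Maldives President's Office, about 2.5 hours (1 GB) in total.

Supported Tasks

Automatic speech recognition
Text-to-speech

Language

Dhivehi

Dataset Structure

Data Instances

Each data point contains the path to an audio file and its corresponding sentence.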

{
  'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
  'sentence': 'އަދި އަޅުގަނޑުމެންގެ ސަރަޙައްދުގައިވެސް މިކަހަލަ ބޭބޭފުޅުން',
  'audio': {
    'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
    'sampling_rate': 16000
  },
}

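Data Fields

path (string): The path to the audio file.
sentence (string): The transcription of the audio file.
audio (dict): Contains the path to the downloaded audio file, the decoded audio array, and the sampling rate.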

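Data Splits

The dataset is divided into train, validation, and test sets. The test set comes from a speech that is not in the train set; the validation set overlaps slightly with the train set (one speech).

	Train	Validation	Test
Speakers	1	1	1
Utterances	1612	200	200
Duration	02:14:59	17:02	13:30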
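The figures in the split table can be totaled with a few lines of arithmetic. The sketch below is only an illustration: the helper `to_seconds` is a hypothetical name, and it assumes that two-part durations such as 17:02 are MM:SS while three-part durations are HH:MM:SS.

```python
def to_seconds(stamp: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' to seconds (assumed formats)."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Values copied from the split table.
splits = {
    "train": {"utterances": 1612, "duration": "02:14:59"},
    "validation": {"utterances": 200, "duration": "17:02"},
    "test": {"utterances": 200, "duration": "13:30"},
}

total_utterances = sum(s["utterances"] for s in splits.values())
total_seconds = sum(to_seconds(s["duration"]) for s in splits.values())

# Average clip length per split, in seconds.
mean_clip_seconds = {
    name: to_seconds(s["duration"]) / s["utterances"]
    for name, s in splits.items()
}

print(total_utterances)                    # 2012
print(total_seconds)                       # 9931, i.e. 2 h 45 min 31 s
print(round(mean_clip_seconds["train"], 2))
```

Under these assumptions the three splits sum to 2,012 utterances and roughly 2 h 46 min of audio, somewhat more than the "around 2.5 hrs" quoted in the summary, with clips averaging about 4 to 5 seconds each.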

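Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

Dhivehi Presidential Speech is a Dhivehi speech dataset containing about 2.5 hours (1 GB) of audio from the Maldives President's Office, made up of 7 speeches by President Yaameen Abdhul Gayyoom. The dataset supports automatic speech recognition and text-to-speech tasks. It is split into train, validation, and test sets with 2,012 utterances in total (1,612 + 200 + 200), and is suited to research on low-resource language processing.

The content above was collected and summarized by 遇见数据集.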
