dash8x/dv-presidential-speech
Source: Hugging Face. Updated 2023-07-19; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dash8x/dv-presidential-speech
Resource description:
---
license: apache-2.0
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- dv
tags:
- audio
- dhivehi
- yag
- speech
- president
- political
size_categories:
- 1K<n<10K
---
# Dataset Card for Dhivehi Presidential Speech 1.0
### Dataset Summary
Dhivehi Presidential Speech is a Dhivehi speech dataset created from data extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).
The dataset contains around 2.5 hours (1 GB) of speech collected from the Maldives President's Office, consisting of 7 speeches given by President Yaameen Abdhul Gayyoom.
### Supported Tasks and Leaderboards
- Automatic Speech Recognition
- Text-to-Speech
### Languages
Dhivehi
## Dataset Structure
### Data Instances
A typical data point comprises the path to the audio file, its transcription (`sentence`), and the decoded audio.
```python
{
    'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
    'sentence': 'އަދި އަޅުގަނޑުމެންގެ ސަރަޙައްދުގައިވެސް މިކަހަލަ ބޭބޭފުޅުން',
    'audio': {
        'path': 'dv-presidential-speech-train/waves/YAG2_77.wav',
        'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
        'sampling_rate': 16000
    },
}
```
### Data Fields
- `path` (string): The path to the audio file.
- `sentence` (string): The transcription for the audio file.
- `audio` (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Accessing the audio column through `dataset[0]["audio"]` automatically decodes the file and resamples it to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so query the sample index before the `"audio"` column: always prefer `dataset[0]["audio"]` over `dataset["audio"][0]`.
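The row-first access pattern described for the `audio` field can be sketched as follows. This is a minimal illustration, not part of the original card: the helper names (`load_split`, `clip_duration_seconds`, `describe_first_clip`) are hypothetical, the split name `"train"` is assumed to match the split listed in the card, and the code assumes a version of the `datasets` library that decodes audio into a dict with `array` and `sampling_rate` keys, as in the data instance shown in the card.

```python
def load_split(split="train"):
    """Load one split from the Hugging Face Hub.

    Needs the `datasets` library and network access. The split name is an
    assumption based on the splits listed in the card.
    """
    from datasets import load_dataset  # lazy import: the helpers below work without it
    return load_dataset("dash8x/dv-presidential-speech", split=split)


def clip_duration_seconds(example):
    """Duration of one decoded example: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def describe_first_clip(dataset):
    """Return (sentence, duration in seconds) for the first example."""
    # Index the row first, then the column, so only this one clip is decoded.
    # dataset["audio"][0] would decode every clip before picking the first.
    example = dataset[0]
    return example["sentence"], clip_duration_seconds(example)


if __name__ == "__main__":
    train = load_split("train")
    print(describe_first_clip(train))
```

The helpers only rely on indexing, so they can be tried on a plain list of dicts shaped like the data instance before downloading anything.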
### Data Splits
The speech material has been subdivided into train, validation, and test portions. The test clips were generated from a speech that does not appear in the train split. The validation split overlaps slightly with the train split: one speech contributes clips to both.
| | Train | Validation | Test |
| ---------------- | -------- | ---------- | ----- |
| Speakers | 1 | 1 | 1 |
| Utterances | 1612 | 200 | 200 |
| Duration | 02:14:59 | 17:02 | 13:30 |
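As a quick sanity check on the table, the totals and the average clip length per split can be worked out directly. The sketch below copies the figures from the table and assumes that the two-part durations (`17:02`, `13:30`) are minutes and seconds; the helper names are illustrative only.

```python
def to_seconds(stamp):
    """Convert 'HH:MM:SS' or 'MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_stamp(seconds):
    """Format a number of seconds as 'HH:MM:SS'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Figures copied from the Data Splits table.
SPLITS = {
    "train": {"utterances": 1612, "duration": "02:14:59"},
    "validation": {"utterances": 200, "duration": "17:02"},  # assumed MM:SS
    "test": {"utterances": 200, "duration": "13:30"},        # assumed MM:SS
}

total_utterances = sum(s["utterances"] for s in SPLITS.values())
total_seconds = sum(to_seconds(s["duration"]) for s in SPLITS.values())
mean_clip_seconds = {
    name: to_seconds(s["duration"]) / s["utterances"] for name, s in SPLITS.items()
}

print(total_utterances)         # 2012
print(to_stamp(total_seconds))  # 02:45:31
for name, mean in mean_clip_seconds.items():
    print(name, round(mean, 2))  # train 5.02, validation 5.11, test 4.05
```

Under these assumptions the three splits add up to 2,012 utterances and about 2 hours 45 minutes, with clips averaging roughly 4 to 5 seconds. That total is somewhat higher than the "around 2.5" hours quoted in the summary; the card does not say how the summary figure was computed.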
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
Extracted and processed by [Sofwath](https://github.com/Sofwath) as part of a collection of Dhivehi datasets found [here](https://github.com/Sofwath/DhivehiDatasets).
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
Provider:
dash8x
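Background overview
Dhivehi Presidential Speech is a Dhivehi speech dataset containing about 2.5 hours (1 GB) of speech audio from the Maldives President's Office, made up of 7 speeches by President Yaameen Abdhul Gayyoom. It supports automatic speech recognition and text-to-speech tasks. The data is split into train, validation, and test sets with 2,012 utterances in total, and is suited to research on low-resource language processing.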