alexandrainst/ftspeech
Source: Hugging Face. Updated 2024-09-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/alexandrainst/ftspeech
Resource description:
---
dataset_info:
  features:
  - name: utterance_id
    dtype: string
  - name: speaker_gender
    dtype: string
  - name: sentence
    dtype: string
  - name: speaker_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209434570129.268
    num_examples: 995677
  - name: dev_balanced
    num_bytes: 579692770.829
    num_examples: 2601
  - name: dev_other
    num_bytes: 1725502342.095
    num_examples: 7595
  - name: test_balanced
    num_bytes: 1158740779.222
    num_examples: 5534
  - name: test_other
    num_bytes: 1254987645.527
    num_examples: 5837
  download_size: 101776974871
  dataset_size: 214153493666.941
task_categories:
- automatic-speech-recognition
language:
- da
pretty_name: FT Speech
size_categories:
- 100K<n<1M
license: other
---
# Dataset Card for FT Speech
## Dataset Description
- **Repository:** <https://ftspeech.github.io/>
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 101.78 GB
- **Size of the generated dataset:** 214.15 GB
- **Total amount of disk used:** 315.93 GB
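The three sizes above follow directly from the byte counts in the YAML metadata. The sketch below is only a sanity check of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes); the constant and function names are illustrative and not part of the dataset.

```python
# Byte counts copied from the card's YAML metadata.
DOWNLOAD_SIZE = 101_776_974_871
SPLIT_BYTES = {
    "train": 209_434_570_129.268,
    "dev_balanced": 579_692_770.829,
    "dev_other": 1_725_502_342.095,
    "test_balanced": 1_158_740_779.222,
    "test_other": 1_254_987_645.527,
}


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes), rounded to 2 places."""
    return round(num_bytes / 1e9, 2)


# The generated dataset size is the sum of the per-split sizes.
generated_size = sum(SPLIT_BYTES.values())
# Total disk use is the downloaded archive plus the generated dataset.
total_disk = DOWNLOAD_SIZE + generated_size

print(to_gb(DOWNLOAD_SIZE), to_gb(generated_size), to_gb(total_disk))
```

Since the download and the generated dataset both occupy disk at the same time, roughly 316 GB of free space is needed for a full, non-streaming load.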
### Dataset Summary
This dataset is an upload of the [FT Speech dataset](https://ftspeech.github.io/).
The training, validation and test splits are the original ones.
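As a minimal sketch of how the splits might be loaded, the helper below wraps `load_dataset` from the Hugging Face `datasets` library, using the repository name and split names from the metadata above. The helper name `load_ftspeech` is hypothetical, streaming is chosen here only to avoid the full download, and whether access requires authentication or accepting terms on the Hub has not been verified.

```python
# Split names as listed in the card's metadata.
SPLITS = ("train", "dev_balanced", "dev_other", "test_balanced", "test_other")


def load_ftspeech(split: str, streaming: bool = True):
    """Load one FT Speech split from the Hugging Face Hub.

    With streaming=True, examples are fetched lazily instead of
    downloading the whole archive (about 100 GB) up front.
    """
    if split not in SPLITS:
        raise ValueError(f"Unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the split check works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("alexandrainst/ftspeech", split=split, streaming=streaming)
```

A call such as `next(iter(load_ftspeech("dev_balanced")))` would then yield a single example without fetching the training data.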
### Supported Tasks and Leaderboards
The dataset is intended for training automatic speech recognition (ASR) models. There is currently no active leaderboard.
### Languages
The dataset is available in Danish (`da`).
## Dataset Structure
### Data Instances
An example from the dataset looks as follows.
```
{
    'utterance_id': 'S001_20151_M012_P00034-2',
    'speaker_gender': 'F',
    'sentence': 'alle de fem tekniske justeringer der er en del af lovforslaget',
    'speaker_id': 'S001',
    'audio': {
        'path': 'S001_20151_M012_P00034-2.wav',
        'array': array([-3.75366211e-03, -5.27954102e-03, -3.87573242e-03, ...,
                         9.15527344e-05, -1.52587891e-04,  5.79833984e-04]),
        'sampling_rate': 16000
    }
}
```
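Given the layout shown above, the duration of a clip is the number of samples in `audio['array']` divided by `audio['sampling_rate']`. The sketch below assumes an example decoded into exactly that dict layout (other versions of the `datasets` library may return a different audio object), and it uses a synthetic example rather than real data; `duration_seconds` is an illustrative helper, not part of the dataset.

```python
def duration_seconds(example: dict) -> float:
    """Clip length in seconds: number of samples divided by the sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a decoded example: 32,000 samples of silence at 16 kHz.
fake_example = {
    "utterance_id": "demo",
    "audio": {
        "path": "demo.wav",
        "array": [0.0] * 32000,
        "sampling_rate": 16000,
    },
}

print(duration_seconds(fake_example))  # 2.0
```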
### Data Fields
The data fields are the same among all splits.
- `utterance_id`: a `string` feature.
- `speaker_gender`: a `string` feature.
- `sentence`: a `string` feature.
- `speaker_id`: a `string` feature.
- `audio`: an `Audio` feature.
### Dataset Statistics
There are 995,677 samples in the `train` split, 2,601 in the `dev_balanced` split, 7,595 in the `dev_other` split, 5,534 in the `test_balanced` split and 5,837 in the `test_other` split.
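To put these counts in proportion, the short sketch below sums them and prints each split's share of the total. The numbers are copied from the statistics above; the variable names are illustrative.

```python
# Example counts per split, as stated in the dataset statistics.
SPLIT_EXAMPLES = {
    "train": 995_677,
    "dev_balanced": 2_601,
    "dev_other": 7_595,
    "test_balanced": 5_534,
    "test_other": 5_837,
}

total_examples = sum(SPLIT_EXAMPLES.values())

# Print each split's count and its share of all utterances.
for name, count in SPLIT_EXAMPLES.items():
    print(f"{name:>13}: {count:>7,} ({count / total_examples:.2%})")
print(f"{'total':>13}: {total_examples:>7,}")
```

The training split holds close to 98% of all utterances, so the four evaluation splits together account for only about 2%.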
#### Speakers
There are 374 unique speakers in the training dataset, 20 unique speakers in the validation dataset and 40 unique speakers in the test dataset. None of the dataset splits share any speakers.
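The claim that no splits share speakers can be checked by intersecting the sets of `speaker_id` values per split. The following is a minimal sketch on toy data, since collecting the real IDs requires loading the dataset; `shared_speakers` and the toy IDs are hypothetical.

```python
from itertools import combinations


def shared_speakers(
    speakers_by_split: dict[str, set[str]],
) -> dict[tuple[str, str], set[str]]:
    """Return speaker IDs that occur in more than one split, keyed by split pair."""
    overlaps = {}
    for (a, ids_a), (b, ids_b) in combinations(speakers_by_split.items(), 2):
        common = ids_a & ids_b
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Toy data with one deliberate overlap, to show what a violation looks like.
toy = {
    "train": {"S001", "S002"},
    "dev": {"S003"},
    "test": {"S004", "S002"},
}
print(shared_speakers(toy))  # {('train', 'test'): {'S002'}}
```

An empty result for the real speaker sets would confirm that the splits are speaker-disjoint.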
#### Gender Distribution
Figure: gender distribution (`ftspeech-gender-distribution.png`).
#### Transcription Length Distribution
Figure: transcription length distribution (`ftspeech-length-distribution.png`).
## Dataset Creation
### Curation Rationale
There are not many large-scale ASR datasets in Danish.
### Source Data
The data consists of public recordings of sessions of the Danish Parliament, along with manual transcriptions.
## Additional Information
### Dataset Curators
Andreas Kirkedal, Marija Stepanović and Barbara Plank curated the dataset as part of their FT Speech paper (see citation below).
[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [the Alexandra Institute](https://alexandra.dk/) reorganised the dataset and uploaded it to the Hugging Face Hub.
### Licensing Information
The dataset is licensed under [this custom license](https://www.ft.dk/da/aktuelt/tv-fra-folketinget/deling-og-rettigheder).
### Citation
```
@inproceedings{ftspeech,
  author = {Kirkedal, Andreas and Stepanović, Marija and Plank, Barbara},
  title = {{FT Speech: Danish Parliament Speech Corpus}},
  booktitle = {Proc. Interspeech 2020},
  year = {2020},
  url = {https://arxiv.org/abs/2005.12368}
}
```
Provider:
alexandrainst
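Summary of original information
Dataset overview
Dataset name
- Name: FT Speech
- Language: Danish (`da`)
- Task category: automatic speech recognition
- License: other
- Size category: 100K<n<1M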
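Dataset structure
- Features:
  - `utterance_id`: string
  - `speaker_gender`: string
  - `sentence`: string
  - `speaker_id`: string
  - `audio`: audio, sampling rate 16000
- Splits:
  - `train`: 995,677 examples, 209,434,570,129.268 bytes
  - `dev_balanced`: 2,601 examples, 579,692,770.829 bytes
  - `dev_other`: 7,595 examples, 1,725,502,342.095 bytes
  - `test_balanced`: 5,534 examples, 1,158,740,779.222 bytes
  - `test_other`: 5,837 examples, 1,254,987,645.527 bytes
- Download size: 101.78 GB
- Dataset size: 214.15 GB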
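Dataset statistics
- Training set: 995,677 examples
- `dev_balanced`: 2,601 examples
- `dev_other`: 7,595 examples
- `test_balanced`: 5,534 examples
- `test_other`: 5,837 examples
- Speakers:
  - Training set: 374 unique speakers
  - Validation set: 20 unique speakers
  - Test set: 40 unique speakers
- Gender distribution: see image `ftspeech-gender-distribution.png`
- Transcription length distribution: see image `ftspeech-length-distribution.png`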
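Dataset creation
- Source data: public recordings of Danish Parliament sessions, with manual transcriptions
- Dataset curators: Andreas Kirkedal, Marija Stepanović, Barbara Plank
- Dataset uploader: Dan Saattrup Nielsen
License information
- License: custom license, see https://www.ft.dk/da/aktuelt/tv-fra-folketinget/deling-og-rettigheder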



