alexandrainst/ftspeech

Name: alexandrainst/ftspeech
Creator: alexandrainst
Published: 2024-09-04 15:01:21
License: no description available

Source: Hugging Face. Updated 2024-09-04; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/alexandrainst/ftspeech


Resource description:

---
dataset_info:
  features:
  - name: utterance_id
    dtype: string
  - name: speaker_gender
    dtype: string
  - name: sentence
    dtype: string
  - name: speaker_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209434570129.268
    num_examples: 995677
  - name: dev_balanced
    num_bytes: 579692770.829
    num_examples: 2601
  - name: dev_other
    num_bytes: 1725502342.095
    num_examples: 7595
  - name: test_balanced
    num_bytes: 1158740779.222
    num_examples: 5534
  - name: test_other
    num_bytes: 1254987645.527
    num_examples: 5837
  download_size: 101776974871
  dataset_size: 214153493666.941
task_categories:
- automatic-speech-recognition
language:
- da
pretty_name: FT Speech
size_categories:
- 100K<n<1M
license: other
---

# Dataset Card for FT Speech

## Dataset Description

- **Repository:** <https://ftspeech.github.io/>
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 101.78 GB
- **Size of the generated dataset:** 214.15 GB
- **Total amount of disk used:** 315.93 GB

### Dataset Summary

This dataset is an upload of the [FT Speech dataset](https://ftspeech.github.io/). The training, validation and test splits are the original ones.

### Supported Tasks and Leaderboards

Training automatic speech recognition is the intended task for this dataset. No leaderboard is active at this point.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 101.78 GB
- **Size of the generated dataset:** 214.15 GB
- **Total amount of disk used:** 315.93 GB

An example from the dataset looks as follows.
```
{
    'utterance_id': 'S001_20151_M012_P00034-2',
    'speaker_gender': 'F',
    'sentence': 'alle de fem tekniske justeringer der er en del af lovforslaget',
    'speaker_id': 'S001',
    'audio': {
        'path': 'S001_20151_M012_P00034-2.wav',
        'array': array([-3.75366211e-03, -5.27954102e-03, -3.87573242e-03, ...,
                         9.15527344e-05, -1.52587891e-04,  5.79833984e-04]),
        'sampling_rate': 16000
    }
}
```

### Data Fields

The data fields are the same among all splits.

- `utterance_id`: a `string` feature.
- `speaker_gender`: a `string` feature.
- `sentence`: a `string` feature.
- `speaker_id`: a `string` feature.
- `audio`: an `Audio` feature.

### Dataset Statistics

There are 995,677 samples in the training split, 2,601 in the dev_balanced split, 7,595 in the dev_other split, 5,534 in the test_balanced split and 5,837 in the test_other split.

#### Speakers

There are 374 unique speakers in the training dataset, 20 unique speakers in the validation dataset and 40 unique speakers in the test dataset. None of the dataset splits share any speakers.

#### Gender Distribution

![ftspeech-gender-distribution.png](https://cdn-uploads.huggingface.co/production/uploads/60d368a613f774189902f555/0h_L7-riNfQbKFdYWgy01.png)

#### Transcription Length Distribution

![ftspeech-length-distribution.png](https://cdn-uploads.huggingface.co/production/uploads/60d368a613f774189902f555/z1MqsvACrY_8XNXAx0UcD.png)

## Dataset Creation

### Curation Rationale

There are not many large-scale ASR datasets in Danish.

### Source Data

The data consists of public recordings of sessions from the Danish Parliament, along with manual transcriptions.

## Additional Information

### Dataset Curators

Andreas Kirkedal, Marija Stepanović and Barbara Plank curated the dataset as part of their FT Speech paper (see citation below). [Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [The Alexandra Institute](https://alexandra.dk/) reorganised the dataset and uploaded it to the Hugging Face Hub.

### Licensing Information

The dataset is licensed under [this custom license](https://www.ft.dk/da/aktuelt/tv-fra-folketinget/deling-og-rettigheder).

### Citation

```
@inproceedings{ftspeech,
    author = {Kirkedal, Andreas and Stepanović, Marija and Plank, Barbara},
    title = {{FT Speech: Danish Parliament Speech Corpus}},
    booktitle = {Proc. Interspeech 2020},
    year = {2020},
    url = {arxiv.org/abs/2005.12368}
}
```
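
Since the download is over 100 GB, it may be convenient to stream a split instead of fetching everything up front. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository is accessible from your account (the custom license may mean you have to accept terms or authenticate first). The helper name `stream_split` is hypothetical and not part of the dataset or the library.

```python
REPO_ID = "alexandrainst/ftspeech"

# Split names as listed in the dataset card metadata.
SPLITS = ("train", "dev_balanced", "dev_other", "test_balanced", "test_other")


def stream_split(split):
    """Return an iterable over one split without downloading it in full.

    Hypothetical helper: it validates the split name, then defers to
    `datasets.load_dataset` in streaming mode.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")

    # Imported lazily so the split check above works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of materialising
    # the full generated dataset on disk.
    return load_dataset(REPO_ID, split=split, streaming=True)
```

Iterating over `stream_split("dev_balanced")` should then yield dictionaries with the fields described under Data Fields. Decoding the `audio` field typically requires the audio extras of `datasets` to be installed.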

Provider:

alexandrainst

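Summary of original information

Dataset overview

Dataset name

Name: FT Speech
Language: Danish (da)
Task category: automatic speech recognition
License: other
Size category: 100K<n<1M

Dataset structure

Features:
- utterance_id: string
- speaker_gender: string
- sentence: string
- speaker_id: string
- audio: audio, sampling rate 16000
Splits:
- train: 995,677 examples, 209,434,570,129.268 bytes
- dev_balanced: 2,601 examples, 579,692,770.829 bytes
- dev_other: 7,595 examples, 1,725,502,342.095 bytes
- test_balanced: 5,534 examples, 1,158,740,779.222 bytes
- test_other: 5,837 examples, 1,254,987,645.527 bytes
Download size: 101.78 GB
Dataset size: 214.15 GB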
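
The per-split figures above can be cross-checked against the reported totals. The sketch below copies the numbers from the dataset card metadata and assumes that "GB" means 10^9 bytes, which is what makes the reported 101.78 GB and 214.15 GB line up with the byte counts. Variable names are illustrative.

```python
from decimal import Decimal

# (num_examples, num_bytes) per split, copied from the dataset card metadata.
SPLIT_STATS = {
    "train": (995_677, Decimal("209434570129.268")),
    "dev_balanced": (2_601, Decimal("579692770.829")),
    "dev_other": (7_595, Decimal("1725502342.095")),
    "test_balanced": (5_534, Decimal("1158740779.222")),
    "test_other": (5_837, Decimal("1254987645.527")),
}
DOWNLOAD_SIZE = Decimal("101776974871")      # download_size in bytes
DATASET_SIZE = Decimal("214153493666.941")   # dataset_size in bytes


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (10^9 bytes), rounded to two places."""
    return (num_bytes / Decimal(10**9)).quantize(Decimal("0.01"))


total_examples = sum(n for n, _ in SPLIT_STATS.values())
total_bytes = sum(b for _, b in SPLIT_STATS.values())

print("examples across all splits:", total_examples)
print("split bytes match dataset_size:", total_bytes == DATASET_SIZE)
print("download GB:", to_gb(DOWNLOAD_SIZE))
print("generated GB:", to_gb(DATASET_SIZE))
print("total disk GB:", to_gb(DOWNLOAD_SIZE + DATASET_SIZE))
```

Using `Decimal` rather than floats keeps the fractional byte counts exact, so the equality check against `dataset_size` is meaningful.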

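Dataset statistics

Training set: 995,677 examples
dev_balanced: 2,601 examples
dev_other: 7,595 examples
test_balanced: 5,534 examples
test_other: 5,837 examples
Speakers:
- Training set: 374 unique speakers
- Validation set: 20 unique speakers
- Test set: 40 unique speakers
Gender distribution: see image ftspeech-gender-distribution.png
Transcription length distribution: see image ftspeech-length-distribution.png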
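
The dataset card states that none of the splits share any speakers. A check along these lines could confirm that, given the set of `speaker_id` values seen in each split. This is a sketch only: the function name `shared_speakers` is hypothetical, and the speaker IDs in the example are made up for illustration rather than taken from the dataset.

```python
from itertools import combinations


def shared_speakers(speakers_by_split):
    """Map each overlapping pair of splits to the speaker IDs they share.

    An empty result means the splits are speaker-disjoint.
    """
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(speakers_by_split.items(), 2):
        common = set(ids_a) & set(ids_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Toy speaker IDs for illustration only; they are not real dataset values.
toy_speakers = {
    "train": {"A01", "A02"},
    "dev": {"B01"},
    "test": {"C01", "C02"},
}
print(shared_speakers(toy_speakers))
```

On the real data, the per-split sets would be collected from the `speaker_id` column of each split before calling the function.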

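Dataset creation

Source data: public recordings of Danish Parliament sessions, with manual transcriptions
Dataset curators: Andreas Kirkedal, Marija Stepanović, Barbara Plank
Dataset uploader: Dan Saattrup Nielsen

License information

License: custom license; see https://www.ft.dk/da/aktuelt/tv-fra-folketinget/deling-og-rettigheder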

Citation information

@inproceedings{ftspeech,
    author = {Kirkedal, Andreas and Stepanović, Marija and Plank, Barbara},
    title = {{FT Speech: Danish Parliament Speech Corpus}},
    booktitle = {Proc. Interspeech 2020},
    year = {2020}
}
