scribe-project/nst_nb_test

Name: scribe-project/nst_nb_test
Creator: scribe-project
Published: 2023-04-25 10:34:10
License: CC0

Source: Hugging Face. Updated 2023-04-25; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/scribe-project/nst_nb_test


Resource description:

---
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: gender
    dtype: string
  - name: utterance_id
    dtype: string
  - name: language
    dtype: string
  - name: raw_text
    dtype: string
  - name: full_audio_file
    dtype: string
  - name: original_data_split
    dtype: string
  - name: region
    dtype: string
  - name: duration
    dtype: float64
  - name: start
    dtype: int64
  - name: end
    dtype: float64
  - name: utterance_audio_file
    dtype: audio
  - name: standardized_text
    dtype: string
  splits:
  - name: test
    num_bytes: 3046340447.0
    num_examples: 15756
  download_size: 2790946881
  dataset_size: 3046340447.0
---

# Dataset Card for NST Bokmål test (< 15 sec. segments)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** <https://github.com/scribe-project/nodalida_2023_combined_training>
- **Paper:**

```
@inproceedings{
solberg2023improving,
title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources},
author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
```

- **Point of Contact:** [Per Erik Solberg](mailto:per.solberg@nb.no)

### Dataset Summary

This is the version of the Bokmål part of the Norwegian NST dataset used for testing the models in the paper *Improving Generalization of Norwegian ASR with Limited Linguistic Resources*, presented at NoDaLiDa 2023. It contains only the test set, restricted to segments shorter than 15 seconds. For a full version of the NST, see [this repository](https://huggingface.co/datasets/NbAiLab/NST).

### Languages

Norwegian Bokmål

## Dataset Creation

### Source Data

The full version of this dataset is found in [the repository of the Norwegian Language Bank](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/).

#### Initial Data Collection and Normalization

The data was retrieved using the [Spraakbanken downloader](https://pypi.org/project/spraakbanken-downloader/) and standardized using the [combined dataset standardization scripts](https://github.com/scribe-project/asr-standardized-combined). Bokmål segments with a duration < 15 seconds were extracted using [this code](https://github.com/scribe-project/nodalida_2023_combined_training/blob/main/make_datasets/make_nst_csvs.ipynb).

## Licensing Information

[CC0](https://creativecommons.org/share-your-work/public-domain/cc0/)

### Citation Information

```
@inproceedings{
solberg2023improving,
title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources},
author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
```
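The card describes a single `test` split whose segments are all shorter than 15 seconds. A minimal sketch of how one might load that split and spot-check the duration limit is given here. It assumes the Hugging Face `datasets` library is installed, that the repository is reachable under the name shown on this page, and that `duration` is expressed in seconds (the card does not state the unit). The helper names `load_test_split` and `over_limit` are hypothetical.

```python
from typing import Any, Iterable, Iterator

DATASET_ID = "scribe-project/nst_nb_test"  # repository name as shown on this page
MAX_SECONDS = 15.0  # limit stated in the card; unit of `duration` is assumed to be seconds


def load_test_split(streaming: bool = True):
    """Load the only split ("test") with the Hugging Face `datasets` library."""
    # Imported lazily so the helper below stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="test", streaming=streaming)


def over_limit(rows: Iterable[dict[str, Any]], limit: float = MAX_SECONDS) -> Iterator[str]:
    """Yield the utterance_id of every row whose duration is not below the limit."""
    for row in rows:
        if row["duration"] >= limit:
            yield row["utterance_id"]
```

Calling `list(over_limit(load_test_split()))` would be expected to return an empty list if the card's description holds. Iterating the split may decode the audio column, which can require additional audio dependencies.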

Provider:

scribe-project

Summary of original information

Dataset overview

Dataset name

Name: NST Bokmål test (< 15 sec. segments)

Dataset features

Feature list:
- speaker_id: string
- gender: string
- utterance_id: string
- language: string
- raw_text: string
- full_audio_file: string
- original_data_split: string
- region: string
- duration: float
- start: integer
- end: float
- utterance_audio_file: audio
- standardized_text: string
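
The feature list above can be turned into a simple per-row type check. The following is a sketch only: the mapping from the listed types to Python types is an assumption, `schema_problems` is a hypothetical helper, and the audio column is deliberately skipped because its in-memory representation depends on the loader.

```python
from typing import Any

# Column name -> accepted Python types, mirroring the feature list above.
# `utterance_audio_file` is left out: how an audio value is represented
# depends on the loader, so checking it here would be guesswork.
EXPECTED_TYPES: dict[str, tuple[type, ...]] = {
    "speaker_id": (str,),
    "gender": (str,),
    "utterance_id": (str,),
    "language": (str,),
    "raw_text": (str,),
    "full_audio_file": (str,),
    "original_data_split": (str,),
    "region": (str,),
    "duration": (float,),
    "start": (int,),
    "end": (float,),
    "standardized_text": (str,),
}


def schema_problems(row: dict[str, Any]) -> list[str]:
    """Return a description of each column that is missing or has the wrong type."""
    problems = []
    for name, types in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing column: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(
                f"{name}: expected {types[0].__name__}, got {type(value).__name__}"
            )
    return problems
```

An empty return value means the row matches the listed types. Note that `start` is listed as an integer while `end` is a float, so a row with a float `start` would be reported.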

Dataset structure

Data splits:
- test:
  - num_bytes: 3046340447.0
  - num_examples: 15756

Dataset size

Download size: 2790946881 bytes
Dataset size: 3046340447.0 bytes
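
To put these figures in more familiar units, the short sketch below converts them to GiB and derives the average size per example. It assumes both size values are byte counts, as the `num_bytes` field name suggests; the variable names are illustrative.

```python
# Figures copied from the listing above; both sizes are assumed to be in bytes.
NUM_BYTES = 3_046_340_447
DOWNLOAD_BYTES = 2_790_946_881
NUM_EXAMPLES = 15_756


def to_gib(n_bytes: float) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


dataset_gib = to_gib(NUM_BYTES)
download_gib = to_gib(DOWNLOAD_BYTES)
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES  # download size relative to dataset size

print(f"dataset size:  {dataset_gib:.2f} GiB")
print(f"download size: {download_gib:.2f} GiB")
print(f"average per example: {avg_bytes_per_example / 1024:.1f} KiB")
print(f"download / dataset ratio: {download_ratio:.3f}")
```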

Language

Language: Norwegian Bokmål

License information

License: CC0

Citation information

@inproceedings{ solberg2023improving, title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources}, author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi}, booktitle={The 24th Nordic Conference on Computational Linguistics}, year={2023} }
