
scribe-project/nst_nb_test

Source: Hugging Face. Updated: 2023-04-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/scribe-project/nst_nb_test
Resource description:
---
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: gender
    dtype: string
  - name: utterance_id
    dtype: string
  - name: language
    dtype: string
  - name: raw_text
    dtype: string
  - name: full_audio_file
    dtype: string
  - name: original_data_split
    dtype: string
  - name: region
    dtype: string
  - name: duration
    dtype: float64
  - name: start
    dtype: int64
  - name: end
    dtype: float64
  - name: utterance_audio_file
    dtype: audio
  - name: standardized_text
    dtype: string
  splits:
  - name: test
    num_bytes: 3046340447.0
    num_examples: 15756
  download_size: 2790946881
  dataset_size: 3046340447.0
---

# Dataset Card for NST Bokmål test (< 15 sec. segments)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** <https://github.com/scribe-project/nodalida_2023_combined_training>
- **Paper:**

```
@inproceedings{
solberg2023improving,
title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources},
author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
```

- **Point of Contact:** [Per Erik Solberg](mailto:per.solberg@nb.no)

### Dataset Summary

This is the version of the Bokmål part of the Norwegian NST dataset used for testing the models in the paper *Improving Generalization of Norwegian ASR with Limited Linguistic Resources*, presented at NoDaLiDa 2023. It contains only the test set, restricted to segments shorter than 15 seconds. For a full version of the NST, see [this repository](https://huggingface.co/datasets/NbAiLab/NST).

### Languages

Norwegian Bokmål

## Dataset Creation

### Source Data

The full version of this dataset is found in [the repository of the Norwegian Language Bank](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/).

#### Initial Data Collection and Normalization

The data was retrieved using the [Spraakbanken downloader](https://pypi.org/project/spraakbanken-downloader/) and standardized using the [combined dataset standardization scripts](https://github.com/scribe-project/asr-standardized-combined). Bokmål segments with a duration < 15 seconds were extracted using [this code](https://github.com/scribe-project/nodalida_2023_combined_training/blob/main/make_datasets/make_nst_csvs.ipynb).

## Licensing Information

[CC0](https://creativecommons.org/share-your-work/public-domain/cc0/)

### Citation Information

```
@inproceedings{
solberg2023improving,
title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources},
author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
```
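The card does not include a usage snippet. A minimal sketch of how the test split might be read with the Hugging Face `datasets` library is given below. It assumes `datasets` is installed along with whatever audio backend your version needs to decode the `utterance_audio_file` column. The helper names `stream_test_split` and `describe` are hypothetical and introduced only for illustration. Streaming is used here because the download is close to 2.8 GB.

```python
# Sketch only: assumes `pip install datasets` and an audio decoding backend.
DATASET_ID = "scribe-project/nst_nb_test"  # repository name shown on this page
SPLIT = "test"                             # the only split listed in the card


def stream_test_split():
    """Return a streaming iterator over the test split (hypothetical helper)."""
    # Imported lazily so the rest of this module works without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=SPLIT, streaming=True)


def describe(example: dict) -> str:
    """Format a few text fields of one example for quick inspection."""
    return (
        f"{example['utterance_id']} ({example['duration']:.2f} s): "
        f"{example['standardized_text']}"
    )
```

With network access, `describe(next(iter(stream_test_split())))` should give a one-line summary of the first utterance; the exact content depends on the hosted data.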
Provider:
scribe-project
Summary of original information

Dataset overview

Dataset name

  • Name: NST Bokmål test (< 15 sec. segments)

Dataset features

  • Feature list:
    • speaker_id: string
    • gender: string
    • utterance_id: string
    • language: string
    • raw_text: string
    • full_audio_file: string
    • original_data_split: string
    • region: string
    • duration: float
    • start: integer
    • end: float
    • utterance_audio_file: audio
    • standardized_text: string
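
The feature list above can be written down as a small schema and used to sanity-check individual records, including the card's statement that segments are shorter than 15 seconds. This is an illustrative sketch, not part of the dataset's tooling: `SCHEMA`, `MAX_DURATION_SEC`, and `schema_errors` are hypothetical names, and the audio column is left out because its decoded form depends on the loader.

```python
from typing import List

# Expected Python types per field, following the feature list above.
# utterance_audio_file is omitted: its decoded representation is loader-specific.
SCHEMA = {
    "speaker_id": str,
    "gender": str,
    "utterance_id": str,
    "language": str,
    "raw_text": str,
    "full_audio_file": str,
    "original_data_split": str,
    "region": str,
    "duration": float,
    "start": int,
    "end": float,
    "standardized_text": str,
}

MAX_DURATION_SEC = 15.0  # the card states that segments are shorter than 15 s


def schema_errors(record: dict) -> List[str]:
    """Return a list of problems found in one record; empty means it conforms."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            found = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {found}")
    duration = record.get("duration")
    if isinstance(duration, float) and duration >= MAX_DURATION_SEC:
        errors.append("duration is not below 15 seconds")
    return errors
```

A record that passes returns an empty list, so the check can be dropped into a loop over examples and used to collect only the failures.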

Dataset structure

  • Data splits:
    • test:
      • num_bytes: 3046340447.0
      • num_examples: 15756

Dataset size

  • Download size: 2790946881 bytes
  • Dataset size: 3046340447.0 bytes
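
The raw byte counts are easier to read after conversion. The sketch below turns the figures listed on this page into GiB, an average size per example, and an upper bound on total audio duration. The bound follows only from the card's statement that every segment is shorter than 15 seconds; the true total is not stated on this page. All function names are hypothetical.

```python
NUM_EXAMPLES = 15756          # num_examples of the test split
DATASET_BYTES = 3046340447    # dataset_size from the card metadata
DOWNLOAD_BYTES = 2790946881   # download_size from the card metadata
MAX_SEGMENT_SEC = 15          # every segment is shorter than this


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


def avg_bytes_per_example() -> float:
    """Average on-disk size of one example, audio included."""
    return DATASET_BYTES / NUM_EXAMPLES


def max_total_hours() -> float:
    """Upper bound on total audio, assuming every segment hit the 15 s limit."""
    return NUM_EXAMPLES * MAX_SEGMENT_SEC / 3600


if __name__ == "__main__":
    print(f"dataset:  {to_gib(DATASET_BYTES):.2f} GiB")
    print(f"download: {to_gib(DOWNLOAD_BYTES):.2f} GiB")
    print(f"average:  {avg_bytes_per_example() / 1024:.1f} KiB per example")
    print(f"audio:    fewer than {max_total_hours():.2f} hours")
```

The dataset comes to roughly 2.84 GiB on disk and about 2.60 GiB to download, with less than about 65.65 hours of audio in total.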

Language

  • Language: Norwegian Bokmål

License information

  • License: CC0

Citation information

@inproceedings{
  solberg2023improving,
  title={Improving Generalization of Norwegian {ASR} with Limited Linguistic Resources},
  author={Per Erik Solberg and Pablo Ortiz and Phoebe Parsons and Torbj{\o}rn Svendsen and Giampiero Salvi},
  booktitle={The 24th Nordic Conference on Computational Linguistics},
  year={2023}
}
