NbAiLab/NPSC_test
Hugging Face · Updated 2022-11-07 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/NbAiLab/NPSC_test
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- nb
- 'no'
- nn
license:
- cc0-1.0
multilinguality:
- monolingual
size_categories:
- 2G<n<1B
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- audio-classification
task_ids:
- speech-modeling
pretty_name: NPSC
tags:
- speech-modeling
---
# Dataset Card for NbAiLab/NPSC
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
- [Statistics](#statistics)
- [Document Types](#document-types)
- [Languages](#languages)
- [Publish Periode](#publish-periode)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://www.nb.no/sprakbanken/
- **Repository:** https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-58/
- **Paper:** https://www.nb.no/sprakbanken/
- **Point of Contact:** [Per Erik Solberg](mailto:per.solberg@nb.no)
The Norwegian Parliament Speech Corpus (NPSC) is a corpus for training Norwegian ASR (Automatic Speech Recognition) models. The corpus was created by Språkbanken at the National Library of Norway.
NPSC is based on sound recordings from meetings in the Norwegian Parliament. These talks are orthographically transcribed to either Norwegian Bokmål or Norwegian Nynorsk. In addition to the data actually included in this dataset, the original corpus includes a significant amount of metadata. The speaker id gives access to additional information about the speaker, such as gender, age, and place of birth (i.e. dialect). The proceedings id links the corpus to the official proceedings from the meetings.
In total, the corpus contains sound recordings from 40 entire days of meetings. This amounts to 140 hours of speech, 65,000 sentences or 1.2 million words.
This corpus is an adaptation of the original corpus made for efficient ASR training. For simplicity and portability, a few of the original dataset's features, like the token transcription, are omitted. You can find the complete dataset at [the Resource Catalogue at Språkbanken](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-58/).
## How to Use (This needs to be edited of course)
```python
from datasets import load_dataset
data = load_dataset("nb/NPSC", streaming=True)
```
## Data Fields
Currently there are two versions included in this repo.
### Version A
This version has a short list of the metadata and includes the audio (48k mp3) encoded as a float32 array in the dataset itself.
The current dataloader script is associated with this version.
One line in train.json looks like this:
```json
{
  "sentence_id": 7309,
  "sentence_order": 0,
  "speaker_id": 1,
  "speaker_name": "Marit Nybakk",
  "sentence_text": "Stortingets møte er lovlig satt",
  "sentence_language_code": "nb-NO",
  "text": "Stortingets møte er lovlig satt",
  "start_time": 302650,
  "end_time": 306000,
  "normsentence_text": "Stortingets møte er lovlig satt",
  "transsentence_text": "Stortingets møte er lovleg sett",
  "translated": 1,
  "audio": {
    "path": "audio/20170207-095506_302650_306000.wav",
    "array": [
      24,
      25,
      50,
      (...)
    ],
    "sampling_rate": 48000
  }
}
```
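As a minimal sketch of how the timing fields relate to the audio, the snippet below takes a few fields from the example record and derives the clip length. It assumes that `start_time` and `end_time` are millisecond offsets into the full meeting recording, which this card does not state explicitly; the helper names `clip_duration_seconds` and `expected_num_samples` are hypothetical.

```python
# Fields copied from the example record above (the audio array is omitted).
record = {
    "sentence_id": 7309,
    "sentence_text": "Stortingets møte er lovlig satt",
    "start_time": 302650,
    "end_time": 306000,
    "audio": {
        "path": "audio/20170207-095506_302650_306000.wav",
        "sampling_rate": 48000,
    },
}


def clip_duration_seconds(rec: dict) -> float:
    """Clip length in seconds, ASSUMING start_time/end_time are milliseconds."""
    return (rec["end_time"] - rec["start_time"]) / 1000.0


def expected_num_samples(rec: dict) -> int:
    """Number of samples the clip should hold at its stated sampling rate."""
    return round(clip_duration_seconds(rec) * rec["audio"]["sampling_rate"])


duration = clip_duration_seconds(record)
samples = expected_num_samples(record)
print(f"{duration:.2f} s, about {samples} samples")
```

Comparing `expected_num_samples` with the actual length of `audio["array"]` is one cheap way to spot truncated or misaligned clips, provided the millisecond assumption holds.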
### Version B
This version does not contain the audio encoded in the dataset. Instead, the audio files are placed in sub-directories. There are currently samples in both clips_48k_wav and clips_16k_mp3. Only the base filename is referenced in the dataset. Please note that there are both sentence-based audio clips and meeting-based audio clips. The dataset contains references to both; the latter reference also has a start and stop time.
One line in the train/metadata.json looks like this:
```json
{
  "meeting_date": "20170207",
  "full_audio_file": "20170207-095506",
  "proceedings_file": "20170207-095506.ref",
  "duration": 4442474,
  "transcriber_id": 1,
  "reviewer_id": 2,
  "data_split": "test",
  "speaker_name": "Marit Nybakk",
  "speaker_id": 1,
  "sentence_id": 7309,
  "sentence_language_code": "nb-NO",
  "sentence_text": "Stortingets møte er lovlig satt",
  "sentence_order": 0,
  "audio_file": "20170207-095506_302650_306000",
  "start_time": 302650,
  "end_time": 306000,
  "normsentence_text": "Stortingets møte er lovlig satt",
  "transsentence_text": "Stortingets møte er lovleg sett",
  "translated": 1
}
```
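Because Version B stores only base filenames, a loader has to rebuild the file path itself. The following is a rough sketch, assuming the clips sit directly under the `clips_16k_mp3` and `clips_48k_wav` directories with `.mp3` and `.wav` extensions; the card names the directories but not the exact layout, so treat the mapping as an assumption. `clip_path` is a hypothetical helper.

```python
import json
from pathlib import PurePosixPath

# One metadata line, abridged from the example above.
line = (
    '{"full_audio_file": "20170207-095506", '
    '"audio_file": "20170207-095506_302650_306000", '
    '"start_time": 302650, "end_time": 306000, '
    '"sentence_language_code": "nb-NO"}'
)

# ASSUMED mapping from clip directory to file extension.
EXTENSIONS = {"clips_16k_mp3": ".mp3", "clips_48k_wav": ".wav"}


def clip_path(rec: dict, clips_dir: str = "clips_16k_mp3") -> PurePosixPath:
    """Build the (assumed) relative path of the sentence-level clip."""
    return PurePosixPath(clips_dir) / (rec["audio_file"] + EXTENSIONS[clips_dir])


rec = json.loads(line)
path_16k = clip_path(rec)
path_48k = clip_path(rec, "clips_48k_wav")

# In this example record, the clip name is the meeting file name followed
# by the start and end times.
name_matches = rec["audio_file"] == (
    f'{rec["full_audio_file"]}_{rec["start_time"]}_{rec["end_time"]}'
)

print(path_16k)
print(path_48k)
print(name_matches)
```

The naming pattern checked by `name_matches` is only observed in the single example record; it should be verified against the full metadata before relying on it.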
## Dataset Creation
We provide **train**, **dev** and **test** splits. These are the same as in the original corpus.
Build date: 20 January 2022
### Initial Data Collection and Curation
The procedure for the dataset creation is described in detail in the paper.
## Statistics
| Feature | Value |
|:---------|-----------:|
| Duration, pauses included | 140.3 hours |
| Duration, pauses not included | 125.7 hours |
| Word count | 1.2 million |
| Sentence count | 64,531 |
| Language distribution | Nynorsk: 12.8% |
| | Bokmål: 87.2% |
| Gender distribution | Female: 38.3% |
| | Male: 61.7% |
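A few derived figures follow directly from the statistics table. The short calculation below is only illustrative arithmetic on the published values; the word count is the rounded "1.2 million" from the table, so the words-per-sentence figure is approximate.

```python
# Values taken from the statistics table.
hours_with_pauses = 140.3
hours_without_pauses = 125.7
sentence_count = 64_531
word_count = 1_200_000  # the table gives a rounded "1.2 million"

# Time spent in pauses, in hours and as a share of the total recording time.
pause_hours = hours_with_pauses - hours_without_pauses
pause_share = pause_hours / hours_with_pauses

# Average sentence length in seconds (pauses excluded) and in words.
seconds_per_sentence = hours_without_pauses * 3600 / sentence_count
words_per_sentence = word_count / sentence_count

print(f"pauses: {pause_hours:.1f} h ({pause_share:.1%})")
print(f"mean sentence: {seconds_per_sentence:.1f} s, {words_per_sentence:.1f} words")
```

Figures like the mean sentence duration can help when choosing a maximum input length for ASR training batches.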
## Considerations for Using the Data
This corpus contains speech data and may be used outside the National Library of Norway for speech recognition technology purposes.
### Discussion of Biases
Please refer to our paper.
### Dataset Curators
[Per Erik Solberg](mailto:per.solberg@nb.no)
[Freddy Wetjen](mailto:Freddy.wetjen@nb.no), [Andre Kaasen](mailto:andre.kasen@nb.no) and [Per Egil Kummervold](mailto:per.kummervold@nb.no) have contributed to porting it to the Hugging Face Dataset format.
### Licensing Information
Licensed for use outside the National Library of Norway.
## License
[CC-ZERO](https://creativecommons.org/publicdomain/zero/1.0/)
### Citation Information
We are preparing an article with detailed information about this corpus. Until it is published, please cite our paper discussing the first version of this corpus:
```
ANDRE: TO BE DONE
```
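Provider:
NbAiLab
Summary of Original Information
Dataset Overview
Basic Information
- Name: NPSC (Norwegian Parliament Speech Corpus)
- Language: Norwegian (Bokmål, Nynorsk)
- License: CC0-1.0
- Multilinguality: monolingual
- Size: 2G<n<1B
- Source datasets: original
- Task categories: automatic speech recognition, audio classification
- Task IDs: speech-modeling
- Pretty name: NPSC
- Tags: speech-modeling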
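Dataset Description
- Purpose: training Norwegian automatic speech recognition (ASR) models
- Content: based on recordings of meetings in the Norwegian Parliament, covering 40 days of meetings, 140 hours of speech in total, 65,000 sentences or 1.2 million words
- Features: includes speaker IDs, through which the speaker's gender, age and place of birth (dialect) can be retrieved; the proceedings ID links to the official proceedings of the meetings
- Versions: two versions are provided; Version A includes the audio data, while Version B stores the audio data externally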
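Data Fields
- Version A: contains metadata and audio data (48 kHz mp3), with the audio encoded as a float32 array
- Version B: does not contain encoded audio; the audio files are stored in sub-directories and include both sentence-based and meeting-based audio clips
Dataset Creation
- Data splits: train, dev and test sets are provided
- Build date: 20 January 2022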
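Statistics
- Total duration (pauses included): 140.3 hours
- Total duration (pauses not included): 125.7 hours
- Word count: 1.2 million
- Sentence count: 64,531
- Language distribution: Nynorsk 12.8%, Bokmål 87.2%
- Gender distribution: female 38.3%, male 61.7%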
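Considerations for Use
- Permission: may be used outside the National Library of Norway for speech recognition technology
- Discussion of biases: please refer to the related paper
License
- Type: CC-ZERO
Citation Information
- Pending: an article describing the corpus in detail is in preparation; for now, please cite the paper discussing the first version of the corpus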
Collected Summary
Dataset Introduction

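Construction
In speech recognition, building high-quality corpora is essential for model training. The construction of the Norwegian Parliament Speech Corpus (NPSC) reflects rigorous academic practice: its raw data comes from recordings of meetings in the Norwegian Parliament, covering 40 complete meeting days of audio in total. These recordings were manually transcribed by a professional team to produce the corresponding written text, annotated according to Norway's two official written varieties, Bokmål and Nynorsk. To ensure consistency and usability, some features of the original corpus (such as the token transcription) were removed, and the data was optimized for efficient automatic speech recognition training, resulting in a standardized version with train, dev and test splits.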
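Characteristics
The dataset has distinctive characteristics in the field of speech modeling. Its core value lies in more than 140 hours of high-quality recordings of Norwegian parliamentary meetings, with accurate transcriptions of about 1.2 million words and 64,000 sentences. The language varieties are unevenly distributed: Bokmål accounts for about 87.2% and Nynorsk for about 12.8%, which reflects actual language use in the Norwegian Parliament. In addition, the dataset carries rich metadata: the speaker ID can be linked to gender, age and place of birth (that is, dialect background), offering a valuable dimension for studying sociolinguistic variables in speech models. The data is provided in two versions, with the audio either embedded as arrays or stored as external files, balancing convenience and flexibility.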
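Usage
For researchers developing Norwegian speech recognition models, the dataset offers convenient access. Users can load the data as a stream through the Hugging Face `datasets` library with `load_dataset("nb/NPSC", streaming=True)`. The dataset comes in two versions, A and B: Version A embeds the 48 kHz MP3 audio as a float array inside the JSON structure, along with key fields such as the sentence text and the start and end times; Version B uses separate external audio files (in 48 kHz WAV and 16 kHz MP3 formats), with the JSON file containing only the audio file path and metadata. Researchers can choose the version that suits their computing resources and experimental needs for model training and evaluation.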
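Background and Challenges
Background Overview
The Norwegian Parliament Speech Corpus (NPSC) was created in 2022 by the Språkbanken team at the National Library of Norway to provide training resources for Norwegian automatic speech recognition (ASR) models. The corpus is based on recordings of meetings in the Norwegian Parliament and covers 140 hours of speech, about 65,000 sentences and 1.2 million words, annotated in the two official varieties, Norwegian Bokmål and Nynorsk. Its core research question is how to improve speech recognition performance in low-resource language settings. Its rich metadata (such as speaker gender, age and dialect background) supports multi-dimensional linguistic analysis, and it has had a notable impact on Nordic language technology research and the digital humanities.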
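Current Challenges
At the domain level, NPSC addresses the challenges Norwegian faces as a low-resource language in speech recognition, including dialect diversity, differences between language varieties, and the difficulty of model generalization caused by limited annotated data. During construction, the team had to handle transcription alignment for a large volume of parliamentary recordings, audio segmentation in multi-speaker settings, and the accuracy and consistency of transcriptions across the two written varieties of Norwegian. Integrating the corpus metadata and ensuring privacy compliance were also significant technical obstacles.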
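Common Scenarios
Classic Use Cases
In Norwegian speech recognition, the NbAiLab/NPSC_test dataset, as the test subset of the Norwegian Parliament Speech Corpus, is typically used for evaluating and optimizing automatic speech recognition models. The dataset comes from real recordings of Norwegian parliamentary meetings and covers the two official varieties of Norwegian, Bokmål and Nynorsk, with 140 hours of speech and about 65,000 transcribed sentences in total. Researchers usually use it for end-to-end performance testing of pretrained ASR models, in particular for analyzing recognition accuracy and robustness in multi-dialect, multi-speaker settings. With its fine-grained timestamps and speaker metadata, the dataset supports in-depth study of speech segmentation, conversion between language varieties, and recognition in noisy environments, providing a standardized benchmark for research on Norwegian speech technology.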
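Practical Applications
In practice, the NbAiLab/NPSC_test dataset lays a solid foundation for the industrial deployment of Norwegian speech technology. Speech recognition systems trained on it can be applied directly to real-time transcription of Norwegian parliamentary meetings, improving the efficiency and accuracy of government records. In education and media, such systems can support automatic subtitle generation for Norwegian audio content, improving accessibility for people with hearing impairments. In addition, the dialect diversity in the dataset makes it suitable for developing intelligent assistants and customer service systems that adapt to accents from different regions of Norway, making voice interaction more natural and inclusive. These applications advance the Norwegian speech ecosystem and offer a replicable model for deploying speech technology in other low-resource languages.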
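Derived and Related Work
A body of research has grown around the NbAiLab/NPSC_test dataset. For example, the National Library of Norway, in collaboration with several universities, used the dataset to develop the first open-source end-to-end Norwegian speech recognition model, significantly raising the benchmark performance of Norwegian ASR. Later studies explored metadata-based multi-task learning frameworks that combine speaker characteristics with dialect labels to improve model adaptability across speakers. Other work focused on bias analysis of the dataset and proposed speech data augmentation methods that balance gender and dialect, offering methodological guidance for building fairer speech technology. These results enrich research in Norwegian computational linguistics and contribute key resources and insights to the global community working on speech recognition for low-resource languages.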
The content above was collected and summarized by 遇见数据集.



