classla/ParlaSpeech-PL

Name: classla/ParlaSpeech-PL
Creator: classla
Published: 2025-07-02 06:02:15
License: no description available

Source: Hugging Face. Updated 2025-07-02; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/classla/ParlaSpeech-PL

Resource overview:

The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings in Poland, containing audio segments corresponding to specific sentences in the transcripts. The transcript includes word-level alignments to the recordings, with each instance consisting of character and millisecond start and end offsets, allowing for further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Sequences longer than 30 seconds have been removed from this dataset, making it easy to use on most modern GPUs. Each segment has an identifier referencing the utterance ID and character offsets in the ParlaMint 4.0 corpus. The HuggingFace version of the dataset provides a subset of metadata, including the date, the name of the speaker, their gender, year of birth, party affiliation at that time, the status of the party at that time (coalition or opposition), and party orientation (left, right, center, etc.). Additionally, this version of the dataset includes a text_normalised attribute, which contains the text with parliamentary comments removed.

Provided by:

classla

Summary of original information

The Polish Parliamentary Spoken Dataset ParlaSpeech-PL 1.0

Dataset information

Features

id: string
audio: audio, sampling rate 16000 Hz
text: string
text_normalised: string
words: list, with the following sub-features:
- char_e: integer
- char_s: integer
- time_e: float
- time_s: float
audio_length: float
date: string
speaker_name: string
speaker_gender: string
speaker_birth: string
speaker_party: string
party_orientation: string
party_status: string
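
The feature list above can be mirrored as type hints, which helps when writing code against individual records. The following is a minimal sketch, not part of the dataset's tooling: the class names `Word` and `Record` and the helper `check_word_alignments` are hypothetical, the audio field is left out because its decoded form depends on the loader, and it assumes each entry in `words` arrives as a dict with the four sub-features.

```python
from typing import List, TypedDict


class Word(TypedDict):
    """One word-level alignment entry (assumed shape: one dict per word)."""
    char_s: int    # start character offset
    char_e: int    # end character offset
    time_s: float  # start time offset
    time_e: float  # end time offset


class Record(TypedDict, total=False):
    """Non-audio fields of one example, following the feature list."""
    id: str
    text: str
    text_normalised: str
    words: List[Word]
    audio_length: float
    date: str
    speaker_name: str
    speaker_gender: str
    speaker_birth: str
    speaker_party: str
    party_orientation: str
    party_status: str


def check_word_alignments(words: List[Word]) -> List[int]:
    """Return the indices of words whose end offset precedes the start offset."""
    return [
        i
        for i, w in enumerate(words)
        if w["char_e"] < w["char_s"] or w["time_e"] < w["time_s"]
    ]
```

A check like this is only a sanity test on the offsets; it says nothing about whether the alignment itself is accurate.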

Splits

train: 530773 examples, 61274022869.885 bytes

Data size

Download size: 60791222740 bytes
Dataset size: 61274022869.885 bytes

Configs

default: training data files are located at data/train-*
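
With a download size of roughly 60 GB, it can be convenient to stream the train split rather than fetch it in full. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID shown in the download link; the function names are hypothetical. Iterating a streamed split decodes the audio column, which may require an audio backend to be installed.

```python
REPO_ID = "classla/ParlaSpeech-PL"  # repository ID taken from the download link


def stream_train_split():
    """Return an iterable over the train split without downloading it in full."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=True)


def first_record_without_audio(stream):
    """Fetch the first record and drop the audio payload for easier inspection."""
    record = next(iter(stream))
    return {key: value for key, value in record.items() if key != "audio"}
```

Typical use would be `first_record_without_audio(stream_train_split())`, which needs network access.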

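Dataset description

The dataset is built from the ParlaMint corpus of Polish parliamentary proceedings and from the session recordings on the Polish Parliament's YouTube channel. The corpus contains audio segments that correspond to specific sentences in the transcripts. The transcripts include word-level alignments to the recordings, each instance consisting of character and millisecond start and end offsets, which allows long sentences to be split further for ASR and other memory-sensitive applications. Sequences longer than 30 seconds have been removed from the dataset, which should allow straightforward use on most modern GPUs.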
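
The word-level alignments make it possible to cut a long sentence into shorter pieces at word boundaries. The following is one possible greedy approach, sketched under assumptions: the helper names are hypothetical, `words` is taken to be a list of dicts with `char_s`, `char_e`, `time_s`, and `time_e`, and `max_duration` must be given in the same unit as the time offsets, whatever that unit is in the data.

```python
from typing import Dict, List


def _span(group: List[Dict]) -> Dict:
    """Collapse a run of consecutive words into one character and time span."""
    return {
        "char_s": group[0]["char_s"],
        "char_e": group[-1]["char_e"],
        "time_s": group[0]["time_s"],
        "time_e": group[-1]["time_e"],
    }


def split_by_words(words: List[Dict], max_duration: float) -> List[Dict]:
    """Greedily group consecutive words into chunks of at most max_duration.

    A single word longer than max_duration still becomes its own chunk.
    """
    chunks: List[Dict] = []
    current: List[Dict] = []
    for word in words:
        # Start a new chunk if adding this word would exceed the limit.
        if current and word["time_e"] - current[0]["time_s"] > max_duration:
            chunks.append(_span(current))
            current = []
        current.append(word)
    if current:
        chunks.append(_span(current))
    return chunks
```

Each returned span can then be used to slice the audio and the text, once the unit of the time offsets and the reference point of the character offsets have been confirmed against real records.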

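Each segment has an identifier that references the ParlaMint 4.0 corpus (via utterance ID and character offsets).

In the original dataset, all speaker information from the ParlaMint corpus is available via the speaker_info attribute, whereas the HuggingFace version provides only a subset of the metadata, namely: date, speaker's name, gender, year of birth, party affiliation at that time, the status of the party at that time (coalition or opposition), and party orientation (left, right, centre, etc.).

Unlike the original dataset, this version also has a text_normalised attribute, which contains the text with parliamentary comments (such as [[Applause]]) removed.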
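
To illustrate what removing parliamentary comments might look like, here is a rough approximation in Python. It is not the procedure used to produce text_normalised, which is not described here; the function name is hypothetical, and the sketch assumes that comments are always wrapped in double square brackets, as in the [[Applause]] example.

```python
import re

# Matches a double-bracketed comment such as [[Applause]] (non-greedy).
COMMENT_PATTERN = re.compile(r"\[\[.*?\]\]")


def strip_comments(text: str) -> str:
    """Remove [[...]] comments and collapse the whitespace left behind."""
    without_comments = COMMENT_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_comments).strip()
```

Comparing the output of such a function with the provided text_normalised field on a few records would show whether the actual normalisation does more than this.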
