
classla/ParlaSpeech-CZ

Source: Hugging Face. Updated 2025-07-02; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/classla/ParlaSpeech-CZ
Description:
The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings in the Czech part of the ParlaMint corpus and the parliamentary recordings in the AudioPSP dataset. It consists of audio segments that correspond to specific sentences in the transcripts. The transcripts carry word-level alignments to the recordings: each instance has character and millisecond start and end offsets, which allows simple further segmentation of long sentences for ASR and other memory-sensitive applications. Sequences longer than 30 seconds have been removed so that the data can be used on most modern GPUs. Each segment has an identifier that references the ParlaMint 4.0 corpus via the utterance ID and character offsets. Compared to the original dataset, the Hugging Face version contains only a subset of the metadata: date, speaker's name, gender, year of birth, party affiliation at the time, party status (coalition or opposition), and party orientation (left, right, centre, etc.). This version also includes a `text_normalised` attribute, which contains the text with parliamentary comments (such as `[[Applause]]`) removed.
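
The two ideas in the description, removing bracketed parliamentary comments and using word-level millisecond offsets to cut a segment into shorter pieces, can be sketched as follows. This is a minimal illustration, not the dataset's actual tooling: the `(word, start_ms, end_ms)` layout, the function names, and the `[[...]]` regular expression are all assumptions made for the example, and the real field names and normalisation procedure may differ.

```python
import re

# Hypothetical word-alignment layout: (word, start_ms, end_ms).
# The real dataset's field names and structure may differ.
Word = tuple[str, int, int]

# Assumed comment syntax, based only on the `[[Applause]]` example.
COMMENT_RE = re.compile(r"\[\[.*?\]\]")


def strip_comments(text: str) -> str:
    """Remove [[...]] comments and collapse the leftover whitespace."""
    return " ".join(COMMENT_RE.sub(" ", text).split())


def split_by_duration(words: list[Word], max_ms: int) -> list[list[Word]]:
    """Greedily group consecutive words into chunks spanning at most max_ms.

    A single word longer than max_ms still forms its own chunk.
    """
    chunks: list[list[Word]] = []
    current: list[Word] = []
    for word in words:
        # Span is measured from the start of the chunk's first word
        # to the end of the candidate word.
        if current and word[2] - current[0][1] > max_ms:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(strip_comments("Děkuji. [[Applause]] Pokračujme."))
    demo = [("a", 0, 400), ("b", 500, 900), ("c", 1000, 1600), ("d", 1700, 2100)]
    for chunk in split_by_duration(demo, max_ms=1000):
        print(chunk[0][1], chunk[-1][2], [w for w, _, _ in chunk])
```

The start and end offsets of each chunk (first word's start, last word's end) are what one would use to slice the corresponding audio samples; the greedy grouping is just one simple choice among several possible segmentation strategies.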
Provider:
classla