Transcribed Corpus of the LIBE Committee
Source: arXiv. Updated: 2023-04-17. Indexed: 2024-06-21.
Download link:
https://github.com/hdvos/EUParliamentASRDataAndCode
Resource overview:
This study introduces a transcribed corpus of the European Parliament's LIBE Committee (Civil Liberties, Justice and Home Affairs), created at Leiden University and totalling 3.6 million running words. The corpus is derived from audio recordings of committee meetings and was transcribed with automatic speech recognition (ASR). To build it, the research team used the Transformer-based wav2vec 2.0 model and applied domain-specific optimization, which the authors report significantly improved transcription accuracy. The transcripts cover detailed political debates and discussions, giving political scientists material for studying political dynamics within the European Union and giving linguists material for studying political discourse and the role of interpreters.
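The overview refers to transcription accuracy without naming a metric. ASR output is commonly scored by word error rate (WER): the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below assumes WER is the relevant measure here, which the listing does not confirm; the function name and the example sentences are illustrative and are not taken from the corpus or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the corpus repository). Text is
    lowercased and split on whitespace; no other normalization is applied.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word missing (deletion)
                curr[j - 1] + 1,                  # extra hypothesis word (insertion)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion over 8 reference words.
reference = "the committee on civil liberties will now vote"
hypothesis = "the committee on civil liberty will vote"
print(word_error_rate(reference, hypothesis))
```

A lower value means a more accurate transcript. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.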
Provided by:
Leiden University
Created:
2023-04-17



