Replication Data for: Potential and Pitfalls of Audio-as-Data: alignment, features and classification models
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://doi.org/10.7910/DVN/K3I16E
Description:
Political science is a field rich in multimodal information sources, from televised debates to parliamentary briefings. This paper bridges a gap between computer science and political science in multimodal data analysis using audio. Adoption of multimodal analyses in political science (e.g., video or audio combined with text-as-data approaches) has been relatively slow because the computational power and skills they require are unevenly distributed. We provide solutions to challenges encountered when analyzing audio, advancing the potential for multimodal data analysis in political science. Using a dataset of all televised US presidential debates from 1960 to 2020, we focus on three types of features encountered when analyzing audio data: low-level descriptors (LLDs) such as pitch or energy, Mel-frequency cepstral coefficients (MFCCs), and audio embeddings/encodings such as Wav2Vec. We showcase four applications: a) forced alignment of audio and text using MFCCs, timestamping transcripts and speaker information; b) speech characterization using LLDs; c) custom-made classification models with audio embeddings and MFCCs; and d) emotion recognition models using Wav2Vec for classification of discrete emotions and their valence-arousal-dominance dimensions. We explain how these features can be applied to different political research questions and advise vigilance against naive interpretation, for both experienced researchers and those who want to start working with audio.
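As a rough illustration of what a low-level descriptor is, the sketch below computes two common frame-level LLDs, RMS energy and zero-crossing rate, on a synthetic signal using only the Python standard library. It is not taken from the replication package: the function names, the 25 ms window, the 10 ms hop, and the test signal are all assumptions chosen for illustration, and the authors' actual tooling may differ.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]

def rms_energy(frame):
    """Root-mean-square amplitude of one frame (a simple energy LLD)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Hypothetical test signal: 0.5 s of a 220 Hz tone followed by 0.5 s of silence.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
        for n in range(sample_rate // 2)]
silence = [0.0] * (sample_rate // 2)
signal = tone + silence

# Assumed framing: 25 ms window (400 samples), 10 ms hop (160 samples).
frames = frame_signal(signal, frame_len=400, hop_len=160)
energy = [rms_energy(f) for f in frames]
zcr = [zero_crossing_rate(f) for f in frames]

print(f"{len(frames)} frames; first-frame RMS {energy[0]:.3f}, last-frame RMS {energy[-1]:.3f}")
```

Each frame yields one value per descriptor, so a recording becomes a time series that can be summarized per speaker or per utterance. Real speech needs more care than this toy signal (for example, pitch tracking and noise handling), which is part of the interpretive caution the description calls for.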
Created:
2025-11-22



