PersonalHub
Source: ModelScope community. Updated 2025-12-04; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/PersonalHub
Resource overview:
# Personal Hub: Exploring High-Expressiveness Speech Data through Spatio-Temporal Feature Integration and Model Fine-Tuning

# Introduction
In this work, we present Personal Hub, a novel framework for mining and utilizing high-expressivity speech data by integrating spatio-temporal context with combinatorial attribute control. At the core of our approach lies a Speech Attribute Matrix, which lets annotators systematically combine speaker-related features (age, gender, emotion, accent, and environment) with temporal metadata to curate speech samples with varied, rich expressive characteristics.
Based on this matrix-driven data collection paradigm, we construct a multi-level expressivity dataset, categorized into three tiers according to the diversity and complexity of attribute combinations. We then investigate the benefits of this curated data through two lines of model fine-tuning: (1) automatic speech recognition (ASR) models, where we demonstrate that incorporating high-expressivity data accelerates convergence and enhances learned acoustic representations, and (2) large end-to-end speech models, where both human and model-based evaluations reveal improved interactional and expressive capabilities after fine-tuning. Our results underscore the potential of high-expressivity speech datasets to enhance both task-specific performance and the overall communicative competence of speech AI systems.
# Method
## Filter for Usable Audio
To ensure the quality and consistency of the audio data, we applied the following preprocessing steps:
- Duration Filtering: Audio clips shorter than 5 seconds or longer than 15 seconds were excluded to maintain a consistent length range suitable for analysis.
- Resampling: All audio files were resampled to a 16 kHz sampling rate, which is commonly used in speech processing tasks to balance quality and computational efficiency.
- Channel Conversion: Stereo audio files were converted to mono by averaging the left and right channels. This step ensures uniformity across the dataset and simplifies subsequent processing.
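
The three preprocessing steps above can be sketched roughly as follows. This is a dependency-free illustration, not the authors' pipeline: the function names are hypothetical, the order of operations (duration check, then mono conversion, then resampling) is an assumption, and the linear-interpolation resampler is a simple stand-in with no anti-alias filtering. A production pipeline would more likely use an audio library with a band-limited resampler.

```python
from typing import List, Optional, Sequence

TARGET_SR = 16_000   # target sampling rate stated in the text (Hz)
MIN_SECONDS = 5.0    # clips shorter than this are excluded
MAX_SECONDS = 15.0   # clips longer than this are excluded


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels sample by sample (left and right for stereo)."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples: Sequence[float], src_sr: int,
                    dst_sr: int = TARGET_SR) -> List[float]:
    """Naive linear-interpolation resampler (assumption: stand-in only)."""
    if src_sr == dst_sr or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_sr / src_sr)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr      # fractional position in the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def preprocess(channels: Sequence[Sequence[float]],
               src_sr: int) -> Optional[List[float]]:
    """Return 16 kHz mono samples, or None if the clip fails the duration filter."""
    duration = len(channels[0]) / src_sr
    if duration < MIN_SECONDS or duration > MAX_SECONDS:
        return None
    return resample_linear(to_mono(channels), src_sr)
```

Here `channels` is assumed to hold one equal-length sequence of float samples per channel, so a mono clip is passed as a single-element list.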
## Filter for Transcription
We used Whisper-Large-v3-turbo to evaluate transcription quality, retaining only samples with a Word Error Rate (WER) below 0.1. This model was chosen for its strong performance and fast inference, making it suitable for large-scale filtering. The WER threshold ensures high-quality transcriptions and reduces noise for downstream tasks.
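
As a minimal sketch of the thresholding step, the code below computes WER as the word-level edit distance divided by the number of reference words and keeps a sample only when the result is below 0.1. It assumes the Whisper hypothesis has already been produced elsewhere and is passed in as a string. The function names and the text normalization (lowercasing and whitespace splitting) are assumptions for illustration; the source does not say how transcripts were normalized before scoring.

```python
from typing import Sequence

WER_THRESHOLD = 0.1  # samples are kept only when WER is below this value


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of the hypothesis against the reference transcript."""
    ref = reference.lower().split()   # assumed normalization
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(ref, hyp) / len(ref)


def keep_sample(reference: str, hypothesis: str,
                threshold: float = WER_THRESHOLD) -> bool:
    """True when the sample passes the transcription-quality filter."""
    return wer(reference, hypothesis) < threshold
```

Because the comparison is strict, a ten-word reference with a single word error (WER of exactly 0.1) would be rejected under this reading of "below 0.1".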
## Filter for Gender
Manual verification was conducted by four annotators. Only samples with unanimous agreement among all four were retained; others were discarded.
## Filter for Emotion
Emotion labels were verified the same way as gender labels: four annotators manually reviewed each sample, and only samples with unanimous agreement were kept.
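
The unanimous-agreement rule used for both the gender and the emotion filters can be expressed in a few lines. The sketch below is illustrative only: the function names, the dictionary layout (sample ID mapped to the list of annotator labels), and the decision to discard samples that do not have exactly four labels are all assumptions, not details given in the source.

```python
from typing import Dict, Optional, Sequence

NUM_ANNOTATORS = 4  # number of annotators stated in the text


def unanimous_label(labels: Sequence[str],
                    required: int = NUM_ANNOTATORS) -> Optional[str]:
    """Return the shared label if all annotators agree, otherwise None."""
    if len(labels) != required:
        return None  # assumption: incomplete annotations are discarded
    return labels[0] if len(set(labels)) == 1 else None


def filter_unanimous(annotations: Dict[str, Sequence[str]]) -> Dict[str, str]:
    """Keep only samples whose annotators all chose the same label."""
    kept = {}
    for sample_id, labels in annotations.items():
        label = unanimous_label(labels)
        if label is not None:
            kept[sample_id] = label
    return kept


if __name__ == "__main__":
    example = {
        "clip_001": ["female", "female", "female", "female"],
        "clip_002": ["female", "female", "male", "female"],
    }
    print(filter_unanimous(example))
```

The same function works for emotion labels, since it only compares label strings and makes no assumption about the label set.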
# Data Sources
## Split: only_gender_reliable
- [CommonVoice](https://commonvoice.mozilla.org/)
- [VCTK](https://datashare.is.ed.ac.uk/handle/10283/2651)
- [LibriSpeech](https://www.openslr.org/12)
## Split: emotion_reliable
- [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D)
- [RAVDESS](https://zenodo.org/records/1188976#.YFZuJ0j7SL8)
- [MEAD](https://github.com/uniBruce/Mead)
- [TESS](https://utoronto.scholaris.ca/collections/036db644-9790-4ed0-90cc-be1dfb8a4b66)
- [SAVEE](http://kahlan.eps.surrey.ac.uk/savee/)
- [ESD](https://hltsingapore.github.io/ESD/)
Provider:
maas
Created:
2025-05-18



