Download link:

https://modelscope.cn/datasets/stepfun-ai/StepEval-Audio-Paralinguistic

Official service:

Resource description:

# StepEval-Audio-Paralinguistic Dataset

Paper: [Step-Audio 2 Technical Report](https://huggingface.co/papers/2507.16632)

Code: https://github.com/stepfun-ai/Step-Audio2

Project Page: https://www.stepfun.com/docs/en/step-audio2

## Overview

StepEval-Audio-Paralinguistic is a speech-to-speech benchmark designed to evaluate AI models' understanding of paralinguistic information in speech across 11 distinct dimensions. The dataset contains 550 carefully curated and annotated speech samples for assessing capabilities beyond semantic understanding.

## Key Features

- **Comprehensive coverage**: 11 paralinguistic dimensions with 50 samples each
- **Diverse sources**: Combines podcast recordings with AudioSet, CochlScene, and VocalSound samples
- **High-quality annotations**: Professionally verified open-set natural language descriptions
- **Challenging construction**: Includes synthesized question mixing and audio augmentation
- **Standardized evaluation**: Comes with automatic evaluation protocols

## Dataset Composition

### Core Categories

1. **Basic Attributes**
   - Gender identification
   - Age classification
   - Timbre description
2. **Speech Characteristics**
   - Emotion recognition
   - Pitch classification
   - Rhythm patterns
   - Speaking speed
   - Speaking style
3. **Environmental Sounds**
   - Scenario detection
   - Sound event recognition
   - Vocal sound identification

### Task Categories and Label Distributions

| Category | Task Description | Label Distribution | Total Samples |
|---------------|-----------------------------------|------------------------------------------------------------------------|---------------|
| **Gender** | Identify speaker's gender | Male: 25, Female: 25 | 50 |
| **Age** | Classify speaker's age | 20y:6, 25y:6, 30y:5, 35y:5, 40y:5, 45y:4, 50y:4 + Child:7, Elderly:8 | 50 |
| **Speed** | Categorize speaking speed | Slow:10, Medium-slow:10, Medium:10, Medium-fast:10, Fast:10 | 50 |
| **Emotion** | Recognize emotional states | Anger, Joy, Sadness, Surprise, Sarcasm, etc. (50 manually annotated) | 50 |
| **Scenarios** | Detect background scenes | Indoor:14, Outdoor:12, Restaurant:6, Kitchen:6, Park:6, Subway:6 | 50 |
| **Vocal** | Identify non-speech vocal effects | Cough:14, Sniff:8, Sneeze:7, Throat-clearing:6, Laugh:5, Sigh:5, Other:5 | 50 |
| **Style** | Distinguish speaking styles | Dialogue:4, Discussion:4, Narration:8, Commentary:8, Colloquial:8, Speech:8, Other:10 | 50 |
| **Rhythm** | Characterize rhythm patterns | Steady:10, Fluent:10, Paused:10, Hurried:10, Fluctuating:10 | 50 |
| **Pitch** | Classify dominant pitch ranges | Mid:12, Mid-high:14, High:12, Mid-low:12 | 50 |
| **Event** | Recognize non-vocal audio events | Music:8, Other events:42 (from AudioSet) | 50 |

**Dataset Notes:**

- Total samples: 550 (50 per category × 11 categories)
- Underrepresented categories were augmented to ensure diversity
- Scene/event categories use synthetic audio mixing with controlled parameters
- All audio samples are ≤30 seconds in duration

## Data Collection & Processing

### Preprocessing Pipeline

- All audio resampled to 24,000 Hz
- Strict duration control (≤30 seconds)
- Demographic balancing for underrepresented groups
- Professional annotation verification

### Special Enhancements

- **Scenario**: 6 environmental types mixed (from CochlScene)
- **Event**: AudioSet samples mixed
- **Vocal**: 7 paralinguistic types inserted (from VocalSound)

## Dataset Construction

1. Collected raw speech samples from diverse sources
2. Generated text-based QA pairs aligned with annotations
3. Converted QAs to audio using TTS synthesis
4. Randomly inserted question clips before/after original utterances
5. For environmental sounds: additional audio mixing before question concatenation

## Evaluation Protocol

The benchmark evaluation follows a standardized three-phase process:

### 1. Model Response Collection

Audio-in/audio-out models are queried through their APIs using the original audio files as input. Each 24 kHz audio sample (≤30 s duration) generates a corresponding response audio, saved with a matching filename for traceability.

### 2. Speech-to-Text Conversion

All model response audios are transcribed using an ASR system. Transcripts undergo automatic text normalization and are stored.

### 3. Automated Assessment

The evaluation script (`LLM_judge.py`) compares ASR transcripts against ground truth annotations using an LLM judge. Scoring considers semantic similarity rather than exact matches, with partial credit for partially correct responses. The final metrics include per-category accuracy scores.
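The last phase of the protocol reduces the judge's verdicts to per-category accuracy. The sketch below illustrates one way that aggregation could look. It is not taken from `LLM_judge.py`: the `(category, score)` record layout, the assumption that scores lie in `[0, 1]` with fractional values standing for partial credit, and the function names `per_category_accuracy` and `macro_average` are all hypothetical. The LLM judge itself is not called here.

```python
from collections import defaultdict


def per_category_accuracy(judgements):
    """Average judge scores per category and return them as percentages.

    `judgements` is an iterable of (category, score) pairs. The score is
    assumed (hypothetically) to lie in [0, 1]: 1.0 means correct, 0.0 means
    wrong, and values in between represent partial credit.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, score in judgements:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {category!r}: {score}")
        totals[category] += score
        counts[category] += 1
    return {c: 100.0 * totals[c] / counts[c] for c in totals}


def macro_average(accuracies):
    """Unweighted mean over categories (each category counts equally)."""
    return sum(accuracies.values()) / len(accuracies)


# Toy input, not real benchmark data.
example = [
    ("Gender", 1.0),
    ("Gender", 0.0),
    ("Emotion", 0.5),  # partially correct response
    ("Emotion", 1.0),
]
example_accuracy = per_category_accuracy(example)
example_overall = macro_average(example_accuracy)
```

Because every category in this benchmark has the same number of samples (50), an unweighted mean over categories and a mean over all samples would coincide. The sketch uses the per-category form because the protocol reports per-category accuracy.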
### Benchmark Results on StepEval-Audio-Paralinguistic

| Model | Avg | Gender | Age | Timbre | Scenario | Event | Emotion | Pitch | Rhythm | Speed | Style | Vocal |
|------------------|-----------|--------|--------|--------|----------|-------|---------|--------|--------|--------|--------|--------|
| GPT-4o Audio | 43.45 | 18 | 42 | 34 | 22 | 14 | 82 | 40 | 60 | 58 | 64 | 44 |
| Kimi-Audio | 49.64 | 94 | 50 | 10 | 30 | 48 | 66 | 56 | 40 | 44 | 54 | 54 |
| Qwen-Omni | 44.18 | 40 | 50 | 16 | 28 | 42 | 76 | 32 | 54 | 50 | 50 | 48 |
| Step-Audio-AQAA | 36.91 | 70 | 66 | 18 | 14 | 14 | 40 | 38 | 48 | 54 | 44 | 0 |
| **Step-Audio 2** | **76.55** | **98** | **92** | **78** | **64** | 46 | 72 | **78** | **70** | **78** | **84** | **82** |
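The Avg column appears to be the unweighted mean of the 11 per-category scores. As a quick check, the following sketch recomputes it from the values in the table. The names `SCORES`, `REPORTED_AVG`, and `recomputed` are introduced here purely for illustration.

```python
# Per-category scores copied from the results table, in column order:
# Gender, Age, Timbre, Scenario, Event, Emotion, Pitch, Rhythm, Speed, Style, Vocal
SCORES = {
    "GPT-4o Audio":    [18, 42, 34, 22, 14, 82, 40, 60, 58, 64, 44],
    "Kimi-Audio":      [94, 50, 10, 30, 48, 66, 56, 40, 44, 54, 54],
    "Qwen-Omni":       [40, 50, 16, 28, 42, 76, 32, 54, 50, 50, 48],
    "Step-Audio-AQAA": [70, 66, 18, 14, 14, 40, 38, 48, 54, 44, 0],
    "Step-Audio 2":    [98, 92, 78, 64, 46, 72, 78, 70, 78, 84, 82],
}

# Avg column as reported in the table.
REPORTED_AVG = {
    "GPT-4o Audio": 43.45,
    "Kimi-Audio": 49.64,
    "Qwen-Omni": 44.18,
    "Step-Audio-AQAA": 36.91,
    "Step-Audio 2": 76.55,
}

# Unweighted mean over the 11 categories, rounded to two decimals.
recomputed = {
    model: round(sum(values) / len(values), 2)
    for model, values in SCORES.items()
}

for model, avg in recomputed.items():
    # Allow for rounding differences in the second decimal place.
    matches = abs(avg - REPORTED_AVG[model]) < 0.005
    print(f"{model}: recomputed {avg:.2f}, reported {REPORTED_AVG[model]:.2f}, match={matches}")
```

All five recomputed means agree with the reported Avg values to two decimal places, which supports reading Avg as a simple macro-average across categories.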

Application scenarios: