Training Data for Speech Intent Recognition Models

Name: Training Data for Speech Intent Recognition Models
Creator: 杭州秋果计划科技有限公司
Published: 2025-09-01 11:14:18
License: No description provided

Zhejiang Provincial Data Intellectual Property Registration Platform (浙江省数据知识产权登记平台). Updated 2025-09-01; indexed 2025-09-06.

Download link:

https://www.zjip.org.cn/home/announce/trends/173187

Resource description:

This training dataset is intended for training and optimizing speech intent recognition models. A trained intent recognition model understands the speaker's intent by analyzing audio content. In smart-glasses human-computer interaction, a model that accurately understands the user's intent lets the glasses offer a more natural and intuitive way to interact, which improves the user experience and interaction efficiency. For example, when the user simply gazes at an object or speaks a command, the system can recognize the intent and carry out the corresponding operation, such as taking a photo or looking up information.

(1) Data Collection: Text data [text] and text intent data [intent], manually collected and generated.

(2) Data Processing: The voice features associated with each speaker ID [speaker_audio] serve as input to the text-to-speech (TTS) models; different speaker IDs correspond to different voice features. Multiple TTS models convert the text data into audio data, which is stored at the storage path [audio_path]. Four automatic speech recognition (ASR) models then transcribe the audio data:
- Qwen ASR model: recognition result [text_qwen]
- Paddle ASR model: recognition result [text_paddle]
- Whisper ASR model: recognition result [text_whisper]
- Paraformer ASR model: recognition result [text_paraformer]
Segment voting over these four recognition results yields the consensus result [text_vote], the spans where the results differ are marked [diff_spans], and the consistency rate between the voted result and the original text [text] is calculated.

(3) Model Training: An ASR deep learning model is trained on the annotated dataset, with speech audio as the model input and the original text [text] and its intent [intent] as the model outputs. Training starts from an ASR model based on the Whisper or Paraformer architecture. Once it receives audio input from the user, the trained ASR model can output the user's intent.

(4) Hyperparameter Tuning: Hyperparameters such as learning rate, batch size, and number of network layers are tuned to optimize model performance.

(5) Model Optimization and Validation: Based on the evaluation results, optimization measures such as pruning and regularization are applied to the model. Model performance is validated on an independent test set to ensure the model also performs well on unseen data.
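
The description of step (2) does not say how segment voting or the consistency rate is computed, so the following is only a minimal sketch under stated assumptions. It aligns every hypothesis to the first one (the pivot) with Python's `difflib`, takes a majority vote per aligned segment with ties going to the pivot, records the spans of the voted text where the systems disagreed, and uses `SequenceMatcher.ratio()` as a stand-in for the consistency rate. The helper names (`align_to_pivot`, `segment_vote`, `consistency_rate`) and the sample record values are hypothetical; only the field names in brackets come from the description.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_pivot(pivot: str, hyp: str) -> list[str]:
    """Return len(pivot) + 1 slots: slot 0 holds text `hyp` inserts before the
    pivot, slot i + 1 holds what `hyp` says in place of pivot[i]."""
    slots = [""] * (len(pivot) + 1)
    matcher = SequenceMatcher(None, pivot, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):      # one-to-one character mapping
                slots[i1 + 1 + k] += hyp[j1 + k]
        elif tag == "replace":            # unequal lengths: keep the segment whole
            slots[i1 + 1] += hyp[j1:j2]
        elif tag == "insert":             # attach to the slot just before pivot[i1]
            slots[i1] += hyp[j1:j2]
        # "delete": the slots for pivot[i1:i2] simply stay empty
    return slots


def segment_vote(hyps: list[str]) -> tuple[str, list[tuple[int, int]]]:
    """Majority vote per aligned slot (assumed scheme, needs at least one hypothesis).
    Returns the voted text and (start, end) spans of it where systems disagreed."""
    pivot = hyps[0]
    columns = zip(*(align_to_pivot(pivot, h) for h in hyps))
    voted, diff_spans, pos = [], [], 0
    for column in columns:
        # most_common keeps first-seen order on ties, so the pivot wins ties
        winner = Counter(column).most_common(1)[0][0]
        if len(set(column)) > 1:          # at least one system disagreed here
            if diff_spans and diff_spans[-1][1] == pos:
                diff_spans[-1] = (diff_spans[-1][0], pos + len(winner))
            else:
                diff_spans.append((pos, pos + len(winner)))
        voted.append(winner)
        pos += len(winner)
    return "".join(voted), diff_spans


def consistency_rate(voted: str, reference: str) -> float:
    """Similarity in [0, 1]; a stand-in metric, not the dataset's definition."""
    return SequenceMatcher(None, reference, voted, autojunk=False).ratio()


# Hypothetical record using the field names from the description.
record = {
    "text": "take a photo",
    "intent": "take_photo",
    "text_qwen": "take a foto",
    "text_paddle": "take a photo",
    "text_whisper": "take a photo",
    "text_paraformer": "bake a photo",
}
asr_fields = ["text_qwen", "text_paddle", "text_whisper", "text_paraformer"]
record["text_vote"], record["diff_spans"] = segment_vote([record[f] for f in asr_fields])
print(record["text_vote"], record["diff_spans"])
print(consistency_rate(record["text_vote"], record["text"]))
```

In this toy record the pivot itself contains an error ("foto"), and the vote replaces it with the majority reading. A production pipeline would more likely vote over tokens or words and use a character or word error rate, but that choice is not stated in the description.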

Provider:

杭州秋果计划科技有限公司

Created:

2025-06-12

Collection Summary

Dataset Introduction