spmis

Name: spmis
Creator: maas
Published: 2025-11-23 02:02:12
License: No description available

ModelScope Community: updated 2025-11-23, indexed 2024-11-02

Download link:

https://modelscope.cn/datasets/amphion/spmis


Resource description:

# Dataset Card for SpMis: Synthetic Spoken Misinformation Dataset

The **SpMis Dataset** is designed to facilitate research on detecting synthetic spoken misinformation. It includes 360,611 audio samples synthesized from over 1,000 speakers across five major topics: Politics, Medicine, Education, Laws, and Finance, with 8,681 samples labeled as misinformation.

## Dataset Details

### Dataset Description

This dataset contains synthetic spoken audio clips generated using state-of-the-art TTS models such as Amphion and OpenVoice v2, labeled to indicate whether the speech is genuine or misinformation. The dataset is designed to assist in the development of models capable of detecting both synthetic speech and misinformation.

### Dataset Sources

- **Repository:** https://huggingface.co/datasets/amphion/spmis
- **Paper:** https://arxiv.org/abs/2409.11308

## Uses

### Direct Use

The dataset is intended for training and evaluating models on the following task:

- **Misinformation detection**: Identifying whether the spoken content is intended to mislead.

Before use, convert the mp3 files to .wav files and change the file name to `spmis_data`.

## Dataset Structure

The dataset includes:

- **Audio files**: 360,611 TTS-generated speech samples.
- **Labels**: Misinformation, ordinary speech, or synthesized celebrity speech.
- **Metadata**: Speaker identity, topic, duration, and language.

The dataset is divided into the following topics:

1. **Politics**: 76,542 samples, 1,740 labeled as misinformation.
2. **Medicine**: 21,836 samples, 740 labeled as misinformation.
3. **Education**: 177,392 samples, 2,970 labeled as misinformation.
4. **Laws**: 11,422 samples, 862 labeled as misinformation.
5. **Finance**: 53,011 samples, 2,369 labeled as misinformation.
6. **Other**: 20,408 samples with no misinformation labels.
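The per-topic counts can be checked against the headline totals with a few lines of arithmetic. The sketch below copies the numbers from the topic list and assumes that "no misinformation labels" for the **Other** topic means zero misinformation samples; the names `TOPICS` and `misinfo_rate` are illustrative and are not part of the dataset.

```python
# Per-topic counts copied from the dataset card: (total samples, misinformation samples).
# Assumption: "Other" has no misinformation labels, so its count is treated as 0.
TOPICS = {
    "Politics": (76_542, 1_740),
    "Medicine": (21_836, 740),
    "Education": (177_392, 2_970),
    "Laws": (11_422, 862),
    "Finance": (53_011, 2_369),
    "Other": (20_408, 0),
}

# Totals across all topics, to compare with the card's headline figures.
total_samples = sum(total for total, _ in TOPICS.values())
total_misinfo = sum(misinfo for _, misinfo in TOPICS.values())


def misinfo_rate(topic: str) -> float:
    """Fraction of a topic's samples that are labeled as misinformation."""
    total, misinfo = TOPICS[topic]
    return misinfo / total


# Print the share of misinformation per topic and overall.
for topic in TOPICS:
    print(f"{topic:<10} {misinfo_rate(topic):6.2%}")
print(f"{'Overall':<10} {total_misinfo / total_samples:6.2%}")
```

The totals come out to 360,611 samples and 8,681 misinformation samples, matching the card. Only about 2.4% of all samples carry the misinformation label, so the classes are heavily imbalanced; plain accuracy may be a weak metric for models trained on this data.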
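The Direct Use note asks for the mp3 files to be converted to .wav before use. A minimal sketch of that step follows, assuming the `ffmpeg` command-line tool is installed and on `PATH`. Reading `spmis_data` as the name of the output directory is an assumption, since the card does not say exactly what should be renamed. The helper names and the mirrored directory layout are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def wav_path_for(mp3_path: Path, src_root: Path, dst_root: Path) -> Path:
    """Mirror an mp3 path under dst_root, swapping the extension to .wav."""
    return (dst_root / mp3_path.relative_to(src_root)).with_suffix(".wav")


def ffmpeg_command(mp3_path: Path, wav_path: Path) -> list[str]:
    """Build the ffmpeg call; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def convert_tree(src_root: Path, dst_root: Path) -> int:
    """Convert every .mp3 below src_root; return the number of files converted."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    count = 0
    for mp3_path in sorted(src_root.rglob("*.mp3")):
        wav_path = wav_path_for(mp3_path, src_root, dst_root)
        # Create the mirrored subdirectory before ffmpeg writes into it.
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(ffmpeg_command(mp3_path, wav_path), check=True)
        count += 1
    return count
```

Calling `convert_tree(Path("spmis_mp3"), Path("spmis_data"))` would write the converted files under `spmis_data`, where `spmis_mp3` stands in for wherever the download was unpacked. No sample rate or channel count is forced here because the card does not specify one.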
## Dataset Creation

### Curation Rationale

The dataset was created to provide a resource for training models capable of detecting synthetic spoken misinformation, which is a growing threat in the era of deepfake technologies.

### Source Data

#### Data Collection and Processing

The audio was generated using the Amphion and OpenVoice v2 TTS models, utilizing large-scale public corpora from various sources. The data was curated and processed to ensure a balance between topics and labeled misinformation.

#### Who are the source data producers?

The data was generated using synthetic voices, and no real-world speakers are associated with the content. All voices were created through TTS systems, using speaker embeddings derived from publicly available corpora.

## Bias, Risks, and Limitations

The dataset may not fully represent all types of misinformation, and models trained on this dataset may be biased towards detecting synthetic voices generated by specific TTS systems.

### Recommendations

We recommend using this dataset as part of a larger framework for misinformation detection. It should be combined with real-world data to improve generalization.

## Citation

**BibTeX:**

```bibtex
@inproceedings{liu2024spmis,
  title={SpMis: An Investigation of Synthetic Spoken Misinformation Detection},
  author={Peizhuo Liu and Li Wang and Renqiang He and Haorui He and Lei Wang and Huadi Zheng and Jie Shi and Tong Xiao and Zhizheng Wu},
  booktitle={Proceedings of SLT 2024},
  year={2024},
}
```

Provider:

maas

Created:

2024-10-28
