spmis

ModelScope community. Updated 2025-11-23. Indexed 2024-11-02.
Download link:
https://modelscope.cn/datasets/amphion/spmis
# Dataset Card for SpMis: Synthetic Spoken Misinformation Dataset

The **SpMis Dataset** is designed to facilitate research on detecting synthetic spoken misinformation. It includes 360,611 audio samples synthesized from over 1,000 speakers across five major topics (Politics, Medicine, Education, Laws, and Finance), with 8,681 samples labeled as misinformation.

## Dataset Details

### Dataset Description

This dataset contains synthetic spoken audio clips generated with state-of-the-art TTS models such as Amphion and OpenVoice v2, labeled to indicate whether the speech is genuine or misinformation. It is designed to assist in the development of models capable of detecting both synthetic speech and misinformation.

### Dataset Sources

- **Repository:** https://huggingface.co/datasets/amphion/spmis
- **Paper:** https://arxiv.org/abs/2409.11308

## Uses

### Direct Use

The dataset is intended for training and evaluating models on the following task:

- **Misinformation detection**: Identifying whether the spoken content is intended to mislead.

Before use, convert the MP3 files to WAV and change the file name to `spmis_data`.

## Dataset Structure

The dataset includes:

- **Audio files**: 360,611 TTS-generated speech samples.
- **Labels**: Misinformation, ordinary speech, or synthesized celebrity speech.
- **Metadata**: Speaker identity, topic, duration, and language.

The dataset is divided into the following topics:

1. **Politics**: 76,542 samples, 1,740 labeled as misinformation.
2. **Medicine**: 21,836 samples, 740 labeled as misinformation.
3. **Education**: 177,392 samples, 2,970 labeled as misinformation.
4. **Laws**: 11,422 samples, 862 labeled as misinformation.
5. **Finance**: 53,011 samples, 2,369 labeled as misinformation.
6. **Other**: 20,408 samples with no misinformation labels.

## Dataset Creation

### Curation Rationale

The dataset was created as a resource for training models that detect synthetic spoken misinformation, a growing threat in the era of deepfake technologies.

### Source Data

#### Data Collection and Processing

The audio was generated with the Amphion and OpenVoice v2 TTS models, drawing on large-scale public corpora from various sources. The data was curated and processed to ensure a balance between topics and labeled misinformation.

#### Who are the source data producers?

The data was generated with synthetic voices, and no real-world speakers are associated with the content. All voices were created through TTS systems, using speaker embeddings derived from publicly available corpora.

## Bias, Risks, and Limitations

The dataset may not fully represent all types of misinformation, and models trained on it may be biased towards detecting synthetic voices generated by specific TTS systems.

### Recommendations

We recommend using this dataset as part of a larger framework for misinformation detection. It should be combined with real-world data to improve generalization.

## Citation

**BibTeX:**

```bibtex
@inproceedings{liu2024spmis,
  title={SpMis: An Investigation of Synthetic Spoken Misinformation Detection},
  author={Peizhuo Liu and Li Wang and Renqiang He and Haorui He and Lei Wang and Huadi Zheng and Jie Shi and Tong Xiao and Zhizheng Wu},
  booktitle={Proceedings of SLT 2024},
  year={2024},
}
```
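The per-topic counts listed under Dataset Structure can be cross-checked against the headline figures (360,611 samples, 8,681 labeled as misinformation). The following is a minimal sketch that hard-codes the numbers from the card; the names `TOPIC_COUNTS`, `totals`, and `misinformation_rate` are illustrative and not part of any SpMis tooling.

```python
# (total samples, samples labeled as misinformation) per topic, copied from the card.
TOPIC_COUNTS = {
    "Politics": (76_542, 1_740),
    "Medicine": (21_836, 740),
    "Education": (177_392, 2_970),
    "Laws": (11_422, 862),
    "Finance": (53_011, 2_369),
    "Other": (20_408, 0),  # the card lists no misinformation labels for this group
}


def totals(counts):
    """Return (total samples, total misinformation samples) across all topics."""
    total = sum(n for n, _ in counts.values())
    misinfo = sum(m for _, m in counts.values())
    return total, misinfo


def misinformation_rate(counts):
    """Return the fraction of samples labeled as misinformation for each topic."""
    return {topic: m / n for topic, (n, m) in counts.items()}


total, misinfo = totals(TOPIC_COUNTS)
rates = misinformation_rate(TOPIC_COUNTS)

for topic, rate in rates.items():
    print(f"{topic:<10} {rate:.2%}")
print(f"{'overall':<10} {misinfo / total:.2%}")
```

The sums reproduce the card's totals. The misinformation share is roughly 2.4% overall and ranges from about 1.7% (Education) to about 7.5% (Laws), which is worth keeping in mind when choosing evaluation metrics.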

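The Direct Use note asks for the MP3 files to be converted to WAV. One way to do this is sketched below, assuming the `ffmpeg` command-line tool is installed and on `PATH`. The helper names (`wav_target`, `ffmpeg_command`, `convert_tree`) are hypothetical, and writing the output into a directory named `spmis_data` is only one reading of the card's renaming instruction, not a confirmed layout.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def wav_target(mp3_path: Path, src_dir: Path, out_dir: Path) -> Path:
    """Mirror the path of mp3_path (relative to src_dir) under out_dir, with a .wav suffix."""
    return out_dir / mp3_path.relative_to(src_dir).with_suffix(".wav")


def ffmpeg_command(mp3_path: Path, wav_path: Path) -> List[str]:
    """Build the ffmpeg call; -y overwrites an existing output file."""
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def convert_tree(src_dir: Path, out_dir: Path) -> int:
    """Convert every .mp3 under src_dir to .wav under out_dir; return the number converted."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    converted = 0
    for mp3_path in sorted(src_dir.rglob("*.mp3")):
        wav_path = wav_target(mp3_path, src_dir, out_dir)
        # Create the mirrored subdirectory before ffmpeg writes into it.
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(ffmpeg_command(mp3_path, wav_path), check=True)
        converted += 1
    return converted
```

For example, `convert_tree(Path("raw_mp3"), Path("spmis_data"))` would walk a hypothetical `raw_mp3` download directory and write the converted files under `spmis_data`, preserving the subdirectory layout so that clips with the same file name in different folders do not overwrite each other.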
Provider: maas
Created: 2024-10-28