x_sensitive

Name: x_sensitive
Creator: maas
Published: 2025-12-04 16:53:35
License: No description available

Source: ModelScope community. Updated 2025-12-04. Indexed 2025-11-22.

Download link:

https://modelscope.cn/datasets/cardiffnlp/x_sensitive


Resource description:

***X-Sensitive*** is a multi-label dataset designed to identify sensitive language in social media. It consists of 7 labels and includes a total of 8,000 posts extracted from ***X***. Each post is assigned one or more of the following labels based on its content: ***Drugs, Sex, Conflictual, Spam, Profanity, and Self-harm***. More details are available in the [reference paper](https://arxiv.org/abs/2411.19832).

The goal of ***X-Sensitive*** is to serve as a resource for developing online moderation tools. The following models have been trained on ***X-Sensitive*** with this aim:

- [twitter-roberta-large-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-multilabel)
- [twitter-roberta-base-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-multilabel)

We also provide binary versions of the models, where each post is classified as either sensitive or not-sensitive:

- [twitter-roberta-large-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-binary)
- [twitter-roberta-base-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-binary)

## Dataset Structure

### Data Splits

| Name | #Entries |
|--------------|---------------|
| ***train*** | 5,000 |
| ***test*** | 2,000 |
| ***validation*** | 1,000 |

### Data Instances

An example from `train` looks as follows.

```python
{'#labels': 1,
 'conflictual': 0,
 'conflictual_highlight': array([], dtype=object),
 'drugs': 0,
 'drugs_highlight': array([], dtype=object),
 'keyword': 'fuckin',
 'labels': array(['profanity'], dtype=object),
 'profanity': 1,
 'profanity_highlight': array([array(['fucking'], dtype=object),
                               array(['fucking'], dtype=object),
                               array(['fucking'], dtype=object)], dtype=object),
 'selfharm': 0,
 'selfharm_highlight': array([], dtype=object),
 'sex': 0,
 'sex_highlight': array([], dtype=object),
 'spam': 0,
 'spam_highlight': array([], dtype=object),
 'text': 'i think the idea of aliens is so fucking cool'}
```

### Labels

| Label Number | Label Name | Description |
|--------------|---------------|---------------|
| 0 | conflictual | Conflictual language. An attack based on protected categories (race, color, caste, gender, etc.) or other categories. |
| 1 | profanity | Language containing slurs and profanity, even if they are not directed towards a specific entity. |
| 2 | sex | Sexually explicit content. Pornographic or other types of sexual content. |
| 4 | selfharm | Self-harm. Posts depicting, promoting, or glorifying violence or harm against oneself, such as eating disorders or suicide. |
| 5 | spam | Irrelevant content that is unsolicited. |

## Citation Information

```
@article{antypas2024sensitive,
  title={Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation},
  author={Antypas, Dimosthenis and Sen, Indira and Perez-Almendros, Carla and Camacho-Collados, Jose and Barbieri, Francesco},
  journal={arXiv preprint arXiv:2411.19832},
  year={2024}
}
```
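For readers who want to work with the record layout shown under Data Instances, the following is a minimal sketch of turning the per-label 0/1 columns into a multi-hot vector and collapsing them into a single sensitive/not-sensitive flag. The helper names (`to_multi_hot`, `active_labels`, `is_sensitive`), the column ordering in `LABEL_COLUMNS`, and the rule "sensitive if any label is set" are assumptions made for illustration; they are not taken from the dataset card and may differ from how the binary models were actually trained.

```python
# Label columns as they appear in the example record. The ordering below
# (including the position of "drugs") is an assumption for illustration,
# not an official label index.
LABEL_COLUMNS = ["conflictual", "profanity", "sex", "drugs", "selfharm", "spam"]

# Trimmed copy of the example record; the *_highlight fields are omitted
# and NumPy arrays are replaced by plain lists to keep the sketch dependency-free.
record = {
    "text": "i think the idea of aliens is so fucking cool",
    "keyword": "fuckin",
    "conflictual": 0,
    "drugs": 0,
    "profanity": 1,
    "selfharm": 0,
    "sex": 0,
    "spam": 0,
    "labels": ["profanity"],
    "#labels": 1,
}


def to_multi_hot(row, columns=LABEL_COLUMNS):
    """Return one 0/1 flag per label column, in the order given by `columns`."""
    return [int(row[name]) for name in columns]


def active_labels(row, columns=LABEL_COLUMNS):
    """Return the names of the label columns that are set to 1."""
    return [name for name in columns if row[name] == 1]


def is_sensitive(row, columns=LABEL_COLUMNS):
    """Hypothetical binary rule: a post is sensitive if any label is set."""
    return any(row[name] == 1 for name in columns)


vector = to_multi_hot(record)
print(vector)                 # [0, 1, 0, 0, 0, 0]
print(active_labels(record))  # ['profanity']
print(is_sensitive(record))   # True
```

In this record the number of set flags matches the `#labels` field and the active column names match the `labels` field, which gives a simple consistency check when iterating over a split.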


Provider:

maas

Created:

2025-10-12
