x_sensitive

ModelScope community · Updated 2025-12-04 · Indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/cardiffnlp/x_sensitive
Resource description:
***X-Sensitive*** is a multi-label dataset designed to identify sensitive language in social media. It consists of 7 labels and includes a total of 8,000 posts extracted from ***X***. Each post is assigned one or more of the following labels based on its content: ***Drugs, Sex, Conflictual, Spam, Profanity, and Self-harm***. More details are available in the [reference paper](https://arxiv.org/abs/2411.19832).

The goal of ***X-Sensitive*** is to serve as a resource for developing online moderation tools. The following models have been trained on ***X-Sensitive*** with this aim:

- [twitter-roberta-large-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-multilabel)
- [twitter-roberta-base-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-multilabel)

We also provide binary versions of the models, where each post is classified as either sensitive or not-sensitive:

- [twitter-roberta-large-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-binary)
- [twitter-roberta-base-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-binary)

## Dataset Structure

### Data Splits

| Name | #Entries |
|--------------|---------------|
| ***train*** | 5,000 |
| ***test*** | 2,000 |
| ***validation*** | 1,000 |

### Data Instances

An example from the `train` split looks as follows:

```python
{'#labels': 1,
 'conflictual': 0,
 'conflictual_highlight': array([], dtype=object),
 'drugs': 0,
 'drugs_highlight': array([], dtype=object),
 'keyword': 'fuckin',
 'labels': array(['profanity'], dtype=object),
 'profanity': 1,
 'profanity_highlight': array([array(['fucking'], dtype=object),
                               array(['fucking'], dtype=object),
                               array(['fucking'], dtype=object)], dtype=object),
 'selfharm': 0,
 'selfharm_highlight': array([], dtype=object),
 'sex': 0,
 'sex_highlight': array([], dtype=object),
 'spam': 0,
 'spam_highlight': array([], dtype=object),
 'text': 'i think the idea of aliens is so fucking cool'}
```

### Labels

| Label Number | Label Name | Description |
|--------------|---------------|---------------|
| 0 | conflictual | Conflictual language. An attack based on protected categories (race, color, caste, gender, etc.) or other categories. |
| 1 | profanity | Language containing slurs and profanity, even if not directed towards a specific entity. |
| 2 | sex | Sexually explicit content. Pornographic or other types of sexual content. |
| 4 | selfharm | Self-harm. Posts depicting, promoting, or glorifying violence or harm against oneself, such as eating disorders or suicide. |
| 5 | spam | Irrelevant content that is unsolicited. |

## Citation Information

```
@article{antypas2024sensitive,
  title={Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation},
  author={Antypas, Dimosthenis and Sen, Indira and Perez-Almendros, Carla and Camacho-Collados, Jose and Barbieri, Francesco},
  journal={arXiv preprint arXiv:2411.19832},
  year={2024}
}
```
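
## Usage Sketch

The per-label columns in the data instance above can be turned into a multi-hot target vector for training a multi-label classifier. The following is a minimal sketch, not part of the official dataset tooling. It assumes every split exposes the same six 0/1 columns seen in the example instance, and it orders them alphabetically as in that instance rather than by the label numbers in the table. The helper names `to_multi_hot`, `active_labels`, and `is_consistent` are hypothetical.

```python
# Label columns as they appear in the example instance
# (assumption: every split uses the same six columns).
LABEL_COLUMNS = ["conflictual", "drugs", "profanity", "selfharm", "sex", "spam"]


def to_multi_hot(instance):
    """Return a 0/1 vector ordered as LABEL_COLUMNS."""
    return [int(instance[name]) for name in LABEL_COLUMNS]


def active_labels(instance):
    """Return the names of the label columns set to 1."""
    return [name for name in LABEL_COLUMNS if instance[name] == 1]


def is_consistent(instance):
    """Check the per-label columns against the '#labels' and 'labels' fields."""
    active = active_labels(instance)
    return (len(active) == instance["#labels"]
            and sorted(active) == sorted(instance["labels"]))


# Trimmed copy of the example instance: the text, keyword, and highlight
# fields are omitted, and NumPy arrays are written as plain lists.
example = {
    "#labels": 1,
    "conflictual": 0, "drugs": 0, "profanity": 1,
    "selfharm": 0, "sex": 0, "spam": 0,
    "labels": ["profanity"],
}

print(to_multi_hot(example))   # [0, 0, 1, 0, 0, 0]
print(active_labels(example))  # ['profanity']
print(is_consistent(example))  # True
```

Keeping the column order in a single constant makes it easier to map model outputs back to label names; if a different order is needed (for example, the label numbers in the table), only `LABEL_COLUMNS` has to change.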

Provider:
maas
Created:
2025-10-12