x_sensitive
Source: ModelScope community (updated 2025-12-04, indexed 2025-11-22)
Download link: https://modelscope.cn/datasets/cardiffnlp/x_sensitive
***X-Sensitive*** is a multi-label dataset designed to identify sensitive language in social media.
It consists of 7 labels and includes a total of 8,000 posts extracted from ***X***.
Each post is assigned one or more of the following labels based on its content: ***Drugs, Sex, Conflictual, Spam, Profanity, and Self-harm***.
See the [reference paper](https://arxiv.org/abs/2411.19832) for more details.
The goal of ***X-Sensitive*** is to serve as a valuable resource for developing online moderation tools. The following models have been trained on ***X-Sensitive*** with this aim:
- [twitter-roberta-large-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-multilabel)
- [twitter-roberta-base-sensitive-multilabel](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-multilabel)
We also provide binary versions of the models, where each post is classified as either sensitive or not-sensitive:
- [twitter-roberta-large-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-large-sensitive-binary)
- [twitter-roberta-base-sensitive-binary](https://huggingface.co/cardiffnlp/twitter-roberta-base-sensitive-binary)
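
As a rough illustration of how one of the multi-label models listed above might be used, the sketch below runs the base model through the `transformers` text-classification pipeline and keeps every label whose score reaches a threshold. The 0.5 threshold, the helper names `select_labels` and `classify`, and the choice of sigmoid scoring are assumptions made for this example, not settings documented for these models.

```python
from typing import Dict, List

# Model ID taken from the list above; the threshold below is an assumption.
MODEL_ID = "cardiffnlp/twitter-roberta-base-sensitive-multilabel"


def select_labels(scores: List[Dict], threshold: float = 0.5) -> List[str]:
    """Keep every label whose score reaches the threshold (multi-label decision)."""
    return [s["label"] for s in scores if s["score"] >= threshold]


def classify(texts: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Download the model and return the predicted labels for each text."""
    # Imported lazily so the helper above works without the dependency.
    # Requires: pip install transformers torch
    from transformers import pipeline

    pipe = pipeline("text-classification", model=MODEL_ID)
    # top_k=None returns a score for every label; sigmoid scores each
    # label independently, which suits a multi-label setup.
    all_scores = pipe(texts, top_k=None, function_to_apply="sigmoid")
    return [select_labels(scores, threshold) for scores in all_scores]
```

Calling `classify(["some post text"])` downloads the model weights on first use and returns one list of label names per input text. A post with no label above the threshold yields an empty list.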
## Dataset Structure
### Data Splits
| Name | #Entries |
|--------------|---------------|
| ***train*** | 5,000 |
| ***test*** | 2,000 |
| ***validation*** | 1,000 |
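
The split sizes in the table can be checked against a downloaded copy. The sketch below assumes the dataset is available on the Hugging Face Hub under the ID `cardiffnlp/x_sensitive` (inferred from the ModelScope path, not confirmed here) and that the `datasets` library is installed; `split_fractions` and `load_and_check` are hypothetical helper names.

```python
from typing import Dict

# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 5000, "test": 2000, "validation": 1000}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of entries."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


def load_and_check(dataset_id: str = "cardiffnlp/x_sensitive") -> Dict[str, int]:
    """Load the dataset and return the splits whose size differs from the table."""
    # Imported lazily; requires: pip install datasets
    from datasets import load_dataset

    ds = load_dataset(dataset_id)  # assumed Hub ID, see the note above
    actual = {name: split.num_rows for name, split in ds.items()}
    return {name: n for name, n in actual.items() if EXPECTED_SIZES.get(name) != n}


print(split_fractions(EXPECTED_SIZES))
```

An empty dictionary from `load_and_check()` would mean the downloaded splits match the table.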
### Data Instances
An example from the `train` split looks as follows:
```python
{'#labels': 1,
'conflictual': 0,
'conflictual_highlight': array([], dtype=object),
'drugs': 0,
'drugs_highlight': array([], dtype=object),
'keyword': 'fuckin',
'labels': array(['profanity'], dtype=object),
'profanity': 1,
'profanity_highlight': array([array(['fucking'], dtype=object), array(['fucking'], dtype=object),
array(['fucking'], dtype=object)], dtype=object),
'selfharm': 0,
'selfharm_highlight': array([], dtype=object),
'sex': 0,
'sex_highlight': array([], dtype=object),
'spam': 0,
'spam_highlight': array([], dtype=object),
'text': 'i think the idea of aliens is so fucking cool'}
```
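
For training a multi-label classifier, the per-label columns in an instance like the one above can be collapsed into a multi-hot vector. The following is a minimal sketch: the column order in `LABEL_COLUMNS` is an assumption (in particular the position of `drugs`), the rule that a post counts as sensitive when any label is set is an assumption about how a binary target could be derived, and the highlight fields are omitted from the example instance for brevity.

```python
from typing import Dict, List

# Assumed column order; adjust it to match the label IDs used by your model.
LABEL_COLUMNS = ["conflictual", "profanity", "sex", "drugs", "selfharm", "spam"]


def to_multi_hot(instance: Dict) -> List[int]:
    """Build a 0/1 vector with one position per label column."""
    return [int(instance[column]) for column in LABEL_COLUMNS]


def is_sensitive(instance: Dict) -> bool:
    """Assumed binary rule: a post is sensitive if any label is set."""
    return any(to_multi_hot(instance))


def is_consistent(instance: Dict) -> bool:
    """Check that '#labels' equals the number of active label columns."""
    return sum(to_multi_hot(instance)) == instance["#labels"]


# Simplified copy of the instance above (highlight fields left out).
example = {
    "#labels": 1,
    "conflictual": 0,
    "drugs": 0,
    "profanity": 1,
    "selfharm": 0,
    "sex": 0,
    "spam": 0,
    "labels": ["profanity"],
    "text": "i think the idea of aliens is so fucking cool",
}

print(to_multi_hot(example), is_sensitive(example), is_consistent(example))
```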
### Labels
| Label Number | Label Name | Description |
|--------------|---------------|---------------|
| 0 | conflictual | Conflictual language. An attack based on protected categories (race, color, caste, gender, etc.) or other categories. |
| 1 | profanity | Language containing slurs and profanity, even if they are not directed towards a specific entity. |
| 2 | sex | Sexually explicit content. Pornographic or other types of sexual content. |
| 4 | selfharm | Self-harm. Posts depicting, promoting, or glorifying violence or harm against oneself, such as eating disorders or suicide. |
| 5 | spam | Irrelevant content that is unsolicited. |
## Citation Information
```
@article{antypas2024sensitive,
title={Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation},
author={Antypas, Dimosthenis and Sen, Indira and Perez-Almendros, Carla and Camacho-Collados, Jose and Barbieri, Francesco},
journal={arXiv preprint arXiv:2411.19832},
year={2024}
}
```
Provider: maas
Created: 2025-10-12



