surrey-nlp/Cyberbullying-Detection-CB2
Source: Hugging Face. Updated: 2026-03-02. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/surrey-nlp/Cyberbullying-Detection-CB2
Resource overview:
---
language:
- en
license: unknown
task_categories:
- text-classification
task_ids:
- multi-class-classification
tags:
- cyberbullying
- hate-speech
- social-network
- school
- relational
- conversation
pretty_name: Cyberbullying Detection CB2
size_categories:
- 1K<n<10K
---
# Cyberbullying Detection — CB2
## Dataset Description
**CB2** is a relational, conversation-level cyberbullying detection dataset. Unlike single-post datasets, each instance in CB2 represents a **pair of users** and their full message exchange. The cyberbullying label is assigned at the **conversation level** (i.e., whether the interaction between two users constitutes cyberbullying). Each instance is also enriched with demographic information, a social closeness score (peerness), and message-level aggression statistics.
The dataset was constructed from a school-age online communication study involving students aged 8–17 across 15 anonymised schools.
This dataset is part of the **Cyberbullying-Detection** collection on Hugging Face.
---
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `user1_id` | `int` | Unique ID of User 1 (initiator) |
| `user2_id` | `int` | Unique ID of User 2 (recipient) |
| `user1_age` | `int` | Age of User 1 |
| `user1_gender` | `string` | Gender of User 1 (`Male`, `Female`, `Others`) |
| `user1_grade` | `int` | School grade of User 1 |
| `user2_age` | `int` | Age of User 2 |
| `user2_gender` | `string` | Gender of User 2 (`Male`, `Female`, `Others`) |
| `user2_grade` | `int` | School grade of User 2 |
| `total_messages` | `int` | Total number of messages exchanged between the pair |
| `aggressive_count` | `int` | Number of messages classified as aggressive |
| `intent_to_harm` | `float` | Computed intent-to-harm score (0.0–1.0) |
| `peerness` | `float` | Social closeness / similarity score between the two users (0.0–1.0) |
| `conversation` | `list[dict]` | Ordered list of messages: each entry is `{"message": str, "label": int}` where `label` is `1` (aggressive) or `0` (non-aggressive) |
| `label` | `int` | Binary cyberbullying label: `1` = cyberbullying, `0` = not cyberbullying |
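
The field descriptions above suggest a few consistency checks that can be run on any row. The sketch below is illustrative only. It assumes that `total_messages` equals the length of `conversation` and that `aggressive_count` equals the number of messages with `label == 1`; the card implies both but does not state either. The function name `check_row` and the example row are hypothetical.

```python
def check_row(row):
    """Return a list of human-readable problems found in one CB2-style row."""
    problems = []
    conversation = row["conversation"]

    # Assumption: total_messages counts the entries in `conversation`.
    if row["total_messages"] != len(conversation):
        problems.append("total_messages does not match len(conversation)")

    # Assumption: aggressive_count counts the messages whose label is 1.
    aggressive = sum(1 for turn in conversation if turn["label"] == 1)
    if row["aggressive_count"] != aggressive:
        problems.append("aggressive_count does not match the message labels")

    # Both scores are documented as lying in the range 0.0 to 1.0.
    for name in ("intent_to_harm", "peerness"):
        if not 0.0 <= row[name] <= 1.0:
            problems.append(f"{name} is outside [0.0, 1.0]")

    # The pair-level label is documented as binary.
    if row["label"] not in (0, 1):
        problems.append("label is not 0 or 1")
    return problems


# Made-up row that mirrors the schema; it is not taken from the dataset.
example_row = {
    "user1_id": 1, "user2_id": 2,
    "total_messages": 2, "aggressive_count": 1,
    "intent_to_harm": 0.5, "peerness": 0.5,
    "conversation": [
        {"message": "first message", "label": 1},
        {"message": "second message", "label": 0},
    ],
    "label": 1,
}
print(check_row(example_row))  # prints []
```

If either assumption turns out not to hold for the real data, the corresponding check should be dropped rather than treated as a data error.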
### Label Classes
| Value | Meaning |
|-------|---------|
| `1` | The user-pair interaction constitutes cyberbullying |
| `0` | The user-pair interaction does not constitute cyberbullying |
---
## Source Files
CB2 was assembled from 6 source files:
| File | Role |
|------|------|
| `1. users_data.csv` | Demographic info per user (age, gender, school, grade) |
| `2. peerness_values.csv` | Pairwise social closeness scores |
| `3. Aggressive_All.csv` | Corpus of all aggressive messages (reference) |
| `4. Non_Aggressive_All.csv` | Corpus of all non-aggressive messages (reference) |
| `5. Communication_Data_Among_Users.csv` | Timestamped message log with per-message aggression labels |
| `6. CB_Labels.csv` | **Pivot file** — one row per user-pair with aggregated stats and final CB label |
---
## Dataset Splits
The dataset is split as follows:
| Split | Size | Description |
|-------|------|-------------|
| `train` | 75% of total | Training set (stratified on `label`) |
| `validation` | 2,000 rows | Development / validation set, randomly sampled from the 25% held-out portion |
| `test` | Held-out 25% minus the 2,000 validation rows | Test set |
### Split Methodology
```python
from sklearn.model_selection import train_test_split

# `df` is the assembled CB2 table: one row per user pair, with a binary `label` column.
# Step 1: 75% train, 25% held out for dev + test (stratified on label)
train_df, test_dev_df = train_test_split(df, test_size=0.25, random_state=42, stratify=df["label"])
# Step 2: sample 2,000 held-out rows for dev (plain random sample, not stratified); the rest form the test set
dev_df = test_dev_df.sample(n=2000, random_state=42)
test_df = test_dev_df.drop(dev_df.index)
```
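
The arithmetic behind this procedure can be sketched without the data. Because 2,000 rows are drawn from the 25% held-out portion, the full table must contain roughly 8,000 rows or more, and with fewer than 10,000 rows in total (per `size_categories`) the test set ends up with at most a few hundred rows. The helper `split_sizes` below is hypothetical, and it assumes that scikit-learn rounds a fractional `test_size` up to the next whole row.

```python
import math


def split_sizes(n_total, held_out_fraction=0.25, n_dev=2000):
    """Estimate the row count of each split for the two-step procedure."""
    # Assumption: with a float test_size, scikit-learn rounds the held-out
    # count up and assigns the remaining rows to the training set.
    n_held_out = math.ceil(held_out_fraction * n_total)
    if n_held_out < n_dev:
        raise ValueError("held-out portion is smaller than the validation sample")
    return {
        "train": n_total - n_held_out,
        "validation": n_dev,
        "test": n_held_out - n_dev,
    }


# 9,000 is a hypothetical total, not the actual size of CB2.
print(split_sizes(9000))  # prints {'train': 6750, 'validation': 2000, 'test': 250}
```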
---
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("surrey-nlp/Cyberbullying-Detection-CB2")
# Access splits
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]
# Example row
print(train[0])
# {
# 'user1_id': 1, 'user2_id': 2,
# 'user1_age': 11, 'user1_gender': 'Others', 'user1_grade': 5,
# 'user2_age': 15, 'user2_gender': 'Male', 'user2_grade': 9,
# 'total_messages': 36, 'aggressive_count': 23,
# 'intent_to_harm': 0.769, 'peerness': 0.5,
# 'conversation': [
# {'message': 'bye bye dear bajaj...', 'label': 1},
# {'message': 'Article updated', 'label': 0},
# ...
# ],
# 'label': 1
# }
```
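
Since the label applies to the whole conversation, a text classifier typically needs the message list collapsed into a single input. The following is a minimal sketch of one way to do that, not the authors' preprocessing. The function names and the `" [SEP] "` separator are hypothetical choices, and the code assumes each row exposes `conversation` as a list of dicts, as in the example row above. Note that the share of aggressive messages is not necessarily the same quantity as `intent_to_harm`: in the example row, 23 of 36 messages are aggressive (about 0.64) while `intent_to_harm` is 0.769.

```python
def conversation_to_text(row, separator=" [SEP] "):
    """Join a pair's messages, in their stored order, into one string."""
    return separator.join(turn["message"] for turn in row["conversation"])


def aggressive_ratio(row):
    """Share of messages in the conversation labelled aggressive (label == 1)."""
    turns = row["conversation"]
    if not turns:
        return 0.0
    return sum(turn["label"] == 1 for turn in turns) / len(turns)


# Made-up row that mirrors the schema; it is not taken from the dataset.
row = {
    "conversation": [
        {"message": "first message", "label": 1},
        {"message": "second message", "label": 0},
    ],
    "label": 0,
}
print(conversation_to_text(row))  # prints first message [SEP] second message
print(aggressive_ratio(row))      # prints 0.5
```

With a loaded split, the same functions could be applied through `Dataset.map`, for example `train.map(lambda r: {"text": conversation_to_text(r)})`, provided the rows have the list-of-dicts layout assumed here.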
---
## Construction Notes
- The `conversation` field is built by grouping all messages in `5. Communication_Data_Among_Users.csv` by `(User1 ID, User2 ID)`, sorted by `Date` and `Time`, and stored as a list of `{message, label}` dicts.
- The `peerness` field in the final table comes directly from `6. CB_Labels.csv` (which already incorporates values from `2. peerness_values.csv`).
- User demographics are joined twice (once for User 1, once for User 2) from `1. users_data.csv`.
- Files `3. Aggressive_All.csv` and `4. Non_Aggressive_All.csv` are reference corpora and are **not** directly joined into the final table (their content is already represented via file 5).
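
The assembly steps in these notes can be sketched with pandas on tiny stand-in tables. This is an illustration under stated assumptions, not the script used to build CB2. Only the column names `User1 ID`, `User2 ID`, `Date`, and `Time` appear in the notes; every other column name below (`User ID`, `Age`, `Gender`, `Grade`, `Message`, `Label`) is a hypothetical placeholder, and the date and time values are assumed to sort correctly as strings.

```python
import pandas as pd

# Made-up stand-ins for files 1, 5 and 6 (hypothetical column names, see above).
users = pd.DataFrame({
    "User ID": [1, 2],
    "Age": [11, 15],
    "Gender": ["Others", "Male"],
    "Grade": [5, 9],
})
messages = pd.DataFrame({
    "User1 ID": [1, 1],
    "User2 ID": [2, 2],
    "Date": ["2021-01-02", "2021-01-01"],
    "Time": ["09:00:00", "18:30:00"],
    "Message": ["second message", "first message"],
    "Label": [0, 1],
})
pairs = pd.DataFrame({"User1 ID": [1], "User2 ID": [2], "peerness": [0.5], "label": [1]})

pair_keys = ["User1 ID", "User2 ID"]

# Sort chronologically, then collect each pair's messages into a list of dicts.
ordered = messages.sort_values(pair_keys + ["Date", "Time"])
rows = []
for (user1, user2), group in ordered.groupby(pair_keys, sort=False):
    conversation = [
        {"message": text, "label": int(flag)}
        for text, flag in zip(group["Message"], group["Label"])
    ]
    rows.append({"User1 ID": user1, "User2 ID": user2, "conversation": conversation})
conversations = pd.DataFrame(rows)


def demographics_for(prefix, id_column):
    """Rename the user table so it can be joined for one side of the pair."""
    return users.rename(columns={
        "User ID": id_column,
        "Age": f"{prefix}_age",
        "Gender": f"{prefix}_gender",
        "Grade": f"{prefix}_grade",
    })


# Start from the pivot file, attach conversations, then join demographics twice.
table = (
    pairs.merge(conversations, on=pair_keys, how="left")
    .merge(demographics_for("user1", "User1 ID"), on="User1 ID", how="left")
    .merge(demographics_for("user2", "User2 ID"), on="User2 ID", how="left")
)
print(table.iloc[0].to_dict())
```

Left joins from the pivot table keep exactly one row per user pair, which matches the role the notes assign to `6. CB_Labels.csv`.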
---
## Source Data
The original data comes from the Mendeley Data record "A Comprehensive Dataset for Automated Cyberbullying Detection": https://data.mendeley.com/datasets/wmx9jj2htd/2
## Citation
If you use this dataset, please cite the original Mendeley Data record listed under Source Data.
---
## Dataset Card Authors
Uploaded and curated by [Washii](https://huggingface.co/Washii).
Provider:
surrey-nlp



