surrey-nlp/Cyberbullying-Detection-CB2

Name: surrey-nlp/Cyberbullying-Detection-CB2
Creator: surrey-nlp
Published: 2026-03-02 15:47:06
License: No description available

Source: Hugging Face. Updated 2026-03-02. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/surrey-nlp/Cyberbullying-Detection-CB2

Resource description:

---
language:
- en
license: unknown
task_categories:
- text-classification
task_ids:
- multi-class-classification
tags:
- cyberbullying
- hate-speech
- social-network
- school
- relational
- conversation
pretty_name: Cyberbullying Detection CB2
size_categories:
- 1K<n<10K
---

# Cyberbullying Detection: CB2

## Dataset Description

**CB2** is a relational, conversation-level cyberbullying detection dataset. Unlike single-post datasets, each instance in CB2 represents a **pair of users** and their full message exchange. The cyberbullying label is determined at the **conversation level** (i.e., whether the interaction between two users constitutes cyberbullying), enriched with demographic information, social closeness (peerness), and message-level aggression statistics.

The dataset was constructed from a school-age online communication study involving students aged 8–17 across 15 anonymised schools.

This dataset is part of the **Cyberbullying-Detection** collection on Hugging Face.

---

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `user1_id` | `int` | Unique ID of User 1 (initiator) |
| `user2_id` | `int` | Unique ID of User 2 (recipient) |
| `user1_age` | `int` | Age of User 1 |
| `user1_gender` | `string` | Gender of User 1 (`Male`, `Female`, `Others`) |
| `user1_grade` | `int` | School grade of User 1 |
| `user2_age` | `int` | Age of User 2 |
| `user2_gender` | `string` | Gender of User 2 (`Male`, `Female`, `Others`) |
| `user2_grade` | `int` | School grade of User 2 |
| `total_messages` | `int` | Total number of messages exchanged between the pair |
| `aggressive_count` | `int` | Number of messages classified as aggressive |
| `intent_to_harm` | `float` | Computed intent-to-harm score (0.0–1.0) |
| `peerness` | `float` | Social closeness / similarity score between the two users (0.0–1.0) |
| `conversation` | `list[dict]` | Ordered list of messages: each entry is `{"message": str, "label": int}` where `label` is `1` (aggressive) or `0` (non-aggressive) |
| `label` | `int` | Binary cyberbullying label: `1` = cyberbullying, `0` = not cyberbullying |

### Label Classes

| Value | Meaning |
|-------|---------|
| `1` | The user-pair interaction constitutes cyberbullying |
| `0` | The user-pair interaction does not constitute cyberbullying |

---

## Source Files

CB2 was assembled from 6 source files:

| File | Role |
|------|------|
| `1. users_data.csv` | Demographic info per user (age, gender, school, grade) |
| `2. peerness_values.csv` | Pairwise social closeness scores |
| `3. Aggressive_All.csv` | Corpus of all aggressive messages (reference) |
| `4. Non_Aggressive_All.csv` | Corpus of all non-aggressive messages (reference) |
| `5. Communication_Data_Among_Users.csv` | Timestamped message log with per-message aggression labels |
| `6. CB_Labels.csv` | **Pivot file**: one row per user-pair with aggregated stats and final CB label |

---

## Dataset Splits

The dataset is split as follows:

| Split | Size | Description |
|-------|------|-------------|
| `train` | 75% of total | Training set |
| `validation` | 2,000 rows | Development / validation set (sampled from the 25% held-out portion) |
| `test` | Remaining ~25% minus 2,000 | Test set |

### Split Methodology

```python
from sklearn.model_selection import train_test_split

# Step 1: 75% train, 25% test+dev (stratified on label)
train_df, test_dev_df = train_test_split(df, test_size=0.25, random_state=42, stratify=df["label"])

# Step 2: 2000 rows for dev, rest for test
dev_df = test_dev_df.sample(n=2000, random_state=42)
test_df = test_dev_df.drop(dev_df.index)
```

---

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Washii/Cyberbullying-Detection-CB2")

# Access splits
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]

# Example row
print(train[0])
# {
#   'user1_id': 1, 'user2_id': 2,
#   'user1_age': 11, 'user1_gender': 'Others', 'user1_grade': 5,
#   'user2_age': 15, 'user2_gender': 'Male', 'user2_grade': 9,
#   'total_messages': 36, 'aggressive_count': 23,
#   'intent_to_harm': 0.769, 'peerness': 0.5,
#   'conversation': [
#       {'message': 'bye bye dear bajaj...', 'label': 1},
#       {'message': 'Article updated', 'label': 0},
#       ...
#   ],
#   'label': 1
# }
```

---

## Construction Notes

- The `conversation` field is built by grouping all messages in `5. Communication_Data_Among_Users.csv` by `(User1 ID, User2 ID)`, sorted by `Date` and `Time`, and stored as a list of `{message, label}` dicts.
- The `peerness` field in the final table comes directly from `6. CB_Labels.csv` (which already incorporates values from `2. peerness_values.csv`).
- User demographics are joined twice (once for User 1, once for User 2) from `1. users_data.csv`.
- Files `3. Aggressive_All.csv` and `4. Non_Aggressive_All.csv` are reference corpora and are **not** directly joined into the final table (their content is already represented via file 5).

---

## Source Data

The original data is sourced from https://data.mendeley.com/datasets/wmx9jj2htd/2 (A Comprehensive Dataset for Automated Cyberbullying Detection).

## Citation

If you use this dataset, please cite the original source appropriately.

---

## Dataset Card Authors

Uploaded and curated by [Washii](https://huggingface.co/Washii).
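
## Appendix: Sketch of the Conversation Grouping

The construction notes describe building the `conversation` field by grouping the message log by `(User1 ID, User2 ID)` and ordering it by `Date` and `Time`. The following is a minimal sketch of that step, not the authors' actual script. The column names `Message` and `Label`, the helper `build_conversations`, and the toy log are assumptions made for illustration; the real column names in `5. Communication_Data_Among_Users.csv` may differ. The sketch also assumes `Date` and `Time` are ISO-style strings, so that sorting them as text yields chronological order.

```python
import pandas as pd


def build_conversations(messages: pd.DataFrame) -> pd.DataFrame:
    """Group a message log into one ordered conversation per user pair.

    Assumes hypothetical columns "Message" and "Label" alongside the
    "User1 ID", "User2 ID", "Date" and "Time" columns named in the card.
    """
    # Sort chronologically before grouping so each group keeps message order.
    ordered = messages.sort_values(["Date", "Time"], kind="stable")

    rows = []
    for (user1, user2), group in ordered.groupby(["User1 ID", "User2 ID"]):
        conversation = [
            {"message": str(text), "label": int(flag)}
            for text, flag in zip(group["Message"], group["Label"])
        ]
        rows.append(
            {"user1_id": user1, "user2_id": user2, "conversation": conversation}
        )
    return pd.DataFrame(rows)


# Toy message log (fabricated for illustration only, not taken from CB2).
toy_log = pd.DataFrame(
    {
        "User1 ID": [1, 1, 3, 1],
        "User2 ID": [2, 2, 4, 2],
        "Date": ["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-01"],
        "Time": ["09:00:00", "10:00:00", "08:00:00", "09:30:00"],
        "Message": ["third", "second", "hello", "first"],
        "Label": [1, 0, 0, 1],
    }
)

pairs = build_conversations(toy_log)
print(pairs.loc[0, "conversation"])
```

In the published table the pair-level statistics and the final label come from `6. CB_Labels.csv`, so a full reconstruction would still need to join the grouped conversations onto that pivot file.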

Provided by:

surrey-nlp
