---
annotations_creators:
- found
language_creators:
- found
language:
- pl
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- intent-classification
pretty_name: Poleval 2019 cyberbullying
dataset_info:
- config_name: task01
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 1104322
    num_examples: 10041
  - name: test
    num_bytes: 109681
    num_examples: 1000
  download_size: 410001
  dataset_size: 1214003
- config_name: task02
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
          '2': '2'
  splits:
  - name: train
    num_bytes: 1104322
    num_examples: 10041
  - name: test
    num_bytes: 109681
    num_examples: 1000
  download_size: 410147
  dataset_size: 1214003
---
# Dataset Card for Poleval 2019 cyberbullying
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** http://2019.poleval.pl/index.php/tasks/task6
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Task 6-1: Harmful vs non-harmful

In this task, participants distinguish between normal/non-harmful tweets (class: 0) and tweets that contain any kind of harmful information (class: 1). Harmful information includes cyberbullying, hate speech and related phenomena. The data for the task is available for download from the PolEval 2019 task page.

Task 6-2: Type of harmfulness

In this task, participants distinguish between three classes of tweets: 0 (non-harmful), 1 (cyberbullying) and 2 (hate speech). Definitions of cyberbullying and hate speech vary, and some place the two phenomena in the same group. The specific conditions on which the annotations are based, worked out over ten years of research, will be summarized in an introductory paper for the task. The main and definitive condition for distinguishing the two is whether the harmful action is addressed to a private person or persons (cyberbullying) or to a public person, entity or large group (hate speech).
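
The distinguishing condition described for task 6-2 can be sketched as a small decision rule. This is only an illustration, not the annotators' procedure: the `Target` enum and both function names are hypothetical, and collapsing task 6-2 labels to task 6-1 labels assumes the two label sets are consistent, which this card does not state.

```python
from enum import Enum
from typing import Optional


class Target(Enum):
    """Who the harmful action is addressed to (hypothetical names)."""
    PRIVATE = "private"  # a private person or persons
    PUBLIC = "public"    # a public person, entity or large group


def task_6_2_label(is_harmful: bool, target: Optional[Target] = None) -> int:
    """Return 0 (non-harmful), 1 (cyberbullying) or 2 (hate speech)."""
    if not is_harmful:
        return 0
    if target is None:
        raise ValueError("a harmful tweet needs a target to choose class 1 or 2")
    return 1 if target is Target.PRIVATE else 2


def task_6_1_label(label_6_2: int) -> int:
    """Collapse a task 6-2 label to binary.

    Assumption: classes 1 and 2 both count as harmful (class 1) in task 6-1.
    """
    if label_6_2 not in (0, 1, 2):
        raise ValueError(f"unexpected task 6-2 label: {label_6_2!r}")
    return 0 if label_6_2 == 0 else 1
```

Real annotation decisions depend on the full guidelines from the task paper; the rule above only captures the private versus public distinction quoted in the summary.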
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Polish
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
- `text`: the provided tweet
- `label`: the class of the tweet
  - for task 6-1, the label can be 0 (non-harmful) or 1 (harmful)
  - for task 6-2, the label can be 0 (non-harmful), 1 (cyberbullying) or 2 (hate-speech)
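
The class names stored in the metadata are just the strings `'0'`, `'1'` and `'2'`, so readable names have to be attached by the user. A minimal sketch of such a mapping follows; the dictionaries and the `decode_label` helper are hypothetical and not part of the dataset, while the config names `task01` and `task02` come from the metadata.

```python
# Readable names taken from the field descriptions above.
# Assumption: these dictionaries are illustrative; the dataset ships only '0', '1', '2'.
TASK01_LABELS = {0: "non-harmful", 1: "harmful"}
TASK02_LABELS = {0: "non-harmful", 1: "cyberbullying", 2: "hate-speech"}

LABELS_BY_CONFIG = {"task01": TASK01_LABELS, "task02": TASK02_LABELS}


def decode_label(config_name: str, label: int) -> str:
    """Return the readable class name for an integer label in a given config."""
    try:
        return LABELS_BY_CONFIG[config_name][label]
    except KeyError:
        raise ValueError(
            f"unknown config or label: {config_name!r}, {label!r}"
        ) from None
```

For example, `decode_label("task02", 1)` gives `"cyberbullying"`, while a label outside the config's range raises `ValueError` instead of failing silently.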
### Data Splits
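Each configuration (`task01` and `task02`) has a train split with 10,041 examples and a test split with 1,000 examples.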
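
The metadata at the top of this card lists the same split sizes for both configurations. As a quick sanity check, the figures can be added up to confirm the reported `dataset_size` and to see how small the test split is relative to the whole. The `SPLITS` dictionary below simply restates the metadata; the variable names are illustrative.

```python
# Split statistics copied from the card metadata (identical for task01 and task02).
SPLITS = {
    "train": {"num_bytes": 1104322, "num_examples": 10041},
    "test": {"num_bytes": 109681, "num_examples": 1000},
}

# Totals across both splits.
total_bytes = sum(split["num_bytes"] for split in SPLITS.values())
total_examples = sum(split["num_examples"] for split in SPLITS.values())

# Share of all examples held out for testing.
test_fraction = SPLITS["test"]["num_examples"] / total_examples

print(f"total bytes:    {total_bytes}")
print(f"total examples: {total_examples}")
print(f"test fraction:  {test_fraction:.3f}")
```

The byte total matches the `dataset_size` of 1214003 given in the metadata, and the test split accounts for roughly 9% of the examples.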
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@proceedings{ogr:kob:19:poleval,
  editor = {Maciej Ogrodniczuk and Łukasz Kobyliński},
  title = {{Proceedings of the PolEval 2019 Workshop}},
  year = {2019},
  address = {Warsaw, Poland},
  publisher = {Institute of Computer Science, Polish Academy of Sciences},
  url = {http://2019.poleval.pl/files/poleval2019.pdf},
  isbn = {978-83-63159-28-3}
}
```
### Contributions
Thanks to [@czabo](https://github.com/czabo) for adding this dataset.