strombergnlp/offenseval_2020

Name: strombergnlp/offenseval_2020
Creator: strombergnlp
Published: 2022-05-12 10:04:57
License: No description available

Hugging Face · Updated 2022-05-12 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/strombergnlp/offenseval_2020

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
languages:
- ar
- da
- en
- gr
- tr
licenses:
- cc-by-4.0
multilinguality:
- multilingual
pretty_name: OffensEval 2020
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
- text-classification-other-hate-speech-detection
extra_gated_prompt: "Warning: this repository contains harmful content (abusive language, hate speech)."
paperswithcode_id:
- dkhate
- ogtd
---

# Dataset Card for "offenseval_2020"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://sites.google.com/site/offensevalsharedtask/results-and-paper-submission](https://sites.google.com/site/offensevalsharedtask/results-and-paper-submission)
- **Repository:**
- **Paper:** [https://aclanthology.org/2020.semeval-1.188/](https://aclanthology.org/2020.semeval-1.188/), [https://arxiv.org/abs/2006.07235](https://arxiv.org/abs/2006.07235)
- **Point of Contact:** [Leon Derczynski](https://github.com/leondz)

### Dataset Summary

OffensEval 2020 features a multilingual dataset with five languages. The languages included in OffensEval 2020 are:

* Arabic
* Danish
* English
* Greek
* Turkish

The annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID) and used in OffensEval 2019. This taxonomy breaks offensive content down into three sub-tasks that take the type and target of the offensive content into account. The following sub-tasks were organized:

* Sub-task A - Offensive language identification;
* Sub-task B - Automatic categorization of offense types;
* Sub-task C - Offense target identification.

English training data is omitted and must be obtained separately (see [https://zenodo.org/record/3950379#.XxZ-aFVKipp](https://zenodo.org/record/3950379#.XxZ-aFVKipp)).

The source datasets come from:

* Arabic: [https://arxiv.org/pdf/2004.02192.pdf](https://arxiv.org/pdf/2004.02192.pdf), [https://aclanthology.org/2021.wanlp-1.13/](https://aclanthology.org/2021.wanlp-1.13/)
* Danish: [https://arxiv.org/pdf/1908.04531.pdf](https://arxiv.org/pdf/1908.04531.pdf), [https://aclanthology.org/2020.lrec-1.430/](https://aclanthology.org/2020.lrec-1.430/)
* English: [https://arxiv.org/pdf/2004.14454.pdf](https://arxiv.org/pdf/2004.14454.pdf), [https://aclanthology.org/2021.findings-acl.80.pdf](https://aclanthology.org/2021.findings-acl.80.pdf)
* Greek: [https://arxiv.org/pdf/2003.07459.pdf](https://arxiv.org/pdf/2003.07459.pdf), [https://aclanthology.org/2020.lrec-1.629/](https://aclanthology.org/2020.lrec-1.629/)
* Turkish: [https://aclanthology.org/2020.lrec-1.758/](https://aclanthology.org/2020.lrec-1.758/)

### Supported Tasks and Leaderboards

* [OffensEval 2020](https://sites.google.com/site/offensevalsharedtask/results-and-paper-submission)

### Languages

Five languages are covered. The card labels the codes as BCP-47: `ar;da;en;gr;tr`. Note that `gr` is the config name used here for Greek; the standard BCP-47 code for Greek is `el`.

## Dataset Structure

There are five named configs, one per language:

* `ar` Arabic
* `da` Danish
* `en` English
* `gr` Greek
* `tr` Turkish

The English training data is absent: it consists of 9M tweets that must be rehydrated separately. See [https://zenodo.org/record/3950379#.XxZ-aFVKipp](https://zenodo.org/record/3950379#.XxZ-aFVKipp)

### Data Instances

An example of 'train' looks as follows.

```
{
  'id': '0',
  'text': 'PLACEHOLDER TEXT',
  'subtask_a': 1,
}
```

### Data Fields

- `id`: a `string` feature.
- `text`: a `string`.
- `subtask_a`: whether or not the instance is offensive; `0: NOT, 1: OFF`

### Data Splits

| name |train|test|
|---------|----:|---:|
|ar|7839|1827|
|da|2961|329|
|en|0|3887|
|gr|8743|1544|
|tr|31277|3515|

## Dataset Creation

### Curation Rationale

Collecting data for abusive language classification. The rationale differs for each dataset.

### Source Data

#### Initial Data Collection and Normalization

Varies per language dataset

#### Who are the source language producers?

Social media users

### Annotations

#### Annotation process

Varies per language dataset

#### Who are the annotators?

Varies per language dataset; native speakers

### Personal and Sensitive Information

The data was public at the time of collection. No PII removal has been performed.

## Considerations for Using the Data

### Social Impact of Dataset

The data contains abusive language. It could be misused to develop and propagate offensive language against every target group involved, for example through ableism, racism, sexism, or ageism.

### Discussion of Biases

### Other Known Limitations

## Additional Information

### Dataset Curators

The datasets are curated by each sub-part's paper authors.

### Licensing Information

This data is available and distributed under the Creative Commons Attribution license, CC-BY 4.0.

### Citation Information

```
@inproceedings{zampieri-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({O}ffens{E}val 2020)",
    author = {Zampieri, Marcos and
      Nakov, Preslav and
      Rosenthal, Sara and
      Atanasova, Pepa and
      Karadzhov, Georgi and
      Mubarak, Hamdy and
      Derczynski, Leon and
      Pitenis, Zeses and
      {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://aclanthology.org/2020.semeval-1.188",
    doi = "10.18653/v1/2020.semeval-1.188",
    pages = "1425--1447",
    abstract = "We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.",
}
```

### Contributions

Author-added dataset [@leondz](https://github.com/leondz)

Provided by:

strombergnlp

Original information summary

Dataset overview

Dataset name

Pretty Name: OffensEval 2020

Dataset languages

Languages: Arabic, Danish, English, Greek, Turkish

Dataset license

Licenses: cc-by-4.0

Dataset size

Size Categories: 10K<n<100K

Multilinguality

Multilinguality: multilingual

Dataset tasks

Task Categories: text-classification
Task IDs: hate-speech-detection, text-classification-other-hate-speech-detection

Dataset structure

Data Instances:
- id: string
- text: string
- subtask_a: 0: NOT, 1: OFF
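
The three fields above can be modeled as a small typed record. The following is only a sketch: the names `Instance`, `parse_instance`, and `SUBTASK_A_NAMES` are hypothetical helpers introduced for illustration, and the sample row is the placeholder instance shown in the dataset card.

```python
from dataclasses import dataclass

# Label mapping for `subtask_a` as listed above: 0 is NOT, 1 is OFF.
SUBTASK_A_NAMES = {0: "NOT", 1: "OFF"}


@dataclass(frozen=True)
class Instance:
    """Hypothetical container mirroring the three documented fields."""

    id: str
    text: str
    subtask_a: int

    @property
    def label_name(self) -> str:
        # Translate the integer label into its documented name.
        return SUBTASK_A_NAMES[self.subtask_a]


def parse_instance(row: dict) -> Instance:
    """Validate a raw row and convert it into an Instance."""
    label = row["subtask_a"]
    if label not in SUBTASK_A_NAMES:
        raise ValueError(f"unexpected subtask_a value: {label!r}")
    return Instance(id=str(row["id"]), text=str(row["text"]), subtask_a=int(label))


example = parse_instance({"id": "0", "text": "PLACEHOLDER TEXT", "subtask_a": 1})
print(example.label_name)  # OFF
```

Rejecting unknown label values early keeps a malformed row from silently reaching a training loop.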

Dataset splits

Data Splits:
- ar: train=7839, test=1827
- da: train=2961, test=329
- en: train=0, test=3887
- gr: train=8743, test=1544
- tr: train=31277, test=3515

Dataset creation

Source Data:
- Arabic: https://arxiv.org/pdf/2004.02192.pdf, https://aclanthology.org/2021.wanlp-1.13/
- Danish: https://arxiv.org/pdf/1908.04531.pdf, https://aclanthology.org/2020.lrec-1.430/
- English: https://arxiv.org/pdf/2004.14454.pdf, https://aclanthology.org/2021.findings-acl.80.pdf
- Greek: https://arxiv.org/pdf/2003.07459.pdf, https://aclanthology.org/2020.lrec-1.629/
- Turkish: https://aclanthology.org/2020.lrec-1.758/

Dataset considerations

Warning: Contains harmful content (abusive language, hate speech).

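Collected summary

Dataset introduction

Construction

OffensEval 2020 is the product of a coordinated multilingual research effort in social media content analysis. The dataset brings together text in five languages: Arabic, Danish, English, Greek, and Turkish. All source data come from public social media platforms and were collected and curated separately by a team of experts for each language. The researchers adopted a single hierarchical annotation scheme, inherited from the OLID taxonomy used in OffensEval 2019, which decomposes offensive language identification into three sub-tasks: offensive content identification, offense type categorization, and offense target identification. This keeps the task definitions consistent across languages.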

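Features

The dataset's core features are its multilingual coverage and its fine-grained hierarchical task design. It contains social media text from different cultural backgrounds and language families, which makes it a useful resource for studying what offensive language has in common across languages and where it differs. The annotation follows a three-level taxonomy: beyond deciding whether a text is offensive, it distinguishes the type of offense (targeted or untargeted in the OLID scheme) and the target of the offense (such as an individual or a group). This structured annotation supports natural language understanding research at several levels of granularity, from coarse to fine.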
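
The three-level design described above can be expressed as a consistency check over label triples. This is a sketch under stated assumptions: the label names `TIN`/`UNT` and `IND`/`GRP`/`OTH` come from the OLID scheme rather than from this page, this repository exposes only the sub-task A label, and `is_consistent` is a hypothetical helper.

```python
from typing import Optional

# Label inventories per level. Assumption: names follow the OLID scheme;
# only level A (NOT/OFF) is documented on this page.
LEVEL_A = {"NOT", "OFF"}
LEVEL_B = {"TIN", "UNT"}          # targeted vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}   # individual, group, other


def is_consistent(a: str, b: Optional[str] = None, c: Optional[str] = None) -> bool:
    """Check that deeper labels appear only where the parent label permits them."""
    if a not in LEVEL_A:
        return False
    if a == "NOT":
        # Non-offensive text has no offense type and no target.
        return b is None and c is None
    if b is None:
        # Offensive, but no type annotated: a target label would be orphaned.
        return c is None
    if b not in LEVEL_B:
        return False
    if b == "UNT":
        # Untargeted offense cannot carry a target label.
        return c is None
    return c is None or c in LEVEL_C


print(is_consistent("OFF", "TIN", "GRP"))  # True
print(is_consistent("NOT", "TIN"))         # False
```

The check tolerates missing deeper labels, which matches data that carries only the first level.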

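Usage

In practice, the dataset is mainly used to train and evaluate multilingual offensive content detection models. Users can load a specific language configuration (such as `ar` or `en`) through the Hugging Face platform to obtain the corresponding data. The data are presented in standardized fields containing the text and its sub-task A label. Note that the English training data must be rehydrated by the researcher from the original tweet IDs, following the linked instructions. The dataset suits supervised learning and can support systems that identify and mitigate harmful online content, but users should remain aware of the ethical risks posed by the offensive content in the data itself.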
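
Loading one language configuration might look like the sketch below. It assumes the third-party `datasets` library is installed and that access to the gated repository has been granted; depending on the library version, extra arguments may be needed for script-based datasets. `nonempty_splits` and `load_config` are hypothetical helper names, and the split availability is inferred from the split sizes listed on this page.

```python
REPO_ID = "strombergnlp/offenseval_2020"
CONFIGS = ("ar", "da", "en", "gr", "tr")
# Per the split sizes on this page, English has no training rows here.
CONFIGS_WITHOUT_TRAIN = {"en"}


def nonempty_splits(config: str) -> tuple:
    """Return the splits that contain rows for a config, per the listed sizes."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if config in CONFIGS_WITHOUT_TRAIN:
        return ("test",)
    return ("train", "test")


def load_config(config: str, split: str):
    """Load one split of one language config from the Hugging Face Hub."""
    if split not in nonempty_splits(config):
        raise ValueError(f"split {split!r} has no rows for config {config!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


print(nonempty_splits("en"))  # ('test',)
```

A call such as `load_config("da", "train")` would then fetch the Danish training split, while `load_config("en", "train")` fails fast instead of returning an empty split.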

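Background and challenges

Background overview

In social media content moderation and natural language processing, identifying and classifying offensive language has become a pressing research topic. OffensEval 2020, run as Task 12 of SemEval-2020, was built in 2020 by Marcos Zampieri, Preslav Nakov, Sara Rosenthal, and colleagues to advance the automatic detection of offensive language in multilingual settings. The dataset covers Arabic, Danish, English, Greek, and Turkish and uses a hierarchical annotation scheme that divides the problem into three sub-tasks: offensive language detection, offense type categorization, and offense target identification. It serves as a benchmark resource for cross-lingual hate speech analysis and has supported the application of computational linguistics to social media safety.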

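Current challenges

The dataset addresses multilingual offensive language identification, where the core difficulty is the diversity and ambiguity of offensive expression across languages and cultures; indirect forms such as metaphor and sarcasm make classification harder. Its construction faced several challenges. First, data collection had to balance language coverage against annotation consistency, and the training data for some languages, such as English, must be retrieved again from an external platform, which complicates data integration. Second, annotation relies on the subjective judgment of native speakers, which can introduce cultural bias and label ambiguity. Third, the data themselves contain harmful content, so their research value has to be weighed against the risk of negative social impact.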

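Common scenarios

Typical use cases

In social media content moderation, the OffensEval 2020 dataset is commonly used to build and evaluate multilingual offensive language identification models. Its hierarchical annotation scheme divides offensive content analysis into three sub-tasks (identification, categorization, and target determination), giving researchers a systematic experimental framework. Because it covers Arabic, Danish, English, Greek, and Turkish, it enables cross-lingual comparison and transfer learning, and it has supported finer-grained approaches to sensitive content detection in natural language processing.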

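Derived and related work

A range of research has grown around the dataset, including multi-task learning frameworks for offensive language detection, cross-lingual transfer methods based on pretrained language models, and few-shot learning strategies for low-resource languages. This work advanced the techniques seen at SemEval 2020, led to extended applications of the OLID annotation scheme and to templates for building multilingual hate speech corpora, and has informed later research on fine-grained sentiment analysis and computational social science.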

Recent research on the dataset