SEACrowd/id_abusive

Name: SEACrowd/id_abusive
Creator: SEACrowd
Published: 2024-06-24 13:24:27
License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Source: Hugging Face. Updated 2024-06-24. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/SEACrowd/id_abusive

Resource overview:

The ID_ABUSIVE dataset is a collection of 2,016 informal abusive tweets in the Indonesian language, designed for sentiment analysis NLP tasks. The tweets were crawled from Twitter, then filtered and manually labeled by 20 volunteer annotators. Each tweet is assigned one of three categories: not abusive language, abusive but not offensive, and offensive language.
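The three-category scheme can also be collapsed into a two-label task (not abusive vs. abusive), a variant that the abstract of the paper cited for this dataset reports on. The following is a minimal sketch of that mapping. The label strings are taken from the description above, but whether the released files store labels as these exact strings is an assumption, and the sample labels are hypothetical.

```python
from collections import Counter

# Label names as given in the dataset description. The exact values stored
# in the released files are an assumption made for this sketch.
NOT_ABUSIVE = "not abusive language"
ABUSIVE_NOT_OFFENSIVE = "abusive but not offensive"
OFFENSIVE = "offensive language"

THREE_WAY_LABELS = (NOT_ABUSIVE, ABUSIVE_NOT_OFFENSIVE, OFFENSIVE)


def to_binary(label: str) -> str:
    """Collapse the three-way scheme into 'not abusive' vs. 'abusive'."""
    if label not in THREE_WAY_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return "not abusive" if label == NOT_ABUSIVE else "abusive"


# Hypothetical labels standing in for annotated tweets.
example_labels = [NOT_ABUSIVE, OFFENSIVE, ABUSIVE_NOT_OFFENSIVE, NOT_ABUSIVE]
binary_labels = [to_binary(label) for label in example_labels]
binary_counts = Counter(binary_labels)
```

Raising on unknown labels, rather than silently mapping them to one class, makes a mismatch between these assumed strings and the real files visible immediately.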

Provided by:

SEACrowd

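Summary of original information

Dataset overview

Dataset name

ID_ABUSIVE

Language

Indonesian

Task category

Sentiment analysis

Dataset description

The ID_ABUSIVE dataset is a collection of 2,016 informal abusive tweets in Indonesian, designed for sentiment analysis NLP tasks. The tweets were crawled from Twitter, then filtered and manually labeled by 20 volunteer annotators. Each tweet carries one of three labels: not abusive language, abusive but not offensive, and offensive language.

Supported tasks

Sentiment analysis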

Dataset version

Source version: 1.0.0
SEACrowd version: 2024.06.20

Dataset license

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Citation

If you use the Id Abusive dataset, please cite the following:

@article{IBROHIM2018222,
  title = {A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media},
  journal = {Procedia Computer Science},
  volume = {135},
  pages = {222-229},
  year = {2018},
  note = {The 3rd International Conference on Computer Science and Computational Intelligence (ICCSCI 2018) : Empowering Smart Technology in Digital Era for a Better Life},
  issn = {1877-0509},
  doi = {https://doi.org/10.1016/j.procs.2018.08.169},
  url = {https://www.sciencedirect.com/science/article/pii/S1877050918314583},
  author = {Muhammad Okky Ibrohim and Indra Budi},
  keywords = {abusive language, twitter, machine learning},
  abstract = {Abusive language is an expression (both oral or text) that contains abusive/dirty words or phrases both in the context of jokes, a vulgar sex conservation or to cursing someone. Nowadays many people on the internet (netizens) write and post an abusive language in the social media such as Facebook, Line, Twitter, etc. Detecting an abusive language in social media is a difficult problem to resolve because this problem can not be resolved just use word matching. This paper discusses a preliminaries study for abusive language detection in Indonesian social media and the challenge in developing a system for Indonesian abusive language detection, especially in social media. We also built reported an experiment for abusive language detection on Indonesian tweet using machine learning approach with a simple word n-gram and char n-gram features. We use Naive Bayes, Support Vector Machine, and Random Forest Decision Tree classifier to identify the tweet whether the tweet is a not abusive language, abusive but not offensive, or offensive language. The experiment results show that the Naive Bayes classifier with the combination of word unigram + bigrams features gives the best result i.e. 70.06% of F1 - Score. However, if we classifying the tweet into two labels only (not abusive language and abusive language), all classifier that we used gives a higher result (more than 83% of F1 - Score for every classifier). The dataset in this experiment is available for other researchers that interest to improved this study.}
}

@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv: 2406.10118}
}
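The abstract of the first citation above mentions word n-gram and character n-gram features, with word unigrams plus bigrams giving the best reported result. The following is a minimal sketch of how such features could be extracted in plain Python. The lowercasing, the whitespace tokenization, and the function names are assumptions made for illustration, not the paper's actual preprocessing, and the sample sentence is a hypothetical placeholder.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> list[str]:
    """Return word n-grams using lowercased whitespace tokens (an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> list[str]:
    """Return character n-grams over the lowercased text, spaces included."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def unigram_bigram_features(text: str) -> Counter:
    """Count word unigrams and bigrams together as one bag-of-n-grams vector."""
    return Counter(word_ngrams(text, 1) + word_ngrams(text, 2))


# Hypothetical placeholder sentence, not taken from the dataset.
features = unigram_bigram_features("ini contoh tweet ini")
```

Counts like these would typically be passed to a vectorizer and a classifier such as Naive Bayes; that step is left out here because the exact experimental setup is not described in this document.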

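Collected summary

Dataset introduction

Background and challenges

Background overview

The ID_ABUSIVE dataset contains 2,016 informal abusive tweets in Indonesian and is designed for sentiment analysis tasks. The data comes from Twitter and was manually labeled by 20 volunteers with one of three labels: not abusive language, abusive but not offensive, and offensive language. The dataset supports Indonesian sentiment analysis research and is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.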
