SEACrowd/id_abusive
收藏数据集概述
数据集名称
ID_ABUSIVE
语言
印度尼西亚语
任务类别
情感分析
标签
情感分析
数据集描述
ID_ABUSIVE数据集是一个包含2,016条非正式辱骂性推文的集合,使用印度尼西亚语,专为情感分析自然语言处理任务设计。该数据集从Twitter爬取,然后由20名志愿者注释者手动过滤和标记。数据集标记为三种标签:非辱骂性语言、辱骂性但非冒犯性、冒犯性语言。
支持任务
情感分析
数据集版本
源版本:1.0.0
SEACrowd版本:2024.06.20
数据集许可
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
引用
如果您在使用Id Abusive数据集,请引用以下内容:
@article{IBROHIM2018222, title = {A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media}, journal = {Procedia Computer Science}, volume = {135}, pages = {222-229}, year = {2018}, note = {The 3rd International Conference on Computer Science and Computational Intelligence (ICCSCI 2018) : Empowering Smart Technology in Digital Era for a Better Life}, issn = {1877-0509}, doi = {https://doi.org/10.1016/j.procs.2018.08.169}, url = {https://www.sciencedirect.com/science/article/pii/S1877050918314583}, author = {Muhammad Okky Ibrohim and Indra Budi}, keywords = {abusive language, twitter, machine learning}, abstract = {Abusive language is an expression (both oral or text) that contains abusive/dirty words or phrases both in the context of jokes, a vulgar sex conservation or to cursing someone. Nowadays many people on the internet (netizens) write and post an abusive language in the social media such as Facebook, Line, Twitter, etc. Detecting an abusive language in social media is a difficult problem to resolve because this problem can not be resolved just use word matching. This paper discusses a preliminaries study for abusive language detection in Indonesian social media and the challenge in developing a system for Indonesian abusive language detection, especially in social media. We also built reported an experiment for abusive language detection on Indonesian tweet using machine learning approach with a simple word n-gram and char n-gram features. We use Naive Bayes, Support Vector Machine, and Random Forest Decision Tree classifier to identify the tweet whether the tweet is a not abusive language, abusive but not offensive, or offensive language. The experiment results show that the Naive Bayes classifier with the combination of word unigram + bigrams features gives the best result i.e. 70.06% of F1 - Score. However, if we classifying the tweet into two labels only (not abusive language and abusive language), all classifier that we used gives a higher result (more than 83% of F1 - Score for every classifier). The dataset in this experiment is available for other researchers that interest to improved this study.} }
@article{lovenia2024seacrowd, title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages}, author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya}, year={2024}, eprint={2406.10118}, journal={arXiv preprint arXiv: 2406.10118} }




