SEACrowd/local_id_abusive

Source: Hugging Face. Updated: 2024-06-24. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/SEACrowd/local_id_abusive
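Resource summary:
This dataset is used to detect abusive language and hate speech in tweets that contain Javanese and Sundanese vocabulary. Data collection used the Twitter search API and the Tweepy library: a list of abusive words from Indonesian tweets was translated into Javanese and Sundanese, and the translated words were used as queries to collect tweets. The dataset contains more than 5,000 tweets, which were filtered and annotated to indicate whether each tweet contains hate speech or abusive language.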

This dataset supports abusive language and hate speech detection on Twitter text containing Javanese and Sundanese words. The tweets were collected through the Twitter search API, accessed with the Tweepy library. The queries started from a list of abusive words found in Indonesian tweets. These words were translated into two local Indonesian languages, Javanese and Sundanese, and the translated words were then used as queries to collect tweets containing Indonesian and the local languages. The crawl collected more than 5,000 tweets in total. The crawled data were then filtered to keep tweets that contain vocabulary and/or sentences in Javanese and Sundanese. After filtering, the data were labeled according to whether each tweet contains hate speech and abusive language or not.
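The query-building step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' collection script: the word list `TRANSLATIONS`, the helper `build_queries`, and the `max_len` limit are all hypothetical, and the entries are placeholders rather than real lexicon items. The sketch joins translated terms with the `OR` operator and splits them into several query strings so that each stays under a chosen length.

```python
# Hypothetical mapping from Indonesian seed words to local-language translations.
# The entries are placeholders, not items from the dataset's actual word list.
TRANSLATIONS = {
    "kata_a": {"jav": "tembung_a", "sun": "kecap_a"},
    "kata_b": {"jav": "tembung_b", "sun": "kecap_b"},
}


def build_queries(translations, lang, max_len=500):
    """Group the translated terms for one language into OR-joined query strings.

    max_len is an assumed per-query character limit, not a documented value.
    """
    terms = sorted({entry[lang] for entry in translations.values() if lang in entry})
    queries, current = [], []
    for term in terms:
        candidate = " OR ".join(current + [term])
        if current and len(candidate) > max_len:
            # Adding this term would exceed the limit: close the current query.
            queries.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        queries.append(" OR ".join(current))
    return queries


javanese_queries = build_queries(TRANSLATIONS, "jav")
sundanese_queries = build_queries(TRANSLATIONS, "sun")
```

Each resulting string could then be passed to whatever search call the collection script uses; that call is deliberately left out here because the source does not say which Tweepy method or API version was used.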
Provider:
SEACrowd
Original information summary

Dataset overview

Languages

  • Javanese (jav)
  • Sundanese (sun)

Task categories

  • Aspect Based Sentiment Analysis

Dataset description

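This dataset is used for detecting abusive language and hate speech in Twitter text that contains Javanese and Sundanese words. The data were collected through the Twitter search API, accessed with the Tweepy library. A list of abusive words from Indonesian tweets was translated into Javanese and Sundanese, and the translated words were used as queries to collect tweets containing Indonesian and the local languages. The translation process involved native speakers of each local language. The crawl collected more than 5,000 tweets in total. The crawled data were then filtered to keep tweets that contain vocabulary and/or sentences in Javanese and Sundanese. After filtering, the data were labeled according to whether they contain hate speech and abusive language.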
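The filtering step can be illustrated with a minimal sketch. It assumes a simple vocabulary lookup, which the source does not confirm as the actual method; `LOCAL_VOCAB`, `detect_local_languages`, and `filter_tweets` are hypothetical names, and the vocabulary entries are placeholders. Tweets with no matching local-language token are dropped, and the kept ones receive an empty `label` field for later annotation.

```python
import re

# Hypothetical placeholder vocabularies; the real filtering lexicon is not given in the source.
LOCAL_VOCAB = {
    "jav": {"tembung_a", "tembung_b"},
    "sun": {"kecap_a", "kecap_b"},
}

TOKEN_RE = re.compile(r"\w+")


def detect_local_languages(text, vocab=LOCAL_VOCAB):
    """Return the sorted language codes whose vocabulary shares a token with the text."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return sorted(lang for lang, words in vocab.items() if tokens & words)


def filter_tweets(tweets, vocab=LOCAL_VOCAB):
    """Keep tweets containing local vocabulary; 'label' stays None until annotated."""
    kept = []
    for text in tweets:
        langs = detect_local_languages(text, vocab)
        if langs:
            kept.append({"text": text, "languages": langs, "label": None})
    return kept


sample = ["ini tembung_a saja", "hello world", "Kecap_A dan tembung_b"]
filtered = filter_tweets(sample)
```

A lookup like this only catches exact token matches. Spelling variants, which are common in tweets, would need normalization or fuzzy matching on top of it.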

Dataset versions

  • Source version: 1.0.0
  • SEACrowd version: 2024.06.20

Dataset license

  • Unknown

Citation

If you use the Local Id Abusive dataset, please cite the following:

@inproceedings{putri2021abusive,
  title={Abusive language and hate speech detection for Javanese and Sundanese languages in tweets: Dataset and preliminary study},
  author={Putri, Shofianina Dwi Ananda and Ibrohim, Muhammad Okky and Budi, Indra},
  booktitle={2021 11th International Workshop on Computer Science and Engineering, WCSE 2021},
  pages={461--465},
  year={2021},
  organization={International Workshop on Computer Science and Engineering (WCSE)},
  abstract={Indonesia’s demography as an archipelago with lots of tribes and local languages added variances in their communication style. Every region in Indonesia has its own distinct culture, accents, and languages. The demographical condition can influence the characteristic of the language used in social media, such as Twitter. It can be found that Indonesian uses their own local language for communicating and expressing their mind in tweets. Nowadays, research about identifying hate speech and abusive language has become an attractive and developing topic. Moreover, the research related to Indonesian local languages still rarely encountered. This paper analyzes the use of machine learning approaches such as Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT) in detecting hate speech and abusive language in Sundanese and Javanese as Indonesian local languages. The classifiers were used with the several term weightings features, such as word n-grams and char n-grams. The experiments are evaluated using the F-measure. It achieves over 60 % for both local languages.}
}

@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv: 2406.10118}
}
