SEACrowd/local_id_abusive
Dataset Overview
Languages
- Javanese (jav)
- Sundanese (sun)
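Task Categories
- Aspect Based Sentiment Analysis
Dataset Description
This dataset is intended for abusive language and hate speech detection, using Twitter text that contains Javanese and Sundanese words. The tweets were collected through the Twitter Search API, accessed via the Tweepy library. The queries were built from a list of Indonesian abusive words, which native speakers of each local language translated into Javanese and Sundanese; the translated words were then used as search terms to collect tweets containing Indonesian and the local languages. The crawl yielded more than 5000 tweets. The crawled data were then filtered to keep only tweets containing Javanese or Sundanese words and/or sentences. After filtering, each tweet was labeled for whether it contains hate speech and abusive language.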
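The filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the vocabularies in `LOCAL_VOCAB` are a handful of everyday words chosen purely for illustration, and the function names `tokenize`, `detect_local_languages`, and `filter_tweets` are hypothetical. The sketch only shows the general idea of keeping tweets that contain at least one word from a local-language word list.

```python
import re

# Illustrative vocabularies only (assumption): a few everyday words,
# not the word lists used by the dataset authors.
LOCAL_VOCAB = {
    "jav": {"aku", "kowe", "ora"},
    "sun": {"abdi", "maneh", "henteu"},
}


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def detect_local_languages(text, vocab=LOCAL_VOCAB):
    """Return the set of language codes whose vocabulary appears in the text."""
    tokens = set(tokenize(text))
    return {lang for lang, words in vocab.items() if tokens & words}


def filter_tweets(tweets, vocab=LOCAL_VOCAB):
    """Keep only tweets that contain at least one local-language word."""
    kept = []
    for tweet in tweets:
        langs = detect_local_languages(tweet, vocab)
        if langs:
            kept.append({"text": tweet, "languages": sorted(langs)})
    return kept
```

Calling `filter_tweets` on a list of raw tweet strings returns one dictionary per retained tweet, with the matched language codes attached. A real pipeline would likely need much larger word lists and some handling of words shared between Indonesian and the local languages.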
Dataset Version
- Source version: 1.0.0
- SEACrowd version: 2024.06.20
Dataset License
- Unknown
Citation
If you use the Local Id Abusive dataset, please cite the following:
@inproceedings{putri2021abusive,
  title={Abusive language and hate speech detection for Javanese and Sundanese languages in tweets: Dataset and preliminary study},
  author={Putri, Shofianina Dwi Ananda and Ibrohim, Muhammad Okky and Budi, Indra},
  booktitle={2021 11th International Workshop on Computer Science and Engineering, WCSE 2021},
  pages={461--465},
  year={2021},
  organization={International Workshop on Computer Science and Engineering (WCSE)},
  abstract={Indonesia’s demography as an archipelago with lots of tribes and local languages added variances in their communication style. Every region in Indonesia has its own distinct culture, accents, and languages. The demographical condition can influence the characteristic of the language used in social media, such as Twitter. It can be found that Indonesian uses their own local language for communicating and expressing their mind in tweets. Nowadays, research about identifying hate speech and abusive language has become an attractive and developing topic. Moreover, the research related to Indonesian local languages still rarely encountered. This paper analyzes the use of machine learning approaches such as Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT) in detecting hate speech and abusive language in Sundanese and Javanese as Indonesian local languages. The classifiers were used with the several term weightings features, such as word n-grams and char n-grams. The experiments are evaluated using the F-measure. It achieves over 60 % for both local languages.}
}
@article{lovenia2024seacrowd,
  title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
  author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
  year={2024},
  eprint={2406.10118},
  journal={arXiv preprint arXiv: 2406.10118}
}



