Amharic Social Media Dataset for Hate Speech Detection and Classification in Amharic Text with Deep Learning
Source: DataCite Commons | Updated: 2025-05-01 | Indexed: 2025-05-17
Download link:
https://data.mendeley.com/datasets/p74pfhz3yx
Description:
This dataset is prepared for hate speech detection and classification into normal speech and four categories of hate speech: racial, religious, gender, and disability hate speech. The dataset was collected from three social media sites: Facebook, Twitter, and YouTube. The collection was done automatically, and the data was annotated by human annotators. The dataset covers the Amharic language only.
To make the annotation process clear, we developed and prepared an annotation guideline. The annotation was carried out in two rounds. The first round was done by 100 annotators with different demographic and sociocultural backgrounds. Before the annotation started, in addition to the guideline, the annotators were given a brief introduction covering:
● What hate speech is
● Social media and hate speech
● Impact of hate speech
● Types of hate speech
● How to control hate speech
● How to use the annotation website system to annotate the hate speech dataset
To start the annotation process, the annotators have to sign up and log in to our custom-built annotation tool (https://annotate.shegerapps.com), called “Amharic Hate Speech Annotation Tool”. As the schema in Figure 5.2 shows, the annotation tool, which is a website, has a database with ten tables. Eight of the tables hold the annotated (labeled) dataset. Of the remaining two, one holds users (annotators, curators, and admins) for authentication purposes, and the tenth holds the raw data. The raw data table is a container into which the dataset to be annotated is dumped; the annotators fetch data from this table, and once a text is annotated it is inserted into the corresponding one of the eight tables. In this annotation stage we annotated texts in eight categories, but this research needs only four of them. We included the other four hate speech categories for future studies, so that any interested researcher, or we ourselves, can continue the research without a new annotation effort. The annotation tool's database is MySQL, the backend is developed in PHP, and the frontend is built with HTML, JavaScript, and jQuery.
The annotation tool offers several advantages: it creates an efficient team-based annotation experience, maintains control over data preparation, manages annotators’ tasks and their progress, and makes exporting the annotated dataset easier.
After the annotation is finalized, the dataset is given to the respective model as input in CSV format. During training, the data is split into three parts with an 80:10:10 ratio for training, validation, and testing.
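The flow from the raw data table into the per-category tables can be sketched as follows. This is only an illustration under stated assumptions: the table names, column names, and the `annotate` helper are hypothetical, SQLite stands in for the MySQL database, and only four of the eight category tables are created for brevity.

```python
import sqlite3

# Hypothetical names throughout; the real tool uses MySQL and eight category tables.
CATEGORIES = ["racial", "religious", "gender", "disability"]

def annotate(conn, raw_id, category, annotator):
    """Copy one raw text into the table for the chosen category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    row = conn.execute(
        "SELECT text FROM raw_data WHERE id = ?", (raw_id,)
    ).fetchone()
    if row is None:
        raise KeyError(raw_id)
    # The table name comes from the fixed CATEGORIES list, never from user input.
    conn.execute(
        f"INSERT INTO {category}_annotations (raw_id, text, annotator) "
        "VALUES (?, ?, ?)",
        (raw_id, row[0], annotator),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_data (id INTEGER PRIMARY KEY, text TEXT)")
for name in CATEGORIES:
    conn.execute(
        f"CREATE TABLE {name}_annotations "
        "(raw_id INTEGER, text TEXT, annotator TEXT)"
    )

# Dump one raw text, then label it as an annotator would.
conn.execute("INSERT INTO raw_data (text) VALUES ('example post')")
annotate(conn, 1, "religious", "annotator_1")
stored = conn.execute(
    "SELECT raw_id, text, annotator FROM religious_annotations"
).fetchall()
```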
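As a minimal sketch of an 80:10:10 split, the function below shuffles the rows and cuts them into three parts. The description does not say whether the original split was shuffled or stratified by label, so the seeded shuffle, the function name `split_80_10_10`, and the placeholder rows are all assumptions for illustration.

```python
import random

def split_80_10_10(rows, seed=0):
    """Shuffle rows and split them into train/validation/test parts (80:10:10)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Placeholder (text, label) rows standing in for the rows of the CSV file.
rows = [(f"text {i}", "normal") for i in range(100)]
train, val, test = split_80_10_10(rows)
```

In practice the rows would be read from the exported CSV file, and a stratified split may be preferable if the classes are imbalanced.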
Provider:
Mendeley
Created:
2022-08-12



