
hate-speech-filipino/hate_speech_filipino

Hugging Face | Updated: 2024-01-18 | Indexed: 2024-06-15
Download link:
https://hf-mirror.com/datasets/hate-speech-filipino/hate_speech_filipino
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- crowdsourced
language:
- tl
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|other-twitter-data-philippine-election
task_categories:
- text-classification
task_ids:
- sentiment-analysis
pretty_name: Hate Speech in Filipino
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 995919
    num_examples: 10000
  - name: test
    num_bytes: 995919
    num_examples: 10000
  - name: validation
    num_bytes: 424365
    num_examples: 4232
  download_size: 822927
  dataset_size: 2416203
---

# Dataset Card for Hate Speech in Filipino

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Hate Speech Dataset in Filipino homepage](https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks)
- **Repository:** [Hate Speech Dataset in Filipino homepage](https://github.com/jcblaisecruz02/Filipino-Text-Benchmarks)
- **Paper:** [PCJ paper](https://pcj.csp.org.ph/index.php/pcj/issue/download/29/PCJ%20V14%20N1%20pp1-14%202019)
- **Leaderboard:**
- **Point of Contact:** [Jan Christian Cruz](mailto:jan_christian_cruz@dlsu.edu.ph)

### Dataset Summary

Contains 10k tweets (training set) that are labeled as hate speech or non-hate speech. Released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is primarily in Filipino, with the addition of some English words commonly used in Filipino vernacular.

## Dataset Structure

### Data Instances

Sample data:

```
{
  "text": "Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN",
  "label": 1
}
```

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

This study seeks to contribute to the filling of this gap through the development of a model that can automate hate speech detection and classification in Philippine election-related tweets. The role of the microblogging site Twitter as a platform for the expression of support and hate during the 2016 Philippine presidential election has been supported in news reports and systematic studies. Thus, the particular question addressed in this paper is: Can existing techniques in language processing and machine learning be applied to detect hate speech in the Philippine election context?

### Source Data

#### Initial Data Collection and Normalization

The dataset used in this study was a subset of the corpus 1,696,613 tweets crawled by Andrade et al. and posted from November 2015 to May 2016 during the campaign period for the Philippine presidential election. They were culled based on the presence of candidate names (e.g., Binay, Duterte, Poe, Roxas, and Santiago) and election-related hashtags (e.g., #Halalan2016, #Eleksyon2016, and #PiliPinas2016).

Data preprocessing was performed to prepare the tweets for feature extraction and classification. It consisted of the following steps: data de-identification, uniform resource locator (URL) removal, special character processing, normalization, hashtag processing, and tokenization.

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[Jan Christian Cruz](mailto:jan_christian_cruz@dlsu.edu.ph)

### Licensing Information

[More Information Needed]

### Citation Information

@article{Cabasag-2019-hate-speech,
  title={Hate speech in Philippine election-related tweets: Automatic detection and classification using natural language processing.},
  author={Neil Vicente Cabasag, Vicente Raphael Chan, Sean Christian Lim, Mark Edward Gonzales, and Charibeth Cheng},
  journal={Philippine Computing Journal},
  volume={XIV},
  number={1},
  month={August},
  year={2019}
}

### Contributions

Thanks to [@anaerobeth](https://github.com/anaerobeth) for adding this dataset.
Provider:
hate-speech-filipino
Summary of original information

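Dataset overview

Dataset description

Dataset summary

Contains 10,000 tweets (training set) labeled as hate speech or non-hate speech, along with 4,232 validation samples and 4,232 test samples. The data was collected during the 2016 Philippine presidential election.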

Supported tasks and leaderboards

  • Task category: text classification
  • Task ID: sentiment analysis

Languages

The dataset is primarily in Filipino, with some commonly used English words.

Dataset structure

Data instances

Sample data:
{ "text": "Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN", "label": 1 }

Data fields

  • text: string, the tweet content
  • label: class label with two classes, 0 and 1
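
The two fields listed above can be checked with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: `parse_record` and `ASSUMED_LABEL_NAMES` are hypothetical names, and reading label 1 as "hate speech" is an assumption inferred from the sample record, since the card only names the classes '0' and '1'.

```python
import json

# Assumption: the card only names the classes '0' and '1'. Reading 1 as
# "hate speech" is inferred from the sample record, not documented.
ASSUMED_LABEL_NAMES = {0: "non-hate speech", 1: "hate speech"}


def parse_record(raw: str) -> dict:
    """Parse one JSON record and check it against the two listed fields."""
    record = json.loads(raw)
    if not isinstance(record.get("text"), str):
        raise ValueError("'text' must be a string")
    label = record.get("label")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(label, int) or isinstance(label, bool) or label not in (0, 1):
        raise ValueError("'label' must be the integer 0 or 1")
    return record


# The sample record shown in the listing.
sample = '{"text": "Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN", "label": 1}'
record = parse_record(sample)
print(record["label"], ASSUMED_LABEL_NAMES[record["label"]])
```

A record with any other label value, or with a non-string `text`, raises `ValueError` instead of passing through silently.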

Data splits

  • Training set: 10,000 samples, 995,919 bytes
  • Test set: 10,000 samples, 995,919 bytes
  • Validation set: 4,232 samples, 424,365 bytes

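Dataset creation

Source data

The dataset is a subset of the 1,696,613 tweets crawled by Andrade et al., posted from November 2015 to May 2016 during the campaign period for the Philippine presidential election. Tweets were selected based on the presence of candidate names (e.g., Binay, Duterte, Poe, Roxas, Santiago) and election-related hashtags (e.g., #Halalan2016, #Eleksyon2016, #PiliPinas2016).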
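
The selection rule described above can be illustrated with a simple keyword filter. This is only a sketch under stated assumptions: the term lists contain just the examples quoted in the text (the full lists used for the crawl are not given), `is_election_related` is a hypothetical helper, and treating a match on either a name or a hashtag as sufficient is an assumption.

```python
import re

# Only the example terms quoted in the text; the complete lists used for
# the original crawl are not given, so these are illustrative.
CANDIDATE_NAMES = {"binay", "duterte", "poe", "roxas", "santiago"}
ELECTION_HASHTAGS = {"#halalan2016", "#eleksyon2016", "#pilipinas2016"}


def is_election_related(tweet: str) -> bool:
    """Return True if the tweet contains a listed name or hashtag.

    Assumption: a match on either list is enough. Matching is done on whole
    tokens so that a short name such as "Poe" does not match "poem".
    """
    tokens = re.findall(r"#?\w+", tweet.lower())
    return any(t in CANDIDATE_NAMES or t in ELECTION_HASHTAGS for t in tokens)


print(is_election_related("Taas ni Mar Roxas ah. KULTONG DILAW NGA NAMAN"))
print(is_election_related("I wrote a poem this morning"))
```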

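Data preprocessing

Data preprocessing consisted of the following steps: data de-identification, uniform resource locator (URL) removal, special character processing, normalization, hashtag processing, and tokenization.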
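
The listing names these steps but does not say how each was implemented, so the following is a hedged sketch of one plausible pipeline rather than the authors' method. Every concrete choice here is an assumption: replacing @mentions with a `USER` placeholder, stripping punctuation, lowercasing as the normalization step, dropping the `#` from hashtags, and splitting on whitespace. The input tweet, handle, and URL are made up for illustration.

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPECIAL_RE = re.compile(r"[^\w\s#]")  # keep '#' for the later hashtag step
HASHTAG_RE = re.compile(r"#(\w+)")


def preprocess(tweet: str) -> list[str]:
    """Apply the six named steps in order; each rule is an assumed stand-in."""
    text = MENTION_RE.sub("USER", tweet)   # 1. de-identification
    text = URL_RE.sub(" ", text)           # 2. URL removal
    text = SPECIAL_RE.sub(" ", text)       # 3. special character processing
    text = text.lower()                    # 4. normalization (lowercasing only)
    text = HASHTAG_RE.sub(r"\1", text)     # 5. hashtag processing: drop '#'
    return text.split()                    # 6. tokenization on whitespace


# Made-up tweet for illustration (not taken from the dataset).
tokens = preprocess("@juan Taas ni Mar Roxas ah! #Halalan2016 https://t.co/abc123")
print(tokens)
# ['user', 'taas', 'ni', 'mar', 'roxas', 'ah', 'halalan2016']
```

Keeping the steps as separate substitutions makes it easy to swap any one of them for a rule that better matches the original paper once its details are known.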

Additional information

Dataset license

  • License: unknown

Citation information

@article{Cabasag-2019-hate-speech, title={Hate speech in Philippine election-related tweets: Automatic detection and classification using natural language processing.}, author={Neil Vicente Cabasag, Vicente Raphael Chan, Sean Christian Lim, Mark Edward Gonzales, and Charibeth Cheng}, journal={Philippine Computing Journal}, volume={XIV}, number={1}, month={August}, year={2019} }
