ComMA Project

Name: ComMA Project
Creator: 潘利语义处理有限公司
Published: 2020-03-17 04:19:21
License: No description available

arXiv | Updated: 2020-03-17 | Indexed: 2024-06-21

Download link:

https://sites.google.com/view/trac2/shared-task

Resource overview:

The ComMA project is a multilingual annotated corpus of sexist and aggressive content in Indian English, Hindi, and Indian Bengali. It contains over 25,000 comments collected from YouTube videos, each annotated along two dimensions: aggression (overtly aggressive, covertly aggressive, or non-aggressive) and sexism (gendered or non-gendered). Its creation involved carefully chosen data collection sources, a dedicated annotation tagset, and a number of issues and challenges that arose during annotation. The corpus is intended to support the automatic identification of sexist and communal content on social media, through the development of classifiers that detect sexism in the three languages.

Provided by:

潘利语义处理有限公司

Created:

2020-03-17
