ruanchaves/hatebr_por_Latn_to_eng_Latn

Name: ruanchaves/hatebr_por_Latn_to_eng_Latn
Creator: ruanchaves
Published: 2023-04-22 19:12:04
License: not specified

Source: Hugging Face. Updated 2023-04-22. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ruanchaves/hatebr_por_Latn_to_eng_Latn


Resource overview:

---
dataset_info:
  features:
  - name: instagram_comments
    dtype: string
  - name: offensive_language
    dtype: bool
  - name: offensiveness_levels
    dtype: int32
  - name: antisemitism
    dtype: bool
  - name: apology_for_the_dictatorship
    dtype: bool
  - name: fatphobia
    dtype: bool
  - name: homophobia
    dtype: bool
  - name: partyism
    dtype: bool
  - name: racism
    dtype: bool
  - name: religious_intolerance
    dtype: bool
  - name: sexism
    dtype: bool
  - name: xenophobia
    dtype: bool
  - name: offensive_&_non-hate_speech
    dtype: bool
  - name: non-offensive
    dtype: bool
  - name: specialist_1_hate_speech
    dtype: bool
  - name: specialist_2_hate_speech
    dtype: bool
  - name: specialist_3_hate_speech
    dtype: bool
  splits:
  - name: train
    num_bytes: 391589
    num_examples: 4480
  - name: validation
    num_bytes: 86759
    num_examples: 1120
  - name: test
    num_bytes: 111044
    num_examples: 1400
  download_size: 0
  dataset_size: 589392
---
# Dataset Card for "hatebr_por_Latn_to_eng_Latn"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)


Provider:

ruanchaves

Summary of original information

Dataset overview

Dataset features

instagram_comments: string
offensive_language: bool
offensiveness_levels: integer (int32)
antisemitism: bool
apology_for_the_dictatorship: bool
fatphobia: bool
homophobia: bool
partyism: bool
racism: bool
religious_intolerance: bool
sexism: bool
xenophobia: bool
offensive_&_non-hate_speech: bool
non-offensive: bool
specialist_1_hate_speech: bool
specialist_2_hate_speech: bool
specialist_3_hate_speech: bool
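
The feature list above can be turned into a small row validator. The following is a minimal sketch, not part of the original card: the column names and dtypes are copied from the list, while the mapping of `string`, `bool`, and `int32` to Python's `str`, `bool`, and `int`, and the helper name `validate_row`, are assumptions made for illustration.

```python
# Column names and dtypes come from the feature list; the Python type mapping
# (string -> str, bool -> bool, int32 -> int) is an assumption of this sketch.
EXPECTED_SCHEMA = {
    "instagram_comments": str,
    "offensive_language": bool,
    "offensiveness_levels": int,
    "antisemitism": bool,
    "apology_for_the_dictatorship": bool,
    "fatphobia": bool,
    "homophobia": bool,
    "partyism": bool,
    "racism": bool,
    "religious_intolerance": bool,
    "sexism": bool,
    "xenophobia": bool,
    "offensive_&_non-hate_speech": bool,
    "non-offensive": bool,
    "specialist_1_hate_speech": bool,
    "specialist_2_hate_speech": bool,
    "specialist_3_hate_speech": bool,
}


def validate_row(row):
    """Return a list of problems found in one row (empty list if the row is fine)."""
    problems = []
    for column, expected in EXPECTED_SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        # bool is a subclass of int in Python, so reject True/False in int columns.
        if expected is int and isinstance(value, bool):
            problems.append(f"{column}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

Because two column names (`offensive_&_non-hate_speech` and `non-offensive`) are not valid Python identifiers, the sketch keeps rows as plain dictionaries rather than dataclasses.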

Dataset splits

Train:
- Size: 391589 bytes
- Examples: 4480
Validation:
- Size: 86759 bytes
- Examples: 1120
Test:
- Size: 111044 bytes
- Examples: 1400

Dataset size

Download size: 0 bytes
Total dataset size: 589392 bytes
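
As a quick arithmetic check on the figures above, the sketch below recomputes the totals and each split's share of the examples. The numbers are copied from the card; the names `SPLITS` and `split_summary` are introduced here only for illustration.

```python
# Split sizes as reported in the dataset card.
SPLITS = {
    "train": {"num_bytes": 391589, "num_examples": 4480},
    "validation": {"num_bytes": 86759, "num_examples": 1120},
    "test": {"num_bytes": 111044, "num_examples": 1400},
}
REPORTED_DATASET_SIZE = 589392  # dataset_size from the card, in bytes

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())


def split_summary():
    """Return each split's share of all examples and its mean bytes per example."""
    return {
        name: {
            "share": s["num_examples"] / total_examples,
            "bytes_per_example": s["num_bytes"] / s["num_examples"],
        }
        for name, s in SPLITS.items()
    }


if __name__ == "__main__":
    print(f"examples: {total_examples}, bytes: {total_bytes}")
    for name, stats in split_summary().items():
        print(f"{name}: {stats['share']:.0%} of examples, "
              f"{stats['bytes_per_example']:.1f} bytes/example")
```

The three splits add up to 7,000 examples in a 64/16/20 ratio, and the per-split byte counts sum to the reported total of 589392 bytes.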

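Collected summary

Dataset introduction

Construction

In the field of social-media content analysis, the HateBR dataset was built by systematically collecting Instagram comments in Brazilian Portuguese (por_Latn). Collection focused on identifying hate speech, and each comment was annotated by several specialists along multiple dimensions: offensive language, type of hate speech, and non-offensive content. The annotation scheme combines Boolean flags with a graded offensiveness level, which gives fine-grained, reliable categories and a structured basis for training machine-learning models.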

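Characteristics

The dataset's defining characteristic is multi-label classification. Each comment is marked not only for offensive language but also for nine specific hate-speech categories (antisemitism, apology for the dictatorship, fatphobia, homophobia, partyism, racism, religious intolerance, sexism, and xenophobia), together with an offensiveness level. This fine-grained annotation supports complex hate-speech detection tasks. The dataset is moderate in size (7,000 comments) and comes with train, validation, and test splits, which simplifies model development and evaluation and makes it a useful resource for cross-lingual hate-speech research.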
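
The multi-label structure described above is commonly fed to a model as a multi-hot vector. The sketch below shows one way to do that, assuming each row is a dictionary keyed by the column names in the feature list; the helper names `to_multi_hot` and `active_categories` are hypothetical.

```python
# The nine hate-speech category columns, in the order they appear in the card.
HATE_CATEGORIES = [
    "antisemitism",
    "apology_for_the_dictatorship",
    "fatphobia",
    "homophobia",
    "partyism",
    "racism",
    "religious_intolerance",
    "sexism",
    "xenophobia",
]


def to_multi_hot(row):
    """Encode one row as a list of 0/1 values, one position per category."""
    return [int(bool(row.get(name, False))) for name in HATE_CATEGORIES]


def active_categories(row):
    """Return the names of the categories flagged as True in this row."""
    return [name for name in HATE_CATEGORIES if row.get(name, False)]
```

A comment can carry several categories at once, so the vector may contain more than one 1, or none at all for comments that are offensive without being hate speech.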

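Usage

Researchers can use the dataset for natural language processing tasks, particularly hate-speech detection and classification. Loading the splits gives direct access to the comment text and its multi-label annotations for training supervised models. As the name por_Latn_to_eng_Latn indicates, the dataset involves conversion from Portuguese to English, which facilitates cross-lingual analysis. Preprocessing should combine text cleaning with feature engineering to improve model performance. Tune hyperparameters on the validation split and report final results on the test split.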
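
The workflow above might look like the following sketch. It assumes the Hugging Face `datasets` package is installed and that the repository is still available under the identifier shown on this page. The cleaning rules (dropping URLs, masking mentions as `@user`, lowercasing) are illustrative choices, not something the card prescribes.

```python
import re

DATASET_ID = "ruanchaves/hatebr_por_Latn_to_eng_Latn"


def load_splits():
    """Load the train/validation/test splits (needs `datasets` and network access)."""
    from datasets import load_dataset  # lazy import: helpers below work without it
    return load_dataset(DATASET_ID)


_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\w+")
_SPACE = re.compile(r"\s+")


def clean_comment(text):
    """Illustrative cleaning: drop URLs, mask mentions, collapse spaces, lowercase."""
    text = _URL.sub(" ", text)
    text = _MENTION.sub("@user", text)
    return _SPACE.sub(" ", text).strip().lower()


def to_example(row):
    """Turn one row into a (text, label) pair for binary offensive-language detection."""
    return clean_comment(row["instagram_comments"]), int(row["offensive_language"])
```

With the splits loaded, something like `[to_example(r) for r in load_splits()["train"]]` would yield training pairs; the validation and test splits can be processed the same way.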

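Background and challenges

Background overview

As social media platforms have spread worldwide, detecting and classifying online hate speech has become a key research topic in natural language processing. The hatebr_por_Latn_to_eng_Latn dataset, published by ruanchaves, focuses on recognizing hate speech in Brazilian Portuguese. Its core research problem is accurately distinguishing fine-grained hate categories such as racism, sexism, and religious intolerance. The dataset aims to provide high-quality annotated resources for multilingual hate-speech analysis, supporting content moderation across cultural contexts and the development of ethical AI.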

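Current challenges

The dataset addresses a core challenge in hate-speech detection: precisely identifying multiple, overlapping hate categories in complex language while handling the cultural context and slang variants specific to Brazilian Portuguese. During construction, the main challenge was annotation consistency. Because hate speech is subjective and culturally sensitive, several specialists had to annotate jointly to make the labels reliable. Because the data comes from social media, construction also had to balance noise filtering against privacy protection.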
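
Annotation consistency can be inspected directly from the three `specialist_*_hate_speech` columns. The sketch below assumes each column holds one specialist's independent hate-speech judgement, which the card does not state explicitly. It computes a majority vote per comment and the raw pairwise agreement rate, without any chance correction.

```python
from itertools import combinations

SPECIALISTS = [
    "specialist_1_hate_speech",
    "specialist_2_hate_speech",
    "specialist_3_hate_speech",
]


def majority_vote(row):
    """Return True when more than half of the specialists flagged the comment."""
    votes = sum(bool(row[name]) for name in SPECIALISTS)
    return votes * 2 > len(SPECIALISTS)


def pairwise_agreement(rows):
    """Return the fraction of rows on which each pair of specialists agrees."""
    result = {}
    for a, b in combinations(SPECIALISTS, 2):
        same = sum(1 for row in rows if bool(row[a]) == bool(row[b]))
        result[(a, b)] = same / len(rows) if rows else float("nan")
    return result
```

Raw agreement overstates reliability when one label dominates, so a chance-corrected statistic such as Fleiss' kappa would be a natural next step.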

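Common scenarios

Typical use cases

In social-media content moderation and natural language processing, hatebr_por_Latn_to_eng_Latn serves as a resource for multi-label classification because of its fine-grained hate-speech annotation. By labelling Instagram comments along several dimensions of offensiveness, such as racism, sexism, and xenophobia, it gives researchers a benchmark for measuring the performance of hate-speech detection models. The multi-specialist annotation further supports label reliability and consistency, so that models are trained on high-quality human judgements, which in turn advances cross-lingual hate-speech recognition.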
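
When the dataset is used as a multi-label benchmark, micro- and macro-averaged F1 are common metrics. The sketch below is a dependency-free illustration of how they are computed from multi-hot label rows. The function names are hypothetical, and the choice to score a label with no true or predicted positives as 0.0 is a convention of this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Compute per-label, macro, and micro F1 for equal-length multi-hot rows."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    per_label = [f1_from_counts(*c) for c in counts]
    macro = sum(per_label) / n_labels
    # Micro-averaging pools the counts over all labels before computing F1.
    micro = f1_from_counts(*(sum(c[k] for c in counts) for k in range(3)))
    return {"micro": micro, "macro": macro, "per_label": per_label}
```

Macro F1 weights every category equally, which matters for rare categories; micro F1 is dominated by the frequent ones.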

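Practical applications

In practice, hatebr_por_Latn_to_eng_Latn can be used to build automated content-moderation systems that help social media platforms identify and manage hate speech in Portuguese-speaking communities. Its multi-dimensional annotation supports customized filtering strategies that target specific attack types, such as antisemitism or homophobic speech. The dataset can also give policymakers and educational institutions a data basis for analyzing how online hate spreads, supporting digital-literacy tools and anti-discrimination initiatives that help keep online spaces healthy and inclusive.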
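
A customized filtering strategy of the kind described above could be expressed as a small policy function over labelled or model-predicted rows. This is a sketch under assumptions: `make_filter` is a hypothetical helper, and treating a higher `offensiveness_levels` value as more offensive is an assumption, since the card gives only the column's dtype.

```python
KNOWN_CATEGORIES = frozenset([
    "antisemitism",
    "apology_for_the_dictatorship",
    "fatphobia",
    "homophobia",
    "partyism",
    "racism",
    "religious_intolerance",
    "sexism",
    "xenophobia",
])


def make_filter(blocked_categories, min_level=None):
    """Build a predicate that flags rows matching any blocked category.

    If min_level is given, rows whose offensiveness_levels value is at least
    min_level are flagged too (assumes higher means more offensive).
    """
    unknown = set(blocked_categories) - KNOWN_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    blocked = tuple(blocked_categories)

    def should_flag(row):
        if any(row.get(name, False) for name in blocked):
            return True
        if min_level is not None and row.get("offensiveness_levels", 0) >= min_level:
            return True
        return False

    return should_flag
```

For example, `make_filter(["antisemitism", "homophobia"])` yields a predicate that flags only those two categories and ignores the rest.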

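Derived and related work

Work derived from this dataset includes comparative analyses of multilingual hate-speech detection models and refinements of transfer-learning frameworks. Researchers have used its fine-grained annotations to explore how ensemble learning and active learning can improve annotation efficiency, and have developed bias-mitigation algorithms for Portuguese-language contexts. This work extends the cross-lingual applicability of computational hate-speech models, encourages the use of data augmentation and adversarial example generation in content safety, and provides a reference point for later low-resource language processing tasks.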

The content above was collected and summarized by 遇见数据集.
