somosnlp-hackathon-2023/informes_discriminacion_gitana

Name: somosnlp-hackathon-2023/informes_discriminacion_gitana
Creator: somosnlp-hackathon-2023
Published: 2023-04-11 09:29:14
License: not stated in this listing (the dataset card metadata below gives Apache-2.0)

Source: Hugging Face. Updated: 2023-04-11. Indexed: 2024-05-25.

Download link:

https://hf-mirror.com/datasets/somosnlp-hackathon-2023/informes_discriminacion_gitana

Resource overview:

This is a Spanish-language dataset sourced from the document center of Fundación Secretariado Gitano, which records instances of discrimination faced by Romani people. Its goal is to support a system that generates interventions to minimize the impact of discriminatory incidents. The data was collected through web scraping and PDF extraction, and each case follows a three-part format: facts, intervention, and outcome. The data was preprocessed and cleaned to ensure uniform formatting. The dataset supports text classification and text generation, and is used primarily to generate interventions and to predict the type of discrimination. It contains 1,990 instances split into training, validation, and test sets. The data is imbalanced, and the authors plan to add more data to balance it.

Provider:

somosnlp-hackathon-2023

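Summary of original information

Dataset overview

Basic information

Name: no specific name provided
Language: Spanish (es)
Task categories: text classification (text-classification) and text generation (text2text-generation)
Tags: hate
Size category: fewer than 1,000 (n<1K) as declared in the metadata, which does not match the 1,990 instances listed under the dataset splits
License: Apache-2.0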

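Dataset structure

Features:
- sintetico: indicates whether the record is original data (value 0) or synthetic data (value 1)
- text: the facts of the incident involving the affected person
- intervencion: the measures taken by the foundation to prevent the facts from recurring
- tipo_discriminacion: the type of discrimination; takes one of several discrimination categories
- resultado: the impact of the intervention; positive, negative, or neutral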
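
The field list above can be mirrored in a small typed record. The following is a minimal sketch, not the authors' code: the `DiscriminationCase` class, the `parse_case` helper, and the set of Spanish outcome labels are assumptions made for illustration (only "Negativo." appears in the example instance on this page).

```python
from dataclasses import dataclass

# Assumed Spanish outcome labels; only "Negativo." appears in the example
# instance on this page, the other two are guesses.
ASSUMED_OUTCOMES = {"positivo", "negativo", "neutro"}


@dataclass
class DiscriminationCase:
    sintetico: int            # 0 = original record, 1 = synthetic record
    text: str                 # facts of the incident
    intervencion: str         # measures taken by the foundation
    tipo_discriminacion: str  # discrimination category
    resultado: str            # normalized outcome label


def parse_case(raw: dict) -> DiscriminationCase:
    """Validate a raw record and return a typed case (hypothetical helper)."""
    sintetico = int(raw["sintetico"])  # the example stores the flag as a string
    if sintetico not in (0, 1):
        raise ValueError(f"sintetico must be 0 or 1, got {sintetico}")
    # Normalize "Negativo." -> "negativo" before checking the label.
    resultado = raw["resultado"].strip().rstrip(".").lower()
    if resultado not in ASSUMED_OUTCOMES:
        raise ValueError(f"unexpected resultado: {raw['resultado']!r}")
    return DiscriminationCase(
        sintetico=sintetico,
        text=raw["text"],
        intervencion=raw["intervencion"],
        tipo_discriminacion=raw["tipo_discriminacion"],
        resultado=resultado,
    )
```

Converting the flag to an integer and stripping the trailing period keeps later filtering and counting independent of how the raw strings were stored.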

Dataset splits

Training set: 1,791 examples, 1,569,183.3 bytes
Test set: 100 examples, 87,614.92 bytes
Validation set: 99 examples, 86,738.78 bytes
Total instances: 1,990

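Dataset creation and processing

Data source: the document center of Fundación Secretariado Gitano
Data collection and processing: PDF files containing discrimination cases were scraped from the website and extracted, and a preprocessing script unified the data format
Data cleaning: the pysentimiento library was used to classify the outcome field, and few-shot learning with the Bloom model was used to fill in missing intervention and outcome fields
Data summarization: a pretrained model was used to summarize fact texts that were too long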
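
The few-shot step mentioned above can be pictured as building a prompt from a handful of complete cases and leaving the last intervention open for the model to continue. This is only a sketch under assumptions: the prompt layout, the `build_few_shot_prompt` helper, and the placeholder cases are hypothetical, and the page does not describe the prompt the authors actually sent to Bloom.

```python
def build_few_shot_prompt(examples, new_facts):
    """Build a few-shot prompt from (facts, intervention) pairs.

    The "Hechos" / "Intervención" layout is an assumed format, not the
    one documented for this dataset.
    """
    blocks = [
        f"Hechos: {facts}\nIntervención: {intervention}"
        for facts, intervention in examples
    ]
    # The final block has no intervention, so a language model would
    # be asked to complete it.
    blocks.append(f"Hechos: {new_facts}\nIntervención:")
    return "\n\n".join(blocks)


# Placeholder cases for illustration only; these are not dataset records.
examples = [
    ("<hechos del caso A>", "<intervención del caso A>"),
    ("<hechos del caso B>", "<intervención del caso B>"),
]
prompt = build_few_shot_prompt(examples, "<hechos del caso sin intervención>")
print(prompt)
```

The resulting string would then be passed to a text generation model, and the generated continuation used as the candidate value for the missing field.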

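Dataset uses

Social impact: intended as a tool to help implement measures against racism toward the Romani population and to assess the impact of different measures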

Dataset balance

Outcome distribution: 280 positive, 939 negative, 771 neutral, which shows that the dataset is imbalanced with respect to outcomes
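
To make the imbalance concrete, the sketch below computes each outcome's share of the 1,990 instances and one common way to compensate during training, inverse-frequency class weights. The weighting scheme is an illustrative choice, not something the page says the authors used.

```python
# Outcome counts as reported in the distribution above.
counts = {"positive": 280, "negative": 939, "neutral": 771}

total = sum(counts.values())
n_classes = len(counts)

# Share of each outcome in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency ("balanced") weights: total / (n_classes * count).
# A perfectly balanced dataset would give every class a weight of 1.0.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

Under this scheme the positive class, at roughly 14% of the data, receives a weight well above 1, while the negative class, at roughly 47%, receives a weight below 1.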

Future updates

Plans: the authors will work on increasing the size of the dataset in order to balance it

Dataset details

Example instance

{
  "sintetico": "0",
  "text": "Una joven gitana comenzó a trabajar en una tienda de ropa... (detailed description omitted)",
  "intervencion": "Se entrevistó a la joven... (detailed description omitted)",
  "tipo_discriminacion": "Discriminación directa",
  "resultado": "Negativo."
}
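
A record like the one above maps onto the two listed tasks in a straightforward way: the facts serve as input, with the intervention as the generation target and the discrimination type as the classification label. The sketch below assumes this pairing; the helper names are hypothetical, and the elided text is kept exactly as it appears on this page.

```python
import json

# The example instance from this page, with its elisions left as-is.
record_json = """
{
  "sintetico": "0",
  "text": "Una joven gitana comenzó a trabajar en una tienda de ropa...",
  "intervencion": "Se entrevistó a la joven...",
  "tipo_discriminacion": "Discriminación directa",
  "resultado": "Negativo."
}
"""

record = json.loads(record_json)


def to_generation_pair(rec):
    """(input, target) for text generation: facts -> intervention."""
    return rec["text"], rec["intervencion"]


def to_classification_pair(rec):
    """(input, label) for text classification: facts -> discrimination type."""
    return rec["text"], rec["tipo_discriminacion"]


source, target = to_generation_pair(record)
facts, label = to_classification_pair(record)
print(label)
```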

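Split details

Training set: 90% of the input sentences, average sentence length 94.71
Validation set: 5% of the input sentences, average sentence length 90.94
Test set: 5% of the input sentences, average sentence length 98.07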
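
The stated percentages can be checked against the split sizes given earlier on this page (1,791 / 99 / 100), and the per-split averages can be combined into a size-weighted overall average. This is a quick arithmetic sketch; the unit of "sentence length" is not stated in the source, so the result is in whatever unit the reported averages use.

```python
# (number of examples, reported average sentence length) per split.
splits = {
    "train": (1791, 94.71),
    "validation": (99, 90.94),
    "test": (100, 98.07),
}

total = sum(n for n, _ in splits.values())

# Fraction of the dataset in each split.
split_shares = {name: n / total for name, (n, _) in splits.items()}

# Size-weighted mean of the per-split average lengths.
overall_mean = sum(n * avg for n, avg in splits.values()) / total

for name, share in split_shares.items():
    print(f"{name}: {share:.2%}")
print(f"overall average length: {overall_mean:.2f}")
```

The sizes add up to 1,990 and reproduce the 90% / 5% / 5% split to within rounding.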

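Rationale for creating the dataset

Purpose: to understand objectively whether the measures the foundation currently takes are effective, and whether they need to be improved to better support the Romani population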

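Data collection and normalization

Data extraction: extracted from the website of Fundación Secretariado Gitano, keeping only the facts, intervention, outcome, and discrimination type fields
Data cleaning and normalization: the pysentimiento library was used to classify the outcomes, and few-shot learning with the Bloom model was used to fill in missing intervention and outcome fields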

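Annotations

Annotation process: Argilla was used to label the outcome category as positive, negative, or neutral
Annotation details: the intervention and outcome of each instance were checked for correctness, and inconsistent labels were adjusted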

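Considerations for using the data

Social impact: the dataset is intended to help combat racism and to assess and improve support measures for the Romani population
Data sensitivity: the data contains no information that infringes on the rights of the affected persons, so no anonymization is required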
