projecte-aina/InToxiCat

Name: projecte-aina/InToxiCat
Creator: projecte-aina
Published: 2024-10-11 15:36:34
License: no description provided (the dataset card below lists cc-by-nc-4.0)

Source: Hugging Face. Updated: 2024-10-11. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/projecte-aina/InToxiCat

Resource description:

---
annotations_creators:
- expert-generated
language:
- ca
license:
- cc-by-nc-4.0
multilinguality:
- monolingual
task_categories:
- text-classification
- token-classification
pretty_name: InToxiCat
tags:
- abusive-language-detection
- abusive-language
- toxic-language-detection
- toxicity-detection
dataset_info:
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: sentence
    dtype: string
  - name: topic
    dtype: string
  - name: keywords
    sequence: string
  - name: context_needed
    dtype: string
  - name: is_abusive
    dtype: int64
  - name: abusiveness_agreement
    dtype: string
  - name: target_type
    sequence: int64
  - name: abusive_spans
    struct:
    - name: text
      sequence: string
    - name: index
      sequence: string
  - name: target_spans
    struct:
    - name: text
      sequence: string
    - name: index
      sequence: string
  - name: is_implicit
    dtype: string
  splits:
  - name: train
    num_bytes: 18159422
    num_examples: 23847
  - name: test
    num_bytes: 2276428
    num_examples: 2981
  - name: validation
    num_bytes: 2285701
    num_examples: 2981
  download_size: 14619803
  dataset_size: 22721551
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---

# Dataset Card for InToxiCat

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Example](#example)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Website:** https://zenodo.org/records/10600606
- **Point of Contact:** langtech@bsc.es

### Dataset Summary

InToxiCat is a dataset for the detection of abusive language (language intended to harm an individual, a group, etc.) in Catalan, produced by the BSC LangTech unit. The dataset consists of 29,809 sentences obtained from internet forums, each annotated as abusive or not abusive. The 6,047 instances annotated as abusive are further annotated for the following features: abusive span, target span, target type, and the implicit or explicit nature of the abusiveness in the message.

The dataset is split, with a balanced abusive/non-abusive distribution, into 23,847 training samples, 2,981 validation samples, and 2,981 test samples.

### Supported Tasks and Leaderboards

Abusive Language Detection

### Languages

The dataset is in Catalan (`ca-ES`).

## Dataset Structure

### Data Instances

Three JSON files, one for each split. Note that the example nests the annotation fields under ``annotation`` and uses ``key_words``, whereas the metadata lists flat features and ``keywords``.

### Example

<pre>
{
  "id": "9472844_16_0",
  "context": "Aquest tiu no té ni puta idea del que és una guerra ni del que s'espera d'un soldat.I què s'empatolla de despeses mèdiques. A veure si li passaré com al Hollande i sortiré la factura del seu perruquer (o taxidermista, no sé)",
  "sentence": "Aquest tiu no té ni puta idea del que és una guerra ni del que s'espera d'un soldat.I què s'empatolla de despeses mèdiques.",
  "topic": "Internacional",
  "key_words": [
    "puta"
  ],
  "annotation": {
    "is_abusive": "abusive",
    "abusiveness_agreement": "full",
    "context_needed": "no",
    "abusive_spans": [
      [
        "no té ni puta idea",
        "11:29"
      ]
    ],
    "target_spans": [
      [
        "Aquest tiu",
        "0:10"
      ]
    ],
    "target_type": [
      "INDIVIDUAL"
    ],
    "is_implicit": "yes"
  }
}
</pre>

### Data Fields

- ``id`` (a string feature): unique identifier of the instance.
- ``context`` (a string feature): complete text message from the user surrounding the sentence (it can coincide totally or only partially with the sentence).
- ``sentence`` (a string feature): text message where the abusiveness is evaluated.
- ``topic`` (a string feature): category of the Racó Català forums that the sentence comes from.
- ``keywords`` (a list of strings): keywords used to select the candidate messages to annotate.
- ``context_needed`` (a string feature): "yes" if all the annotators consulted the context to decide on the sentence's abusiveness, "no" if none of them did, and "maybe" if they did not agree about it.
- ``is_abusive`` (described as a bool feature; declared as ``int64`` in the metadata and shown as a string in the example): "abusive" or "not_abusive".
- ``abusiveness_agreement`` (a string feature): "full" if the two annotators agreed on the abusiveness/not-abusiveness of the sentence, and "partial" if the abusiveness had to be decided by a third annotator.
- ``abusive_spans`` (a dictionary with field 'text' (list of strings) and 'index' (list of strings)): the sequence(s) of words that contribute to the text's abusiveness.
- ``is_implicit`` (a string): whether the abusiveness is explicit (contains a profanity, slur or threat) or implicit (does not contain a profanity or slur, but is likely to contain irony, sarcasm or similar resources).
- ``target_spans`` (a dictionary with field 'text' (list of strings) and 'index' (list of strings)): if found in the message, the sequence(s) of words that refer to the target of the text's abusiveness.
- ``target_type`` (a list; ``sequence: int64`` in the metadata, string labels such as "INDIVIDUAL" in the example): three possible categories. The categories are non-exclusive, as some targets may have a dual identity and more than one target may be detected in a single message.
  - ``individual``: a famous person, a named person or an unnamed person interacting in the conversation.
  - ``group``: considered to be a unit based on the same ethnicity, gender or sexual orientation, political affiliation, religious belief or something else.
  - ``other``: e.g. an organization, a situation, an event, or an issue.

### Data Splits

* train.json: 23847 examples
* dev.json: 2981 examples
* test.json: 2981 examples

## Dataset Creation

### Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

### Source Data

#### Initial Data Collection and Normalization

The sentences to be annotated were collected from [Racó Català](https://www.racocatala.cat/forums) forums using a list of keywords (provided in Zenodo). The messages belong to different categories of Racó Català, specified in the "topic" field of the dataset. The length of the messages varies from one sentence to several sentences.

#### Who are the source language producers?

Anonymized users from Racó Català forums.

### Annotations

#### Annotation process

The annotation process was divided into the following two tasks, carried out in sequential order:

Task 1. The sentences (around 30,000) were annotated by two annotators as either abusive or not abusive. If a sentence was ambiguous, the annotators could consult the context, i.e. the whole message of the user (if the sentence to be annotated was a segment contained in the message). In cases where annotators 1 and 2 disagreed about the abusiveness of a message, it was annotated by a third annotator. As a result, the sentences that are ultimately considered abusive are those that were initially annotated as abusive by both annotators or, in the case of an initial disagreement between them, those that were resolved as abusive by the third annotator.

Task 2. The sentences annotated as abusive (6,047) in Task 1 were further annotated by the two main annotators for the following features, explained in the Summary section: abusive spans, implicit/explicit abusiveness, target spans, and target type.

The annotation guidelines are published and available on Zenodo.

#### Who are the annotators?

The annotators were qualified professionals with university education and a demonstrably excellent knowledge of Catalan (minimum level C1 or equivalent).

### Personal and Sensitive Information

No personal or sensitive information included.

## Considerations for Using the Data

### Social Impact of Dataset

We hope this dataset contributes to the development of language models in Catalan, a low-resource language.

### Discussion of Biases

[N/A]

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Licensing Information

This work is licensed under a [Creative Commons Attribution Non-commercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license.

### Citation Information

```
@inproceedings{gonzalez-agirre-etal-2024-building-data,
    title = "Building a Data Infrastructure for a Mid-Resource Language: The Case of {C}atalan",
    author = "Gonzalez-Agirre, Aitor  and
      Marimon, Montserrat  and
      Rodriguez-Penagos, Carlos  and
      Aula-Blasco, Javier  and
      Baucells, Irene  and
      Armentano-Oller, Carme  and
      Palomar-Giner, Jorge  and
      Kulebi, Baybars  and
      Villegas, Marta",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.231",
    pages = "2556--2566",
}
```

[![DOI](https://zenodo.org/badge/DOI/10.57967/hf/1719.svg)](https://doi.org/10.57967/hf/1719)

### Contributions

[N/A]

Provided by:

projecte-aina

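Summary of the original information

Dataset card: InToxiCat

Dataset description

Dataset summary

InToxiCat is a dataset for detecting abusive language (language intended to harm an individual, a group, etc.) in Catalan, produced by the BSC LangTech unit. It contains 29,809 sentences obtained from internet forums, each annotated as abusive or not abusive. The 6,047 instances annotated as abusive are further annotated for the following features: abusive span, target span, target type, and the implicit or explicit nature of the abusiveness in the message.

The dataset is split, with a balanced abusive/non-abusive distribution, into 23,847 training samples, 2,981 validation samples, and 2,981 test samples.

Supported tasks and leaderboards

Abusive language detection

Languages

The dataset is in Catalan (ca-ES).

Dataset structure

Data instances

Three JSON files, one for each split.

Example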

{
  "id": "9472844_16_0",
  "context": "Aquest tiu no té ni puta idea del que és una guerra ni del que s'espera d'un soldat.I què s'empatolla de despeses mèdiques. A veure si li passaré com al Hollande i sortiré la factura del seu perruquer (o taxidermista, no sé)",
  "sentence": "Aquest tiu no té ni puta idea del que és una guerra ni del que s'espera d'un soldat.I què s'empatolla de despeses mèdiques.",
  "topic": "Internacional",
  "key_words": ["puta"],
  "annotation": {
    "is_abusive": "abusive",
    "abusiveness_agreement": "full",
    "context_needed": "no",
    "abusive_spans": [["no té ni puta idea", "11:29"]],
    "target_spans": [["Aquest tiu", "0:10"]],
    "target_type": ["INDIVIDUAL"],
    "is_implicit": "yes"
  }
}
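
The index strings in the example ("11:29", "0:10") appear to be end-exclusive character offsets into the sentence. The card does not state this explicitly, so the sketch below treats it as an assumption and checks it against the example; `parse_span` and `span_matches` are hypothetical helper names, not part of the dataset.

```python
def parse_span(index: str) -> tuple[int, int]:
    """Split a 'start:end' string into integer character offsets."""
    start, end = index.split(":")
    return int(start), int(end)


def span_matches(sentence: str, span: list[str]) -> bool:
    """Check that the span text equals the sentence slice named by its index.

    Assumption: offsets count Unicode code points and the end is exclusive.
    """
    text, index = span
    start, end = parse_span(index)
    return sentence[start:end] == text


# Opening of the example sentence (\u00e9 is the precomposed "é").
sentence = "Aquest tiu no t\u00e9 ni puta idea del que \u00e9s una guerra"
abusive_span = ["no t\u00e9 ni puta idea", "11:29"]
target_span = ["Aquest tiu", "0:10"]

print(span_matches(sentence, abusive_span))
print(span_matches(sentence, target_span))
```

If the text were stored in a decomposed Unicode form, the offsets would shift, so normalizing to NFC before slicing may be worth considering.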

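Data fields

id (string feature): unique identifier of the instance.
context (string feature): the user's complete text message surrounding the sentence (it can coincide totally or only partially with the sentence).
sentence (string feature): the text message whose abusiveness is evaluated.
topic (string feature): category of the Racó Català forums that the sentence comes from.
keywords (list of strings): keywords used to select the candidate messages to annotate.
context_needed (string feature): whether the annotators needed to consult the context to decide on the sentence's abusiveness: "yes" / "no" / "maybe".
is_abusive (described as a boolean feature; the card's metadata declares int64): "abusive" or "not_abusive".
abusiveness_agreement (string feature): whether the two annotators agreed: "full" or "partial".
abusive_spans (dictionary with text and index): the sequence(s) of words that make the text abusive.
is_implicit (string): whether the abusiveness is explicit (contains a profanity, slur or threat) or implicit (contains no profanity or slur, but may contain irony, sarcasm or similar devices).
target_spans (dictionary with text and index): if present in the message, the sequence(s) of words that refer to the target of the abuse.
target_type (a list; described in the card as a dictionary with text and index, but shown as a list of labels in the example): three possible categories. They are non-exclusive, as a target may have a dual identity and more than one target may be detected in a single message.
- individual: a famous person, a named person, or an unnamed person taking part in the conversation.
- group: a unit based on the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or another shared characteristic.
- other: e.g. an organization, a situation, an event, or an issue.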
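
The dataset card defines context_needed as "yes" when all annotators consulted the context, "no" when none did, and "maybe" when they did not agree. A minimal sketch of that rule follows; the function name and the per-annotator list of booleans are hypothetical, since the raw per-annotator records are not part of the published fields.

```python
def aggregate_context_needed(consulted: list[bool]) -> str:
    """Combine per-annotator flags into the context_needed value.

    consulted[i] is True if annotator i looked at the context
    (a hypothetical input format, assumed for illustration).
    """
    if not consulted:
        raise ValueError("at least one annotator is required")
    if all(consulted):
        return "yes"
    if not any(consulted):
        return "no"
    return "maybe"


print(aggregate_context_needed([True, True]))
print(aggregate_context_needed([False, False]))
print(aggregate_context_needed([True, False]))
```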

Data splits

train.json: 23847 examples
dev.json: 2981 examples
test.json: 2981 examples
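
As a quick consistency check on the figures above, the split sizes can be summed and compared with the 29,809 sentences and 6,047 abusive instances reported in the summary. This is plain arithmetic on the published numbers, not a call into the dataset itself.

```python
# Split sizes as listed in the dataset card.
splits = {"train": 23847, "dev": 2981, "test": 2981}

total = sum(splits.values())
fractions = {name: round(count / total, 3) for name, count in splits.items()}

# Share of sentences annotated as abusive, per the summary (6,047 of the total).
abusive_share = round(6047 / total, 3)

print(total)
print(fractions)
print(abusive_share)
```

The counts add up to the reported total and correspond to roughly an 80/10/10 split, with about one sentence in five annotated as abusive.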

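Dataset creation

Curation rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

Source data

Initial data collection and normalization

The sentences to be annotated were collected from the Racó Català forums using a list of keywords (provided on Zenodo). The messages belong to different Racó Català categories, specified in the "topic" field of the dataset. Message length varies from one sentence to several sentences.

Source language producers

Anonymized users of the Racó Català forums.

Annotations

Annotation process

The annotation process was divided into the following two tasks, carried out in sequential order:

Task 1. Around 30,000 sentences were annotated by two annotators as either abusive or not abusive. If a sentence was ambiguous, the annotators could consult the context, i.e. the user's complete message (if the sentence to be annotated was a segment of the message). When annotators 1 and 2 disagreed about the abusiveness of a message, it was annotated by a third annotator. The sentences ultimately considered abusive are those initially annotated as abusive by both annotators or, after an initial disagreement, resolved as abusive by the third annotator.

Task 2. The 6,047 sentences annotated as abusive in Task 1 were further annotated by the two main annotators for the following features, described in the summary section: abusive spans, implicit/explicit abusiveness, target spans, and target type.

The annotation guidelines are published and available on Zenodo.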
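
The Task 1 decision rule, together with the abusiveness_agreement values ("full" / "partial") defined in the data fields, can be sketched as follows. The function name and boolean inputs are hypothetical; the published annotation guidelines remain the authoritative description.

```python
from typing import Optional


def adjudicate(first: bool, second: bool,
               third: Optional[bool] = None) -> tuple[bool, str]:
    """Return (is_abusive, abusiveness_agreement) for one sentence.

    first, second: labels from the two main annotators.
    third: label from the third annotator, consulted only on disagreement.
    """
    if first == second:
        # Both annotators agree, so no tie-break is needed.
        return first, "full"
    if third is None:
        raise ValueError("annotators disagree; a third annotation is required")
    # The third annotator resolves the disagreement.
    return third, "partial"


print(adjudicate(True, True))
print(adjudicate(True, False, third=False))
```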

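Annotators

The annotators were qualified professionals with university education and a demonstrably excellent knowledge of Catalan (minimum level C1 or equivalent).

Personal and sensitive information

No personal or sensitive information is included.

Considerations for using the data

Social impact of the dataset

We hope this dataset contributes to the development of language models in Catalan, a low-resource language.

Discussion of biases

[N/A]

Other known limitations

[N/A]

Additional information

Dataset curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing information

This work is licensed under a Creative Commons Attribution Non-commercial 4.0 International license.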

Citation information

@inproceedings{gonzalez-agirre-etal-2024-building-data,
    title = "Building a Data Infrastructure for a Mid-Resource Language: The Case of {C}atalan",
    author = "Gonzalez-Agirre, Aitor and Marimon, Montserrat and Rodriguez-Penagos, Carlos and Aula-Blasco, Javier and Baucells, Irene and Armentano-Oller, Carme and Palomar-Giner, Jorge and Kulebi, Baybars and Villegas, Marta",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.231",
    pages = "2556--2566",
}

Contributions

[N/A]

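Collected summary

Dataset introduction

Construction

InToxiCat was built for Catalan, a low- to mid-resource language, through a systematic expert annotation process. The data come from the Racó Català forums, where candidate sentences were selected with a predefined keyword list. Annotation proceeded in two stages: first, two professional annotators independently judged whether each sentence was abusive, with a third annotator arbitrating disagreements; then the sentences confirmed as abusive were further annotated for abusive spans, target spans, target type, and whether the abusiveness is implicit or explicit. The whole process followed published annotation guidelines and was carried out by experts with a Catalan level of C1 or above, which supports the consistency of the annotations.

Usage

The dataset is mainly intended for training and evaluating abusive language detection models for Catalan. Researchers can load the provided JSON files directly and use the `sentence` field for text classification or sequence labeling. For binary classification, use the `is_abusive` field; for finer-grained analysis, fields such as `abusive_spans` and `target_spans` support abusive span identification and target detection. The data are already split into training, validation, and test sets, which fits a standard machine learning workflow. Users should observe the CC-BY-NC-4.0 license and take the specific linguistic features of Catalan into account.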
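
For the sequence labeling use mentioned above, character spans such as "11:29" need to be projected onto tokens. The sketch below is one possible way to do this, assuming whitespace tokenization and end-exclusive character offsets; the `B-ABUSIVE` / `I-ABUSIVE` / `O` tag names and the `bio_tags` function are hypothetical choices, not something the dataset prescribes.

```python
import re


def bio_tags(sentence: str,
             spans: list[tuple[int, int]]) -> list[tuple[str, str]]:
    """Tag whitespace-delimited tokens with BIO labels from character spans."""
    tagged = []
    previous = None  # span covering the previous token, if any
    for match in re.finditer(r"\S+", sentence):
        # First span that overlaps this token's character range, else None.
        current = next(
            (s for s in spans if match.start() < s[1] and match.end() > s[0]),
            None,
        )
        if current is None:
            tag = "O"
        elif current == previous:
            tag = "I-ABUSIVE"  # continues the span of the previous token
        else:
            tag = "B-ABUSIVE"  # first token of a span
        tagged.append((match.group(), tag))
        previous = current
    return tagged


# Opening of the example sentence (\u00e9 is the precomposed "é").
sentence = "Aquest tiu no t\u00e9 ni puta idea del que"
for token, tag in bio_tags(sentence, [(11, 29)]):
    print(token, tag)
```

A model with a subword tokenizer would need the same projection applied to its own token offsets rather than to whitespace tokens.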

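Background and challenges

Background overview

Detecting abusive language online has become a key research direction in natural language processing, and for low-resource languages in particular, high-quality annotated datasets are the foundation for progress. InToxiCat was created in 2024 by the Language Technologies Unit of the Barcelona Supercomputing Center as part of the Aina project, with the aim of building the first annotated corpus for Catalan focused on abusive language detection. The dataset collects 29,809 sentences from the Racó Català forums, annotated at several levels by professional annotators: beyond marking whether a sentence is abusive, it also records abusive spans, target types, and whether the abuse is implicit or explicit. It fills a data gap for Catalan in harmful content detection and provides a resource for developing fairer, more culturally sensitive language models.

Current challenges

Abusive language detection faces several inherent challenges, including semantic ambiguity, dependence on cultural context, and the accurate recognition of implicit abuse such as irony and sarcasm. Building InToxiCat also involved specific difficulties. First, because Catalan is a mid-resource language, the amount of public text available for mining is limited, so candidate sentences had to be selected from forums with a carefully designed keyword strategy. Second, abusiveness annotation depends heavily on subjective judgment: to ensure quality, the team used multiple annotation rounds with arbitration and wrote detailed annotation guidelines to unify the criteria, and annotators had to consult the conversational context when needed, which added considerably to the complexity and cost of annotation. Together, these challenges show how demanding it is to build a high-quality, fine-grained dataset in a low-resource language setting.

Common scenarios

Typical use cases

For online content safety monitoring in low-resource languages, InToxiCat provides a key resource for abusive language detection in Catalan. Its typical use cases are text classification and sequence labeling: the annotated abusive spans, target types, and implicitness features let researchers train and evaluate models that identify harmful content such as hate speech and personal attacks in complex contexts. Its balanced distribution and multi-level annotation structure make it a useful benchmark for developing cross-lingual toxicity detection systems.

Academic problems addressed

The dataset addresses the lack of research resources for content moderation in low-resource languages. By providing large-scale, high-quality annotated Catalan data, it fills a resource gap in abusive language detection for the language and supports the equitable development of language models for minority languages. Its annotation scheme covers explicit and implicit abuse as well as target categories, giving an empirical basis for studying verbal abuse in culturally specific contexts and encouraging work that bridges computational linguistics and the social sciences.

Practical applications

In practice, InToxiCat can support content moderation systems for online platforms in Catalan-speaking areas. Social media forums, news comment sections, and similar spaces can use models trained on the dataset to automatically detect and filter abusive remarks aimed at individuals or groups. Its implicit-abuse annotations also help address increasingly complex indirect attacks such as sarcasm and insinuation, improving the depth of semantic understanding in content management systems.

Recent research on the dataset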