agentlans/PleIAs-ToxicCommons

Name: agentlans/PleIAs-ToxicCommons
Creator: agentlans
Published: 2025-02-23 14:31:55
License: Not specified

Source: Hugging Face (updated 2025-02-23; indexed 2025-04-12)

Download link:

https://hf-mirror.com/datasets/agentlans/PleIAs-ToxicCommons

Description:

This dataset is a refined version of the PleIAs/ToxicCommons collection of historical texts, labeled for content that may be considered objectionable by modern standards (what the dataset's authors deem toxic). The cleaned dataset contains 1,051,027 rows. Each row is one text sample with toxicity scores on five dimensions: race and origin-based bias, gender and sexuality-based bias, religious bias, ability bias, and violence and abuse. The individual scores and their sum are provided in separate columns for easy analysis. Preprocessing removed duplicates, excluded texts with a high proportion of numbers and symbols, kept only texts longer than 1000 characters, and normalized Unicode, whitespace, quotation marks, hyphenated words, and bullet points. For the `filtered` configuration, samples were clustered by their toxicity scores using the BIRCH algorithm, and the largest cluster was removed to reduce the overrepresentation of non-toxic texts. The remaining data was then randomly split into an 80% training set and a 20% validation set.
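The preprocessing steps described above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the dataset author's actual pipeline. The NFKC normalization form, the 0.3 ratio threshold, and the function names `normalize`, `nonalpha_ratio`, and `clean` are all assumptions made for illustration; the description states only the 1000-character minimum. Quotation-mark, hyphenation, and bullet-point normalization are left out.

```python
import re
import unicodedata

MIN_CHARS = 1000          # from the description: keep texts longer than 1000 characters
MAX_NONALPHA_RATIO = 0.3  # hypothetical threshold; the description does not state one


def normalize(text: str) -> str:
    """Apply Unicode normalization (NFKC is an assumption) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def nonalpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not letters
    (digits, punctuation, and symbols all count)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalpha() for c in chars) / len(chars)


def clean(texts):
    """Normalize each text, then drop short, symbol-heavy, and duplicate texts."""
    seen = set()
    kept = []
    for raw in texts:
        text = normalize(raw)
        if len(text) <= MIN_CHARS:
            continue  # too short
        if nonalpha_ratio(text) > MAX_NONALPHA_RATIO:
            continue  # mostly numbers and symbols
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        kept.append(text)
    return kept
```

Deduplicating after normalization means that two texts differing only in whitespace or compatibility characters are treated as the same sample. Whether the original pipeline ordered the steps this way is not stated.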

Provided by:

agentlans
