dleemiller/toxic-pairs
Source: Hugging Face. Updated: 2024-10-31. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/dleemiller/toxic-pairs
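Description:
ToxicPairs is a dataset of English sentences intended primarily for sentence-similarity tasks. It is flagged as not suitable for all audiences because it contains offensive language. It falls in the 100K to 1M size category and has four columns: sentence1, sentence2, score, and label. The content is drawn from several source datasets, including YouTube toxic comments, Jigsaw, and cyberbullying datasets. To build it, the text was classified with LlamaGuard3, fuzzily deduplicated with WordLlama, and retrieved and ranked using a BM25s index. The dataset is meant to help embedding models understand toxic content for tasks such as content moderation and toxic conversation classification.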
This dataset contains offensive language and is not suitable for all audiences. Each row holds a sentence pair, a similarity score, and a label. Labels fall into categories such as violent crimes, non-violent crimes, and sex-related crimes. The training set has 177,549 rows and the test set has 10,000 rows. The dataset was created by classifying text with LlamaGuard3 and applying fuzzy deduplication with WordLlama. Its goal is to help embedding models understand toxic content for tasks like content moderation and for benchmarks like ToxicConversationsClassification.
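The description mentions retrieval and ranking with a BM25s index but gives no detail on how pairs were formed. As a rough illustration of BM25-style ranking only, the sketch below scores candidate sentences against a query sentence in pure Python. It is not the bm25s library and not the dataset author's pipeline: the `TinyBM25` class, the whitespace tokenizer, the `k1` and `b` values, and the harmless placeholder sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for simplicity.
    return text.lower().split()


class TinyBM25:
    """Hypothetical minimal BM25 index, for illustration only."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in corpus]
        self.doc_len = [len(d) for d in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, idx):
        # Sum the BM25 contribution of each distinct query term found in doc idx.
        tf = self.tf[idx]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
        total = 0.0
        for term in set(tokenize(query)):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query, k=3, exclude=None):
        # Rank all documents except `exclude`, best first; ties break by index.
        scored = [
            (self.score(query, i), i)
            for i in range(len(self.docs))
            if i != exclude
        ]
        scored.sort(key=lambda x: (-x[0], x[1]))
        return [(i, s) for s, i in scored[:k]]


# Placeholder corpus; real rows would come from the dataset's sentence columns.
corpus = [
    "please reset my account password",
    "reset my password quickly",
    "the weather is sunny today",
    "my account was locked yesterday",
]
index = TinyBM25(corpus)

# Rank the other sentences against sentence 0, skipping the sentence itself.
hits = index.top_k(corpus[0], k=3, exclude=0)
for doc_id, bm25_score in hits:
    print(doc_id, round(bm25_score, 3), corpus[doc_id])
```

Ranking each sentence against the rest of the corpus in this way yields candidate (sentence1, sentence2) pairs. How the published `score` column was actually derived is not stated in the description, so the BM25 value here should not be read as that score.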
Provider:
dleemiller



