
heegyu/toxic-spans

Hugging Face | Updated 2023-03-06 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/heegyu/toxic-spans
Description:
---
license: cc0-1.0
---

# Toxic Spans

TOXICSPANS contains the 11,035 posts we annotated for toxic spans. The unique posts are actually 11,006, since a few were duplicates and were removed in subsequent experiments. A few other posts were used as quiz questions to check the reliability of candidate annotators and were also discarded in subsequent experiments.

Original data: https://github.com/ipavlopoulos/toxic_spans/tree/master/ACL2022

- train set: 10,006 posts
- test set: 1,000 posts

## Columns

- probability = a dict with the first and the last character offsets of each token (that was rated by at least one annotator as toxic) as a key, and the average toxicity as a value
- position = the character offsets of all the toxic spans (avg toxicity > 50%) found by the annotators (ground truth)
- text = a dict with the text of each token that was rated by at least one annotator as toxic as a key, and the average toxicity as a value
- type = the type of toxicity of each toxic span
- support = the number of annotators per post
- text_of_post = the text of the post
- position_probability = the average toxicity of each character offset that was found by at least one annotator as toxic
- toxic = (not in original) true if the probability of at least 1 token is greater than 0.5

### Sample

```
{
  "probability": {
    "(5, 11)": 1.0,
    "(286, 294)": 0.6666666667,
    "(120, 126)": 0.6666666667,
    "(350, 356)": 0.6666666667
  },
  "position": [
    5, 6, 7, 8, 9, 10,
    120, 121, 122, 123, 124, 125,
    286, 287, 288, 289, 290, 291, 292, 293,
    350, 351, 352, 353, 354, 355
  ],
  "text": {
    "stupid": 1.0,
    "ignorant": 0.6666666667,
    "Stupid": 0.6666666667
  },
  "type": {
    "profane\/obscene": 0.3333333333,
    "insult": 0.6666666667
  },
  "support": 3,
  "text_of_post": "Yes, stupid on steroids does afflict the nation. The biggest problem, of course, is they either don't see themselves as stupid, or are so proud of the fact they are they have no intention of remedying the situation.  In fact, that's the definition of stupid in my book: You know you're ignorant, proud of it, and have no intention of alleviating it. Stupid. \n\nI wonder how they'd like their doctors to say to them \"Oh, I didn't go to medical school; that's for elites. The need for a formal education is fake news. I studied at home for a couple of years and got an alternative medical education. Trust me, I'm as good a doctor as any. Now, when did you want to schedule that surgery?\" I wonder how they'd like an unlicensed pilot in charge of getting them from point A to Point B? \n\nI think they're all just lazy. They want all the benefits of an education, but don't want to put in the time.",
  "position_probability": {
    "5": 1.0, "6": 1.0, "7": 1.0, "8": 1.0, "9": 1.0, "10": 1.0,
    "286": 0.6666666667, "287": 0.6666666667, "288": 0.6666666667, "289": 0.6666666667,
    "290": 0.6666666667, "291": 0.6666666667, "292": 0.6666666667, "293": 0.6666666667,
    "120": 0.6666666667, "121": 0.6666666667, "122": 0.6666666667,
    "123": 0.6666666667, "124": 0.6666666667, "125": 0.6666666667,
    "350": 0.6666666667, "351": 0.6666666667, "352": 0.6666666667,
    "353": 0.6666666667, "354": 0.6666666667, "355": 0.6666666667
  },
  "toxic": true
}
```
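Because `position` stores individual character offsets rather than ranges, a reader will often want to group them back into contiguous spans and slice the post text. The following is a minimal sketch of that step, assuming a record is available as a plain Python dict with the `position` and `text_of_post` fields described above. The helper names `group_offsets` and `extract_spans` and the shortened `example` record are hypothetical and not part of the dataset.

```python
from typing import List, Tuple


def group_offsets(offsets: List[int]) -> List[Tuple[int, int]]:
    """Merge character offsets into half-open (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if spans and off == spans[-1][1]:
            # Offset continues the current span: extend its end by one.
            spans[-1] = (spans[-1][0], off + 1)
        else:
            # Gap found: start a new one-character span.
            spans.append((off, off + 1))
    return spans


def extract_spans(record: dict) -> List[str]:
    """Return the substring of the post covered by each toxic span."""
    text = record["text_of_post"]
    return [text[start:end] for start, end in group_offsets(record["position"])]


# Hypothetical record, shortened from the first sentence of the sample post.
example = {
    "text_of_post": "Yes, stupid on steroids does afflict the nation.",
    "position": [5, 6, 7, 8, 9, 10],
}

print(group_offsets(example["position"]))
print(extract_spans(example))
```

The half-open convention is chosen to match the keys in `probability`, where `(5, 11)` covers offsets 5 through 10.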

Provider:
heegyu
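Summary of original information

Dataset overview

Dataset name

  • Name: Toxic Spans

Dataset contents

Dataset structure

  • Columns:
    • probability: a dict whose keys are the first and last character offsets of each token rated toxic by at least one annotator, and whose values are the average toxicity.
    • position: the character offsets of all toxic spans found by the annotators (average toxicity above 50%).
    • text: a dict whose keys are the text of each token rated toxic by at least one annotator, and whose values are the average toxicity.
    • type: the type of toxicity of each toxic span.
    • support: the number of annotators per post.
    • text_of_post: the text of the post.
    • position_probability: the average toxicity of each character offset marked toxic by at least one annotator.
    • toxic: true if the toxicity probability of at least one token is greater than 0.5.

Sample data

  • Sample: one concrete record is provided, showing how the toxicity information and its related attributes are stored as dicts and lists.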
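The column descriptions suggest that `position` and `toxic` can both be re-derived from `probability` by applying the 0.5 threshold. The sketch below illustrates that relationship under some assumptions: the record is a plain dict whose `probability` keys are strings such as `"(5, 11)"`, as in the sample, and the end offset is exclusive. The function names `parse_probability` and `derive_labels` are hypothetical, and the `None` check is a precaution in case a loader pads absent keys.

```python
import ast
from typing import Dict, List, Tuple


def parse_probability(prob: Dict[str, float]) -> Dict[Tuple[int, int], float]:
    """Turn string keys such as "(5, 11)" into (start, end) integer tuples."""
    return {
        tuple(ast.literal_eval(key)): float(value)
        for key, value in prob.items()
        if value is not None  # assumption: some loaders may fill missing keys with None
    }


def derive_labels(prob: Dict[str, float], threshold: float = 0.5) -> Tuple[List[int], bool]:
    """Recompute the toxic character offsets and the post-level flag."""
    spans = parse_probability(prob)
    # Every offset inside a token whose average toxicity exceeds the threshold.
    positions = sorted(
        i for (start, end), p in spans.items() if p > threshold for i in range(start, end)
    )
    # The post is toxic if at least one token exceeds the threshold.
    toxic = any(p > threshold for p in spans.values())
    return positions, toxic


# The `probability` field copied from the sample record.
sample_probability = {
    "(5, 11)": 1.0,
    "(286, 294)": 0.6666666667,
    "(120, 126)": 0.6666666667,
    "(350, 356)": 0.6666666667,
}

positions, toxic = derive_labels(sample_probability)
print(len(positions), toxic)
```

For the sample record this reproduces the 26 offsets listed under `position`; whether the same holds for every record in the dataset has not been checked here.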
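Collected summary
Dataset introduction
Background and challenges
Background overview
This dataset contains 11,035 posts annotated with toxic spans, for use in toxicity detection tasks. It is split into a training set and a test set. Each record includes the toxic character offsets, the toxicity types, the post text, and related fields, which makes it suitable for toxicity identification research in natural language processing.
The above content was collected and summarized by 遇见数据集.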