
grammarly/detexd-benchmark

Source: Hugging Face. Updated 2023-07-10. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/grammarly/detexd-benchmark
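Overview:
DeTexD is a benchmark dataset for delicate text detection, focused on identifying emotionally charged or potentially upsetting text. Several annotators rate the riskiness of each text on a scale from 0 to 5, where 0 means no risk and 5 means high risk. The dataset also provides an averaged binary label that separates the negative class (0) from the positive class (1).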

Provider:
grammarly
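Summary of original information

Dataset overview

Basic information

  • Dataset name: DeTexD: A Benchmark Dataset for Delicate Text Detection
  • License: Apache-2.0
  • Task category: text classification
  • Language: English
  • Dataset size: 1K<n<10K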

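Dataset structure

Data instances

  • text: the text content to be classified
  • annotator_1: score given by annotator 1 (0-5)
  • annotator_2: score given by annotator 2 (0-5)
  • annotator_3: score given by annotator 3 (0-5)
  • label: averaged binary score (>=3), either "negative" (0) or "positive" (1)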

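Data fields

  • text: the text to be classified
  • annotator_1: score given by annotator 1 (0-5)
  • annotator_2: score given by annotator 2 (0-5)
  • annotator_3: score given by annotator 3 (0-5)
  • label: averaged binary score, either "negative" (0) or "positive" (1)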
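
The field list above does not spell out how the three annotator scores are combined into the binary label. The sketch below is a hedged illustration, not the dataset authors' procedure. It shows two plausible readings of "averaged binary score (>=3)": thresholding the mean of the three scores, or binarizing each score at 3 and taking the majority. The function names and the THRESHOLD constant are hypothetical, and the annotation guidelines remain the authority on the rule actually used.

```python
from statistics import mean

THRESHOLD = 3  # assumed cut-off, taken from the ">=3" in the field description


def label_from_mean(a1: int, a2: int, a3: int) -> int:
    """Reading 1 (assumption): average the three 0-5 scores, then threshold."""
    return int(mean([a1, a2, a3]) >= THRESHOLD)


def label_from_majority(a1: int, a2: int, a3: int) -> int:
    """Reading 2 (assumption): binarize each score, then take a majority vote."""
    votes = [int(score >= THRESHOLD) for score in (a1, a2, a3)]
    return int(sum(votes) >= 2)


if __name__ == "__main__":
    # The two readings can disagree, e.g. one high score and two low ones.
    for scores in [(0, 0, 0), (5, 2, 2), (3, 3, 0), (4, 5, 3)]:
        print(scores, label_from_mean(*scores), label_from_majority(*scores))
```

Comparing either function's output with the stored label column on a handful of rows is a quick way to see which reading, if either, matches the released data.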

Data splits

Split name    Number of examples
test          1023
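
Since the dataset ships only a test split, a typical first step is to load that split and check the class balance. The following is a minimal sketch, assuming the Hugging Face datasets package is installed, that the repository ID matches the page title (grammarly/detexd-benchmark), and that the label column holds integer 0/1 values. The helper names load_detexd_test and count_labels are hypothetical.

```python
def load_detexd_test():
    """Load the `test` split; assumes the `datasets` package and network access."""
    # Imported lazily so that defining this helper does not require the package.
    from datasets import load_dataset

    return load_dataset("grammarly/detexd-benchmark", split="test")


def count_labels(rows):
    """Count examples per binary label in an iterable of dict-like rows."""
    counts = {0: 0, 1: 0}
    for row in rows:
        # Assumption: `label` is stored as 0 (negative) or 1 (positive).
        counts[int(row["label"])] += 1
    return counts
```

Calling count_labels(load_detexd_test()) should return a dictionary whose two counts sum to the number of examples in the split table, if the assumptions above hold.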

Citation information

@inproceedings{chernodub-etal-2023-detexd,
    title = "{D}e{T}ex{D}: A Benchmark Dataset for Delicate Text Detection",
    author = "Yavnyi, Serhii and Sliusarenko, Oleksii and Razzaghi, Jade and Mo, Yichen and Hovakimyan, Knar and Chernodub, Artem",
    booktitle = "The 7th Workshop on Online Abuse and Harms (WOAH)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.woah-1.2",
    pages = "14--28",
    abstract = "Over the past few years, much research has been conducted to identify and regulate toxic language. However, few studies have addressed a broader range of sensitive texts that are not necessarily overtly toxic. In this paper, we introduce and define a new category of sensitive text called {``}delicate text.{''} We provide the taxonomy of delicate text and present a detailed annotation scheme. We annotate DeTexD, the first benchmark dataset for delicate text detection. The significance of the difference in the definitions is highlighted by the relative performance deltas between models trained each definitions and corpora and evaluated on the other. We make publicly available the DeTexD Benchmark dataset, annotation guidelines, and baseline model for delicate text detection.",
}
