grammarly/detexd-benchmark
Dataset Overview
Basic Information
- Dataset name: DeTexD: A Benchmark Dataset for Delicate Text Detection
- License: Apache-2.0
- Task category: text classification
- Language: English
- Dataset size: 1K<n<10K
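Dataset Structure
Data Instances
- text: the text to be classified
- annotator_1: score given by annotator 1 (0-5)
- annotator_2: score given by annotator 2 (0-5)
- annotator_3: score given by annotator 3 (0-5)
- label: averaged binary score (>=3), either "negative" (0) or "positive" (1)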
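Data Fields
- text: the text to be classified
- annotator_1: score given by annotator 1 (0-5)
- annotator_2: score given by annotator 2 (0-5)
- annotator_3: score given by annotator 3 (0-5)
- label: averaged binary score, either "negative" (0) or "positive" (1)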
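The relationship between the three annotator scores and `label` can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "averaged binary score (>=3)" means the mean of the three scores is compared against 3. The function name `derive_label` and the sample score triples are hypothetical.

```python
from statistics import mean

THRESHOLD = 3  # cut-off quoted in the label description (>=3)


def derive_label(annotator_1, annotator_2, annotator_3, threshold=THRESHOLD):
    """Return 1 ("positive") if the mean score reaches the threshold, else 0.

    Assumption: the label is the thresholded mean of the three scores.
    The dataset card does not spell out the exact aggregation rule.
    """
    scores = (annotator_1, annotator_2, annotator_3)
    for score in scores:
        # Scores are documented as lying in the range 0-5.
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range 0-5: {score}")
    return int(mean(scores) >= threshold)


# Hypothetical score triples, not taken from the dataset.
for triple in [(2, 3, 4), (3, 3, 2), (0, 0, 0), (5, 5, 5)]:
    print(triple, derive_label(*triple))
```

If the published labels turn out to follow a different rule (for example, binarizing each score first and then taking a majority), only the last line of `derive_label` would need to change.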
Data Splits
| Split | Number of examples |
|---|---|
| test | 1023 |
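A possible way to load the single `test` split and inspect its label balance is sketched below. It assumes the identifier on the first line, `grammarly/detexd-benchmark`, is a Hugging Face Hub repository ID and that the `datasets` package is installed. The helper names `load_detexd_test` and `label_counts` are hypothetical.

```python
from collections import Counter


def load_detexd_test():
    """Load the test split (needs the `datasets` package and network access).

    Assumption: "grammarly/detexd-benchmark" is a Hugging Face Hub dataset ID.
    """
    # Imported lazily so label_counts() stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("grammarly/detexd-benchmark", split="test")


def label_counts(rows):
    """Count how many rows carry each value of the `label` field."""
    return Counter(row["label"] for row in rows)


# Usage (downloads the data on first call):
#   ds = load_detexd_test()
#   print(len(ds))           # the split table lists 1023 examples
#   print(label_counts(ds))
```

`label_counts` accepts any iterable of dict-like rows, so it can also be tried on a handful of hand-written records before downloading anything.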
Citation Information
@inproceedings{chernodub-etal-2023-detexd,
    title = "{D}e{T}ex{D}: A Benchmark Dataset for Delicate Text Detection",
    author = "Yavnyi, Serhii and
      Sliusarenko, Oleksii and
      Razzaghi, Jade and
      Mo, Yichen and
      Hovakimyan, Knar and
      Chernodub, Artem",
    booktitle = "The 7th Workshop on Online Abuse and Harms (WOAH)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.woah-1.2",
    pages = "14--28",
    abstract = "Over the past few years, much research has been conducted to identify and regulate toxic language. However, few studies have addressed a broader range of sensitive texts that are not necessarily overtly toxic. In this paper, we introduce and define a new category of sensitive text called {``}delicate text.{''} We provide the taxonomy of delicate text and present a detailed annotation scheme. We annotate DeTexD, the first benchmark dataset for delicate text detection. The significance of the difference in the definitions is highlighted by the relative performance deltas between models trained each definitions and corpora and evaluated on the other. We make publicly available the DeTexD Benchmark dataset, annotation guidelines, and baseline model for delicate text detection.",
}



