austrotox
收藏AustroTox 数据集概述
基本信息
- 数据集名称: AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
- 任务类别: 文本分类
- 语言: 德语
- 标签: 毒性、冒犯性、仇恨、透视主义、主观性、片段
数据集描述
AustroTox 是一个用于基于目标的奥地利德语冒犯性语言检测的数据集,源自一个新闻论坛,并包含了奥地利德语方言。该数据集包含 4,562 条用户评论。
数据字段说明
- Index: 评论的索引。
- Article Title: 评论所发布文章的标题。
- Comment: 来自 Jigsaw Toxic Comment Classification Challenge 的文本,该文本已被分类。
- Label: 表示聚合标签是否为“有毒/冒犯性”(1)或“无毒/非冒犯性”(0)。
- Annotators not toxic: 将文本标注为“无毒/非冒犯性”的标注者ID。
- Annotators toxic: 将文本标注为“有毒/冒犯性”的标注者ID。
- Manually cleaned: 指示标注是否经过手动清理。
- Label fine: 帖子的细粒度标签。
- Tags: 文本片段及其对应的标签。
相关论文
- 标题: AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
- 出处: Findings of the Association for Computational Linguistics: ACL 2024
- 链接: https://aclanthology.org/2024.findings-acl.713/
- DOI: 10.18653/v1/2024.findings-acl.713
- 摘要: 该研究引入了为冒犯性语言检测标注的数据集,其特点在于包含了奥地利德语方言。除了二元冒犯性分类外,还识别了每条评论中构成粗俗语言或代表冒犯性陈述目标的文本片段。研究评估了微调的 Transformer 模型以及零样本和少样本设置下的大语言模型。结果表明,微调模型在检测粗俗方言等语言特性方面表现优异,而大语言模型在检测 AustroTox 中的冒犯性方面表现出更优的性能。
引用信息
如果使用此数据集,请引用:
@inproceedings{pachinger-etal-2024-austrotox, title = "{A}ustro{T}ox: A Dataset for Target-Based {A}ustrian {G}erman Offensive Language Detection", author = "Pachinger, Pia and Goldzycher, Janis and Planitzer, Anna and Kusa, Wojciech and Hanbury, Allan and Neidhardt, Julia", editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek", booktitle = "Findings of the Association for Computational Linguistics: ACL 2024", month = aug, year = "2024", address = "Bangkok, Thailand", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2024.findings-acl.713/", doi = "10.18653/v1/2024.findings-acl.713", pages = "11990--12001", abstract = "Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently, such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned Transformer models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox." }




