Matrix430/CONDA
Source: Hugging Face | Updated: 2022-11-30 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Matrix430/CONDA
Resource description:
---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- afl-3.0
multilinguality:
- monolingual
pretty_name: CONDA
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- CONDA
task_categories:
- text-classification
- token-classification
task_ids:
- intent-classification
---
# Dataset Card for CONDA
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Leaderboards](#leaderboards)
- [Evaluation Metrics](#evaluation-metrics)
- [Languages](#languages)
- [Video](#video)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [CONDA](https://github.com/usydnlp/CONDA)
- **Paper:** [CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection](https://arxiv.org/abs/2106.06213)
- **Point of Contact:** [Caren Han](mailto:caren.han@sydney.edu.au)
## Dataset Summary
Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets.
## Leaderboards
The Codalab leaderboard can be found at: https://codalab.lisn.upsaclay.fr/competitions/7827
### Evaluation Metrics
**JSA** (Joint Semantic Accuracy) is used for ranking. An utterance is deemed correctly analysed only if its utterance-level label and all of its token-level labels, including Os, are correctly predicted.
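The JSA definition above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's official scorer: the function name `joint_semantic_accuracy`, the input layout (one intent label and one list of token labels per utterance), and the toy labels are all assumptions made for the example.

```python
from typing import Sequence


def joint_semantic_accuracy(
    true_intents: Sequence[str],
    pred_intents: Sequence[str],
    true_slots: Sequence[Sequence[str]],
    pred_slots: Sequence[Sequence[str]],
) -> float:
    """Fraction of utterances whose intent AND every token label are correct."""
    n = len(true_intents)
    if not (n == len(pred_intents) == len(true_slots) == len(pred_slots)):
        raise ValueError("all inputs must have one entry per utterance")
    if n == 0:
        return 0.0

    correct = 0
    for t_int, p_int, t_slot, p_slot in zip(
        true_intents, pred_intents, true_slots, pred_slots
    ):
        # O tokens count too: the full token sequence must match exactly.
        if t_int == p_int and list(t_slot) == list(p_slot):
            correct += 1
    return correct / n


# Toy data (hypothetical, not taken from CONDA).
gold_intents = ["E", "I", "E"]
pred_intents = ["E", "I", "I"]          # third utterance: wrong intent
gold_slots = [["T", "O"], ["O", "D"], ["S"]]
pred_slots = [["T", "O"], ["O", "O"], ["S"]]  # second utterance: wrong token

jsa = joint_semantic_accuracy(gold_intents, pred_intents, gold_slots, pred_slots)
```

In this toy case only the first utterance is fully correct, so the score is one third: a single wrong token label or a wrong intent is enough to fail the whole utterance.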
In addition, the F1 scores of the **utterance-level** E(xplicit) and I(mplicit) classes and of the **token-level** T(oxicity), D(ota-specific), and S(game Slang) classes are shown on the leaderboard, but they are not used for ranking.
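As a rough illustration of a per-class F1 score, the sketch below computes F1 for one target label over a flat list of labels. The leaderboard's exact procedure is not described here, so the flat per-label counting, the function name `class_f1`, and the toy labels are assumptions.

```python
from typing import Sequence


def class_f1(
    true_labels: Sequence[str], pred_labels: Sequence[str], target: str
) -> float:
    """F1 for a single class, treating every other label as negative."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p == target and t == target:
            tp += 1
        elif p == target:
            fp += 1
        elif t == target:
            fn += 1

    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy token-level labels (hypothetical, not taken from CONDA).
gold_tokens = ["T", "O", "D", "T", "S", "O"]
pred_tokens = ["T", "T", "D", "O", "S", "O"]

f1_per_class = {c: class_f1(gold_tokens, pred_tokens, c) for c in ("T", "D", "S")}
```

The same function applies to the utterance-level E and I classes by passing utterance labels instead of token labels.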
## Languages
English
## Video
Please enjoy a video presentation covering the main points from our paper:
<p align="center">
[Video presentation on YouTube](https://www.youtube.com/watch?v=qRCPSSUuf18)
</p>
## Citation Information
```
@inproceedings{weld-etal-2021-conda,
title = "{CONDA}: a {CON}textual Dual-Annotated dataset for in-game toxicity understanding and detection",
author = "Weld, Henry and
Huang, Guanghao and
Lee, Jean and
Zhang, Tongshu and
Wang, Kunze and
Guo, Xinghong and
Long, Siqu and
Poon, Josiah and
Han, Caren",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.213",
doi = "10.18653/v1/2021.findings-acl.213",
pages = "2406--2416",
}
```
Provider:
Matrix430
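Summary of original information
Dataset overview
Basic information
- Name: CONDA
- Language: English
- License: AFL-3.0
- Multilinguality: monolingual
- Size: 10K<n<100K
- Data source: original
- Tags: CONDA, text classification, token classification
- Task categories: text classification, token classification
- Task ID: intent classification
Dataset description
- Purpose: in-game toxic language detection, supporting joint intent classification and slot filling analysis.
- Content: 45K utterances from 12K conversations, taken from the chat logs of 1.9K completed Dota 2 matches.
- Features: proposes a robust dual semantic-level toxicity framework that handles utterance-level and token-level patterns as well as rich contextual chat history.
Evaluation metrics
- Primary metric: Joint Semantic Accuracy (JSA), used for ranking.
- Other metrics: utterance-level and token-level F1 scores for the Explicit (E), Implicit (I), Toxicity (T), Dota-specific (D), and game Slang (S) classes.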



