Matrix430/CONDA

Name: Matrix430/CONDA
Creator: Matrix430
Published: 2022-11-30 07:03:52
License: AFL-3.0

Source: Hugging Face. Updated 2022-11-30; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Matrix430/CONDA

Resource overview:

---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- afl-3.0
multilinguality:
- monolingual
pretty_name: CONDA
size_categories:
- 10K<n<100K
source_datasets:
- original
tags:
- CONDA
task_categories:
- text-classification
- token-classification
task_ids:
- intent-classification
---

# Dataset Card for CONDA

## Table of Contents
- [Dataset Description](#dataset-description)
- [Abstract](#dataset-summary)
- [Leaderboards](#leaderboards)
- [Evaluation Metrics](#evaluation-metrics)
- [Languages](#languages)
- [Video](#video)
- [Citation Information](#citation-information)

## Dataset Description
- **Homepage:** [CONDA](https://github.com/usydnlp/CONDA)
- **Paper:** [CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection](https://arxiv.org/abs/2106.06213)
- **Point of Contact:** [Caren Han](mailto:caren.han@sydney.edu.au)

## Dataset Summary
Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets.

## Leaderboards
The Codalab leaderboard can be found at: https://codalab.lisn.upsaclay.fr/competitions/7827

### Evaluation Metrics
**JSA** (Joint Semantic Accuracy) is used for ranking. An utterance is deemed correctly analysed only if both utterance-level and all the token-level labels including Os are correctly predicted.

Besides, the F1 score of **utterance-level** E(xplicit) and I(mplicit) classes, **token-level** T(oxicity), D(ota-specific), S(game Slang) classes will be shown on the leaderboard (but not used as the ranking metric).

## Languages
English

## Video
Please enjoy a video presentation covering the main points from our paper:

<p align="center">

[![ACL_video](https://img.youtube.com/vi/qRCPSSUuf18/0.jpg)](https://www.youtube.com/watch?v=qRCPSSUuf18)

</p>

## Citation Information
```
@inproceedings{weld-etal-2021-conda,
    title = "{CONDA}: a {CON}textual Dual-Annotated dataset for in-game toxicity understanding and detection",
    author = "Weld, Henry and Huang, Guanghao and Lee, Jean and Zhang, Tongshu and Wang, Kunze and Guo, Xinghong and Long, Siqu and Poon, Josiah and Han, Caren",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.213",
    doi = "10.18653/v1/2021.findings-acl.213",
    pages = "2406--2416",
}
```

Provided by:

Matrix430

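Summary of original information

Dataset overview

Basic information

Name: CONDA
Language: English
License: AFL-3.0
Multilinguality: monolingual
Size: 10K<n<100K
Source data: original
Tags: CONDA, text classification, token classification
Task categories: text classification, token classification
Task IDs: intent classification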

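Dataset description

Purpose: in-game toxic language detection, supporting joint intent classification and slot filling analysis.
Content: 45K utterances from 12K conversations, drawn from the chat logs of 1.9K completed Dota 2 matches.
Features: proposes a robust dual semantic-level toxicity framework that handles utterance-level and token-level patterns as well as rich contextual chat history.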
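
The dual annotation described above (one intent label per utterance plus one slot label per token, with chat history as context) could be represented roughly as follows. This is only a sketch: the class name `AnnotatedUtterance`, its field names, and the label sets are hypothetical and not taken from the dataset files. The label letters E, I, T, D, S and O are the ones named in the dataset card, and reusing "O" at the utterance level is an assumption. The example record is made up for illustration.

```python
from dataclasses import dataclass
from typing import List

# Label letters named in the dataset card. These sets may be incomplete, and
# reusing "O" (other) at the utterance level is an assumption.
UTTERANCE_LABELS = {"E", "I", "O"}
TOKEN_LABELS = {"T", "D", "S", "O"}


@dataclass
class AnnotatedUtterance:
    """One chat line with an utterance-level intent and per-token slot labels."""

    tokens: List[str]
    intent: str         # utterance-level label
    slots: List[str]    # one token-level label per token
    context: List[str]  # earlier chat lines from the same conversation

    def __post_init__(self) -> None:
        # Dual annotation only makes sense if tokens and slots line up.
        if len(self.tokens) != len(self.slots):
            raise ValueError("each token needs exactly one slot label")
        if self.intent not in UTTERANCE_LABELS:
            raise ValueError(f"unknown intent label: {self.intent}")
        unknown = set(self.slots) - TOKEN_LABELS
        if unknown:
            raise ValueError(f"unknown slot labels: {sorted(unknown)}")


# Made-up record for illustration only; not a row from CONDA.
example = AnnotatedUtterance(
    tokens=["gg", "wp"],
    intent="O",
    slots=["S", "S"],
    context=["well played all"],
)
```

Keeping the tokens and slot labels as parallel lists makes the length check trivial and matches the usual layout for joint intent classification and slot filling data.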

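Evaluation metrics

Primary metric: Joint Semantic Accuracy (JSA), used for ranking.
Other metrics: F1 scores at the utterance level for the Explicit (E) and Implicit (I) classes, and at the token level for the Toxicity (T), Dota-specific (D), and game Slang (S) classes.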
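
As a rough illustration of the metrics above, the sketch below computes JSA as the share of utterances whose utterance-level label and every token-level label (including O) are predicted correctly, following the definition in the dataset card. The function names `joint_semantic_accuracy` and `class_f1` are hypothetical, and the one-vs-rest F1 shown here is an assumption about how a per-class score might be computed, not the leaderboard's official scorer.

```python
from typing import Sequence


def joint_semantic_accuracy(
    gold_intents: Sequence[str],
    pred_intents: Sequence[str],
    gold_slots: Sequence[Sequence[str]],
    pred_slots: Sequence[Sequence[str]],
) -> float:
    """Fraction of utterances with a correct intent AND all token labels correct."""
    n = len(gold_intents)
    if not (n == len(pred_intents) == len(gold_slots) == len(pred_slots)):
        raise ValueError("all inputs must cover the same utterances")
    if n == 0:
        return 0.0
    correct = 0
    for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots):
        # List equality also fails when the two label sequences differ in length.
        if gi == pi and list(gs) == list(ps):
            correct += 1
    return correct / n


def class_f1(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """One-vs-rest F1 for a single label over two aligned flat label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for three utterances, for illustration only.
gold_i = ["E", "O", "I"]
pred_i = ["E", "O", "O"]
gold_s = [["T", "O"], ["S"], ["O", "D"]]
pred_s = [["T", "O"], ["O"], ["O", "D"]]
jsa = joint_semantic_accuracy(gold_i, pred_i, gold_s, pred_s)
```

In this toy input only the first utterance has both levels right, so JSA is one third: the second fails on a token label and the third on the utterance label. This shows why JSA is stricter than either level's accuracy taken alone.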

