NoraAlt/Mawqif_Stance-Detection

Name: NoraAlt/Mawqif_Stance-Detection
Creator: NoraAlt
Published: 2024-01-18 10:11:13
License: not specified

Source: Hugging Face. Updated 2024-01-18; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/NoraAlt/Mawqif_Stance-Detection


Resource description:

---
task_categories:
- text-classification
language:
- ar
pretty_name: 'Mawqif: Stance Detection'
size_categories:
- 1K<n<10K
tags:
- Stance Detection
- Sentiment Analysis
- Sarcasm Detection
---

# Mawqif: A Multi-label Arabic Dataset for Target-specific Stance Detection

- *Mawqif* is the first Arabic dataset that can be used for target-specific stance detection.
- This is a multi-label dataset where each data point is annotated for stance, sentiment, and sarcasm.
- We benchmark the *Mawqif* dataset on the stance detection task and evaluate the performance of four BERT-based models. Our best model achieves a macro-F1 of 78.89%.

# Mawqif Statistics

- This dataset consists of **4,121** tweets in multi-dialectal Arabic. Each tweet is annotated with a stance toward one of three targets: "COVID-19 vaccine", "digital transformation", and "women empowerment". In addition, it is annotated with sentiment and sarcasm polarities.
- The following figure illustrates the distribution of labels across all targets, and the distribution per target.

<img width="738" alt="dataStat-2" src="https://user-images.githubusercontent.com/31368075/188299057-54d04e87-802d-4b0e-b7c6-56bdc1078284.png">

# Interactive Visualization

To browse an interactive visualization of the *Mawqif* dataset, please click [here](https://public.tableau.com/views/MawqifDatasetDashboard/Dashboard1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link)

- *You can click on visualization components to filter the data by target and by class. **For example,** you can click on "women empowerment" and "against" to see the tweets that express a stance against women empowerment.*

# Citation

If you find our paper and resources useful, please consider citing our work!

```
@inproceedings{alturayeif-etal-2022-mawqif,
    title = "Mawqif: A Multi-label {A}rabic Dataset for Target-specific Stance Detection",
    author = "Alturayeif, Nora Saleh and Luqman, Hamzah Abdullah and Ahmed, Moataz Aly Kamaleldin",
    booktitle = "Proceedings of the The Seventh Arabic Natural Language Processing Workshop (WANLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wanlp-1.16",
    pages = "174--184"
}
```


Provider:

NoraAlt

Summary of original information

Mawqif: A Multi-label Arabic Dataset for Target-specific Stance Detection

Overview

Name: Mawqif
Task category: text classification
Language: Arabic
Dataset size: 1K<n<10K
Tags: stance detection, sentiment analysis, sarcasm detection

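Details

Dataset type: multi-label dataset; each data point is annotated for stance, sentiment, and sarcasm.
Dataset content: 4,121 tweets in multi-dialectal Arabic, each annotated with a stance toward one of three targets: "COVID-19 vaccine", "digital transformation", and "women empowerment". Each tweet is also annotated for sentiment and sarcasm.
Model evaluation: four BERT-based models are benchmarked on the stance detection task; the best model reaches a macro-F1 of 78.89%.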
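
The macro-F1 reported above is the unweighted mean of per-class F1 scores. The following is a minimal sketch of that computation in plain Python, not the authors' evaluation code. The label strings are assumptions for illustration, and the sketch averages over every class that appears; which classes the paper averages over is not stated here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one sample is required")

    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall (0 when both are 0).
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold and predicted stance labels.
gold = ["favor", "favor", "against", "none"]
pred = ["favor", "against", "against", "none"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally regardless of its frequency, macro-F1 penalizes a model that ignores a rare stance class more than accuracy would.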

Visualization

An interactive visualization is provided; the data can be filtered by target and by class.


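Collected summary

Dataset introduction

Construction

To build Mawqif, the researchers collected 4,121 Arabic tweets covering a range of Arabic dialects. Each tweet is annotated for stance toward one specific target (COVID-19 vaccine, digital transformation, or women empowerment), as well as for sentiment and sarcasm. This multi-label annotation adds further dimensions to stance detection and also makes the dataset a resource for sentiment analysis and sarcasm detection.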

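Features

The defining feature of Mawqif is its multi-label annotation: every data point carries stance, sentiment, and sarcasm labels. It is also the first Arabic dataset for target-specific stance detection. Together, the multiple labels and multiple targets make Mawqif applicable to stance detection, sentiment analysis, and sarcasm detection.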

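Usage

To use Mawqif, load the dataset and preprocess it to suit the chosen natural language processing model; for example, Transformer-based models such as BERT can be trained on it for stance detection. The dataset also comes with an interactive visualization for browsing and filtering the data, which helps in understanding its distribution and characteristics. Cite the accompanying paper when using the dataset.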
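
The filtering by target and by class described above can also be done in code once the rows are loaded. The sketch below works on a few in-memory rows so that it runs offline. The field names (`text`, `target`, `stance`) and the label strings are assumptions for illustration; the actual column names in the published files may differ and should be checked after loading.

```python
from collections import Counter, defaultdict

# Hypothetical rows: the field names and label values are assumptions,
# not the dataset's confirmed schema.
ROWS = [
    {"text": "tweet a", "target": "women empowerment", "stance": "against"},
    {"text": "tweet b", "target": "women empowerment", "stance": "favor"},
    {"text": "tweet c", "target": "COVID-19 vaccine", "stance": "favor"},
    {"text": "tweet d", "target": "digital transformation", "stance": "favor"},
    {"text": "tweet e", "target": "COVID-19 vaccine", "stance": "against"},
]


def filter_rows(rows, target=None, stance=None):
    """Keep rows matching the given target and/or stance (None means no filter)."""
    return [
        row for row in rows
        if (target is None or row["target"] == target)
        and (stance is None or row["stance"] == stance)
    ]


def stance_distribution(rows):
    """Count stance labels separately for each target."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["target"]][row["stance"]] += 1
    return {target: dict(counter) for target, counter in counts.items()}


against_we = filter_rows(ROWS, target="women empowerment", stance="against")
print([row["text"] for row in against_we])
print(stance_distribution(ROWS))
```

Inspecting the per-target distribution before training is a cheap way to spot class imbalance within a single target.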

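Background and challenges

Background overview

In natural language processing, stance detection is an important task for understanding and analyzing the attitudes expressed in text. The Mawqif dataset was created in 2022 by Nora Saleh Alturayeif, Hamzah Abdullah Luqman, and Moataz Aly Kamaleldin Ahmed, and is the first Arabic dataset for target-specific stance detection. It contains 4,121 tweets in multi-dialectal Arabic, each annotated for stance toward one of three targets (COVID-19 vaccine, digital transformation, women empowerment) and for sentiment and sarcasm. Its release filled a gap in Arabic stance detection, gave related research a new resource, and encouraged the use of multi-label classification in Arabic language processing.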

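Current challenges

Building Mawqif involved several challenges. First, the complexity of multi-dialectal Arabic makes text preprocessing harder, and keeping the data consistent and accurate is difficult. Second, annotating stance, sentiment, and sarcasm together requires fine-grained annotation, which increases the labeling workload and places high demands on annotator expertise. Finally, evaluating performance on the stance detection task is itself challenging: when handling multiple labels, how to balance the weights of the different labels to improve overall model performance remains an open and difficult question.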
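
One common way to balance label weights is to weight each class by its inverse frequency, so that rare classes count more in the training loss. The sketch below uses the same "balanced" heuristic as scikit-learn (`n_samples / (n_classes * class_count)`) in plain Python. It is an illustration under assumed label names and counts, not the method used by the dataset's authors.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical, deliberately imbalanced stance labels.
stances = ["favor"] * 6 + ["against"] * 3 + ["none"] * 1
weights = inverse_frequency_weights(stances)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

With these weights, each class contributes the same total weight across the training set, which keeps a dominant class from driving the loss.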

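Common scenarios

Classic use cases

The classic use of Mawqif is target-specific stance detection. The dataset contains 4,121 Arabic tweets, each annotated for stance toward a specific target (COVID-19 vaccine, digital transformation, or women empowerment) and for sentiment and sarcasm. With this multi-label dataset, researchers can train and evaluate Transformer-based models such as BERT for stance detection on Arabic social media content.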

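Derived related work

Work building on Mawqif includes, among other directions, improved stance detection models, multi-task learning for stance detection, and cross-lingual stance detection models. This work has improved the accuracy of Arabic stance detection and offers a reference for stance detection in other low-resource languages. The release of the dataset has also prompted further research on building and applying multi-label datasets.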

Recent research on the dataset