Catalonia Independence Corpus (CIC)

Name: Catalonia Independence Corpus (CIC)
Creator: Vicomtech Foundation, Basque Research and Technology Alliance
Published: 2021-01-28 21:05:09
License: No description available

Source: arXiv (updated 2021-01-28; indexed 2024-06-21)

Download link:

https://github.com/ZotovaElena/Multilingual-Stance-Detection

Resource description:

The Catalonia Independence Corpus (CIC) is a multilingual dataset for stance detection on Twitter, created by the Vicomtech Foundation, Basque Research and Technology Alliance. It contains more than 300,000 tweets in Spanish and Catalan, collected and annotated with a semi-automatic method, and is intended for studying users' stances on Catalan independence. The creation process combines user-based annotation, analysis of the relationships between users, and keyword and topic models, which yields a large, balanced dataset. CIC supports both monolingual analysis and cross-lingual experiments, in particular with large multilingual language models such as mBERT and XLM-RoBERTa, which makes it useful for social media analysis and natural language processing.
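The user-based annotation and balancing mentioned in the description can be illustrated with a minimal sketch. It is not the authors' pipeline: the label names (`FAVOR`, `AGAINST`), the tweet fields (`user`, `lang`, `text`), the function names, and the downsampling strategy are all assumptions made for illustration. The idea shown is only that each tweet inherits the stance assigned to its author, after which the classes are equalized.

```python
import random
from collections import defaultdict

# Hypothetical user-level stance labels; the real corpus may use other names.
USER_STANCE = {"user_a": "FAVOR", "user_b": "AGAINST", "user_c": "FAVOR"}


def label_tweets_by_user(tweets, user_stance):
    """Give each tweet the stance of its author; skip unlabeled authors."""
    labeled = []
    for tweet in tweets:
        stance = user_stance.get(tweet["user"])
        if stance is not None:
            labeled.append({**tweet, "stance": stance})
    return labeled


def balance_by_class(labeled, seed=0):
    """Downsample every class to the size of the smallest one (an assumption)."""
    by_class = defaultdict(list)
    for item in labeled:
        by_class[item["stance"]].append(item)
    if not by_class:
        return []
    n = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, n))
    return balanced


# Toy input standing in for collected tweets (placeholder text, not real data).
tweets = [
    {"user": "user_a", "lang": "ca", "text": "example tweet 1"},
    {"user": "user_a", "lang": "ca", "text": "example tweet 2"},
    {"user": "user_b", "lang": "es", "text": "example tweet 3"},
    {"user": "user_c", "lang": "es", "text": "example tweet 4"},
    {"user": "user_x", "lang": "es", "text": "example tweet 5"},  # unlabeled author
]

labeled = label_tweets_by_user(tweets, USER_STANCE)
balanced = balance_by_class(labeled)
```

Propagating labels from users to tweets is what makes annotation at this scale feasible, at the cost of some label noise, since not every tweet by a user expresses that user's overall stance.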

Provided by:

Vicomtech Foundation, Basque Research and Technology Alliance

Created:

2021-01-28
