
Catalonia Independence Corpus (CIC)

Source: arXiv. Updated: 2021-01-28. Indexed: 2024-06-21.
Download link:
https://github.com/ZotovaElena/Multilingual-Stance-Detection
Overview:
The Catalonia Independence Corpus (CIC) is a multilingual dataset for stance detection on Twitter, created by the Vicomtech Foundation, Basque Research and Technology Alliance (BRTA). It contains over 300,000 tweets in Spanish and Catalan, collected and annotated with a semi-automatic method, and is intended for studying users' stances on Catalan independence. The construction process combines user-based annotation, analysis of relations between users, and keyword and topic models, with the aim of producing a large, balanced dataset. CIC supports monolingual analysis as well as cross-lingual experiments, in particular with large multilingual language models such as mBERT and XLM-RoBERTa, which makes it relevant to social media analysis and natural language processing.
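The cross-lingual setting mentioned in the overview can be illustrated with a minimal sketch: train on tweets in one language and evaluate on the other. The records, the field names (`lang`, `text`, `stance`) and the label strings below are illustrative assumptions, not the corpus's actual schema; check the files in the repository before adapting this.

```python
from collections import Counter

# Toy records standing in for the corpus. Field names and label strings
# are assumptions made for this sketch, not the real CIC file format.
TWEETS = [
    {"lang": "es", "text": "ejemplo uno", "stance": "FAVOR"},
    {"lang": "es", "text": "ejemplo dos", "stance": "AGAINST"},
    {"lang": "ca", "text": "exemple u", "stance": "FAVOR"},
    {"lang": "ca", "text": "exemple dos", "stance": "NEUTRAL"},
]


def cross_lingual_split(records, train_lang, test_lang):
    """Return (train, test): train holds one language, test the other."""
    train = [r for r in records if r["lang"] == train_lang]
    test = [r for r in records if r["lang"] == test_lang]
    return train, test


def label_distribution(records):
    """Count stance labels, a quick way to inspect class balance."""
    return Counter(r["stance"] for r in records)


# Train on Spanish, evaluate on Catalan.
train, test = cross_lingual_split(TWEETS, "es", "ca")
print(len(train), len(test))
print(label_distribution(train))
```

The resulting `train` and `test` lists could then be passed to any multilingual classifier; swapping the two language codes gives the reverse direction.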
Provider:
Vicomtech Foundation, Basque Research and Technology Alliance (BRTA)
Created:
2021-01-28