lorelupo/dadit_italian_twitter

Name: lorelupo/dadit_italian_twitter
Creator: lorelupo
Published: 2024-05-29 16:04:34
License: no description available

Hugging Face. Updated 2024-05-29. Indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/lorelupo/dadit_italian_twitter

Resource description:

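DADIT is a diachronic Italian Twitter dataset containing data on 20K Italian Twitter users who are representative of the Italian Twitter population. It includes user IDs, 30M tweets (2013 to 2023), user bios, profile pictures, location information, and other profile information. Currently only the dehydrated tweet IDs and user IDs are provided; researchers can obtain the full dataset by contacting the authors.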

Provider:

lorelupo

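Summary of original information

Dataset overview

Dataset name

DADIT

Dataset contents

User data: data on 20K Italian Twitter users, representative of the Italian Twitter population.
- Tweets: 30M tweets, spanning 2013 to 2023.
  - Metadata included: creation time, number of likes, number of retweets, whether the tweet is a retweet.
- User information:
  - Bios
  - Profile pictures
  - Location information
  - Other information: registration date, number of tweets, number of accounts followed, number of followers, etc.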

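Dataset usage

Task categories:
- Text classification
- Zero-shot classification
- Feature extraction
Language: Italian
Tags: Twitter, demographics, gender, age, Italian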

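Dataset access

Currently only the dehydrated tweet IDs and user IDs are shared.
Researchers need to contact the dataset maintainers to obtain the full dataset.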
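
Because only dehydrated IDs are shared, anyone working with them has to look the tweets up again in batches. The following is a minimal sketch of that batching step only, not the maintainers' tooling. The function name `batch_ids`, the example IDs, and the default batch size of 100 are all assumptions made for illustration; check the request limits of whichever lookup API you use, and note that the actual file layout of the shared IDs is not described here.

```python
from typing import Iterable, Iterator, List


def batch_ids(ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield batches of at most `batch_size` IDs, skipping blanks and duplicates.

    `batch_size=100` is an assumed default, not a documented limit of this dataset.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    seen = set()
    batch: List[str] = []
    for raw in ids:
        tweet_id = raw.strip()  # tolerate stray whitespace around an ID
        if not tweet_id or tweet_id in seen:
            continue  # drop empty lines and repeated IDs
        seen.add(tweet_id)
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# Hypothetical placeholder IDs, not taken from the dataset.
example_ids = ["1001", "1002", "1002", " 1003 ", "", "1004", "1005"]
print(list(batch_ids(example_ids, batch_size=2)))
# prints [['1001', '1002'], ['1003', '1004'], ['1005']]
```

Each yielded batch can then be passed to a lookup client of your choice. Keeping IDs as strings avoids precision problems that arise when large numeric IDs pass through tools that treat them as floating-point numbers.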

Contact

Contacts:
- Lorenzo Lupo
- Paul Bose
- Carlo Schwarz
Contact emails:
- lorenzo.lupo2@unibocconi.it
- paul.bose@unibocconi.it
- carlo.schwarz@unibocconi.it

Citation

Citation format:

@inproceedings{lupo-etal-2024-dadit-dataset,
    title = "{DADIT}: A Dataset for Demographic Classification of {I}talian {T}witter Users and a Comparison of Prediction Methods",
    author = "Lupo, Lorenzo and Bose, Paul and Habibi, Mahyar and Hovy, Dirk and Schwarz, Carlo",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.386",
    pages = "4322--4332",
    abstract = "Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users. In particular, we investigate if tweets contain valuable information for the task, since popular classifiers like M3 don{'}t leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53{\%} F1. Especially for age prediction, classifiers profit from including tweets as features. We also confirm these findings on a German test set.",
}
