lorelupo/dadit_italian_twitter
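Dataset Overview
Dataset name
- DADIT
Dataset contents
- User data: 20K Italian Twitter users, a sample representative of the Italian Twitter population.
- Tweets: 30M tweets spanning 2013 to 2023.
  - Metadata included: creation time, like count, retweet count, and whether the tweet is a retweet.
- User information:
  - Bios
  - Profile pictures
  - Location
  - Other: registration date, tweet count, number of accounts followed, follower count, and so on.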
Dataset uses
- Task categories:
  - Text classification
  - Zero-shot classification
  - Feature extraction
- Language: Italian
- Tags: Twitter, demographics, gender, age, Italian
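Dataset access
- Only dehydrated tweet IDs and user IDs are currently shared.
- Researchers need to contact the dataset maintainers to obtain the full dataset.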
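
Because only IDs are shared, a typical first step is to organize them per user before requesting or rehydrating the full content. The sketch below is a minimal illustration of that step, not the maintainers' tooling. It assumes the IDs arrive as CSV text with columns named `user_id` and `tweet_id`; those column names, the helper `group_tweet_ids_by_user`, and the sample IDs are all hypothetical, so check the released files for the actual layout.

```python
import csv
import io
from collections import defaultdict


def group_tweet_ids_by_user(csv_text):
    """Map each user ID to the list of tweet IDs attributed to that user.

    Assumes a header row with the hypothetical columns `user_id` and `tweet_id`.
    """
    tweets_by_user = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Keep IDs as strings: 64-bit tweet IDs lose precision if parsed as floats.
        tweets_by_user[row["user_id"]].append(row["tweet_id"])
    return dict(tweets_by_user)


# Made-up IDs for illustration only; they are not taken from DADIT.
sample = "user_id,tweet_id\n101,9001\n101,9002\n202,9003\n"
grouped = group_tweet_ids_by_user(sample)
print(grouped)
```

Keeping the IDs as strings is a deliberate choice here, since tools that read them as floating-point numbers can silently corrupt long tweet IDs.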
Related literature
- Paper: DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods.
- Paper link: https://aclanthology.org/2024.lrec-main.386
Contact
- Contacts:
  - Lorenzo Lupo
  - Paul Bose
  - Carlo Schwarz
- Contact emails:
  - lorenzo.lupo2@unibocconi.it
  - paul.bose@unibocconi.it
  - carlo.schwarz@unibocconi.it
Citation
- Citation format:
@inproceedings{lupo-etal-2024-dadit-dataset,
    title = "{DADIT}: A Dataset for Demographic Classification of {I}talian {T}witter Users and a Comparison of Prediction Methods",
    author = "Lupo, Lorenzo and Bose, Paul and Habibi, Mahyar and Hovy, Dirk and Schwarz, Carlo",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.386",
    pages = "4322--4332",
    abstract = "Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users. In particular, we investigate if tweets contain valuable information for the task, since popular classifiers like M3 don{'}t leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53{\%} F1. Especially for age prediction, classifiers profit from including tweets as features. We also confirm these findings on a German test set.",
}



