Socioeconomic status classification of social media users

Name: Socioeconomic status classification of social media users
Creator: figshare
Published: 2025-05-01 07:15:03
License: No description available

DataCite Commons. Updated 2025-05-01; indexed 2024-07-25.

Download link:

https://figshare.com/articles/dataset/Socioeconomic_status_classification_of_social_media_users/1619703/2

Description:

This data set accompanies the following paper: Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou and Ingemar J. Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language. Proceedings of the 38th European Conference on Information Retrieval (ECIR), 2016.

Data description

- Temporal resolution: February 1, 2014 to March 21, 2015.
- data_matrix.csv: Main input file. Each line represents a user (1342 users in total). The entries below explain the dimensions (columns) related to textual content. Dimensions 1284 to 1287 contain, respectively, the ratios of user replies, mentions (of other accounts), retweets (of tweets from other accounts) and unique mentions (of other accounts) over the total number of tweets of a particular user. Dimensions 1288 to 1291 contain the log-number of followers+1, followees+1 and listings+1, and the impact score for a particular user. The definition of the impact score has been adopted from the following paper: V. Lampos, N. Aletras, D. Preotiuc-Pietro and T. Cohn. Predicting and Characterising User Impact on Twitter. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 405–413, 2014.
- sec_labels.txt: Socioeconomic status class labels for each user; 1, 2 and 3 denote the upper, middle and lower socioeconomic classes respectively. Each line of sec_labels.txt corresponds to a line of data_matrix.csv.
- voc_1grams.txt: Vocabulary index of frequent 1-grams extracted from the users' tweets. Represents dimensions 1 to 560 of data_matrix.csv.
- voc_bio_1grams.txt: Vocabulary index of 1-grams in the bio description of the users. Represents dimensions 561 to 786 of data_matrix.csv.
- voc_bio_2grams.txt: Vocabulary index of 2-grams in the bio description of the users. Represents dimensions 787 to 1083 of data_matrix.csv.
- voc_clusters.txt: Vocabulary index used in the formation of clusters.
- voc_clusters_ids.csv: Each line contains the 1-gram ids (line numbers) from voc_clusters.txt that are members of a cluster. In total, 200 clusters were derived, represented by dimensions 1084 to 1283 of data_matrix.csv.
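The column layout described above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the data set: the block names, `split_row` and `load_dataset` are hypothetical, and the loader assumes that data_matrix.csv is comma-separated with no header row and that sec_labels.txt holds one integer label per line. Check both assumptions against the downloaded files.

```python
import csv

# 1-based, inclusive column ranges taken from the data description.
# The block names themselves are hypothetical labels chosen for this sketch.
FEATURE_BLOCKS = {
    "tweet_1grams": (1, 560),
    "bio_1grams": (561, 786),
    "bio_2grams": (787, 1083),
    "clusters": (1084, 1283),
    "interaction_ratios": (1284, 1287),  # replies, mentions, retweets, unique mentions
    "profile_stats": (1288, 1291),       # log followers+1, followees+1, listings+1, impact
}
N_COLUMNS = 1291


def split_row(row):
    """Split one user's feature vector into the named blocks."""
    if len(row) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(row)}")
    # Convert each 1-based inclusive range into a 0-based Python slice.
    return {name: row[start - 1:end] for name, (start, end) in FEATURE_BLOCKS.items()}


def load_dataset(matrix_path="data_matrix.csv", labels_path="sec_labels.txt"):
    """Read features and labels, assuming a header-less comma-separated matrix."""
    with open(matrix_path, newline="") as f:
        rows = [[float(v) for v in record] for record in csv.reader(f) if record]
    with open(labels_path) as f:
        labels = [int(line) for line in f if line.strip()]
    # Each label line corresponds to the matrix line at the same position.
    if len(rows) != len(labels):
        raise ValueError("data_matrix.csv and sec_labels.txt differ in length")
    return [split_row(r) for r in rows], labels
```

With the files in the working directory, `users, labels = load_dataset()` would return one dictionary of feature blocks per user together with the class labels (1, 2 or 3).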


Provider:

figshare

Created:

2015-12-12
