Long-CoDe: A Longitudinal Twitter Dataset of Depression in the COVID Era

Indexed from NIAID Data Ecosystem on 2026-03-14

Download link:

https://zenodo.org/record/7540405

Description:

# Long-CoDe: A Longitudinal Twitter Dataset of Depression in the COVID Era

## Content of Long-CoDe

Long-CoDe contains 15 million tweets from 597 depressed users and 804 control users. The collected tweets span Jan. 1, 2019 to Dec. 31, 2022. We do not have the right to release the text of the tweets due to privacy issues. The dataset includes user IDs, tweet IDs, dates of posting, and user labels. User IDs are disguised as C00n (control) and D00n (depressed) to protect privacy and to indicate group labels. The tweets can be retrieved by tweet ID through the Twitter APIs. The data format is as follows:

| user-id | tweet-id | label |
| ------- | -------- | -------------- |
| C001 | xxxxxxxx | 0 (Control) |
| D001 | xxxxxxxx | 1 (Depression) |

## Quality

All users in both the depressed and control groups posted steadily from 2019 to 2022. Control users were drawn from 43 trending topics in 2020, spread evenly from Feb. to Dec., to cover broad interests.

## Structure

The Long-CoDe dataset is a single CSV file with tabular data. The format of the CSV file is as follows:

| user_id | tweet_id | label |
| ------- | -------- | -------------- |
| C001 | xxxxxxxx | 0 (Control) |
| D001 | xxxxxxxx | 1 (Depression) |

## Potential uses of the dataset

- General analysis of tweets from the labeled depressed users, compared with the control users.
- Benchmarking binary user classification using tweet contents or other extracted features.
- Analyzing the impact of Covid-19 on both the depression and control groups across multiple phases of the pandemic.

## Data Collection

The data collection procedure consists of four steps: (a) identifying depressed users, (b) selecting control users, (c) collecting tweets in 2020 and identifying regular users, and (d) collecting tweets from 2019 to 2022 for all selected users.
### Identifying Depressed Users

To identify depressed users, we crawled tweets containing depression self-claims, such as "I am/was/have been diagnosed with depression", using regular expressions. The crawl covered Feb. 2020 to Dec. 2020, when Covid-19 had a significant influence on society. The data was manually annotated, and we kept the users who made a valid self-claim of a depression diagnosis.

### Identifying Control Users

To create a diverse control population, we gathered users who had shown interest in a wide array of trending topics, with the aim of including users from different backgrounds. From these, we randomly selected 6,200 unique users and filtered out those who had ever made depression claims, leaving a total of 5,929 control users.

### Identifying Regular Users

For each user in both groups, we crawled the tweets (excluding retweets) from Feb. to Dec. 2020 and analyzed their posting behavior. We kept the users with stable posting behavior (between 75 and 205 tweets per month).

### Collecting tweets from 2019 to 2022

We crawled all tweets from each identified regular user over the four years from Jan. 2019 to Dec. 2022.
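
The self-claim matching and the regular-user filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the pattern `SELF_CLAIM` is a hypothetical regular expression built only from the example phrasing quoted in the description, and the function names are invented for illustration. The sketch also assumes that the 75 to 205 range is inclusive and must hold in every month from Feb. to Dec. 2020; the description does not say whether the rule was applied per month or on average.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical pattern: the description quotes only the example phrasing
# "I am/was/have been diagnosed with depression", not the actual expressions used.
SELF_CLAIM = re.compile(
    r"\bI\s+(?:am|was|have\s+been)\s+diagnosed\s+with\s+depression\b",
    re.IGNORECASE,
)


def has_self_claim(text):
    """Return True if the tweet text contains a candidate depression self-claim."""
    return SELF_CLAIM.search(text) is not None


# Feb. to Dec. 2020, the window used to analyze posting behavior.
MONTHS_2020 = [(2020, month) for month in range(2, 13)]


def is_regular_user(post_dates, low=75, high=205, months=MONTHS_2020):
    """Return True if the user posted between `low` and `high` tweets
    (inclusive, an assumption) in every month of `months`."""
    counts = Counter((d.year, d.month) for d in post_dates)
    return all(low <= counts[month] <= high for month in months)


# Usage example with synthetic dates: 100 tweets in each month of the window.
sample_dates = [date(2020, m, 1) for m in range(2, 13) for _ in range(100)]
print(has_self_claim("I was diagnosed with depression in March."))  # True
print(is_regular_user(sample_dates))  # True
```

A regular-expression match only yields candidates. According to the description, the matched tweets were then manually annotated to keep valid self-claims, a step this sketch does not attempt to reproduce.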

Created:

2023-01-16