A Twitter dataset (with labels) for Life-Event detection
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5910363
Description:
The file contains an anonymized version of the dataset collected in the framework of the “Tsundoku” project, financed by the Autonomous Province of Trento under provincial law no. 6 of 13 December 1999 (and subsequent modifications), art. 5, “Aids for promoting research and development”, with financing approved by APIAE manager’s provision n. 691. The purpose of this project is to train a Deep Learning (DL) model capable of detecting the occurrence of a so-called Life Event (a wedding and/or the birth of a child) in a person’s life on the basis of the content she shared on social media (Twitter, in this case).
More precisely, the dataset consists of the most recent tweets (up to approximately 3200 per account) of 8 Italian-speaking and 27 English-speaking users, all of them randomly picked. It totals 74722 tweets: 20302 written in Italian and 54420 in English.
For each user (labelled ‘user_x’, with x an integer between 1 and 35), the file includes the language her tweets are written in and the list of those tweets. For each tweet, the available information comprises the text (modified as explained below), its length, the number of hashtags and user mentions it contained, the number of retweets and likes it received, two Boolean flags (“True” or “False”) indicating whether it was a quote or a reply, and a label (‘birth’, ‘wedding’ or ‘not Life-Event-related’, depending on whether the tweet refers to a birth or wedding experienced by the user).
A tweet was labelled ‘birth’ or ‘wedding’ only if two conditions held. First, the event had to “actively” involve the user: ‘Today I get married’ is labelled ‘wedding’, while ‘Today my sister gets married’ is labelled ‘not Life-Event-related’. Second, the tweet had to be unambiguous by itself: ‘My wife is pregnant’ is labelled ‘birth’, while ‘Josephine is pregnant’ is not, since the tweet alone does not make it possible to determine who Josephine is. This could perhaps be inferred from other tweets, but such contextualization is very hard to carry out and was outside the scope of this project.
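The per-user and per-tweet fields described above could be modelled roughly as follows. This is only a sketch: the description does not state the file's serialization format or its actual key names, so every class and field name here (`Tweet`, `User`, `n_hashtags`, and so on) is a hypothetical choice, and measuring length in characters is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# The three label values named in the dataset description.
LABELS = ("birth", "wedding", "not Life-Event-related")


@dataclass
class Tweet:
    """One tweet record; field names are hypothetical, not the file's keys."""
    text: str          # anonymized text
    length: int        # assumed here to be a character count
    n_hashtags: int
    n_mentions: int
    n_retweets: int
    n_likes: int
    is_quote: bool
    is_reply: bool
    label: str

    def __post_init__(self) -> None:
        # Reject anything outside the three documented labels.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


@dataclass
class User:
    user_id: str       # e.g. 'user_1' ... 'user_35'
    language: str      # language the user's tweets are written in
    tweets: List[Tweet]


def label_counts(user: User) -> Counter:
    """Count how many of a user's tweets carry each label."""
    return Counter(t.label for t in user.tweets)


def make_tweet(text: str, label: str) -> Tweet:
    """Helper for illustration only: fills the numeric fields with zeros."""
    return Tweet(text, len(text), 0, 0, 0, 0, False, False, label)


example_user = User(
    user_id="user_1",
    language="English",
    tweets=[
        make_tweet("Today I get married", "wedding"),
        make_tweet("Today my sister gets married", "not Life-Event-related"),
    ],
)
example_counts = label_counts(example_user)
```

The two example tweets are the ones used in the description to illustrate the "actively involved" rule, so `example_counts` holds one ‘wedding’ and one ‘not Life-Event-related’ entry.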
In order to comply with GDPR, each text included in the file was obtained from the original text through the following anonymization steps:
every web link was replaced by the “WEBLINK” string;
every mention of another Twitter user (for instance, “@joedoe”) was replaced by the “OTHERUSER” string;
every hashtag was replaced by the “HASHTAG” string;
every name and surname detected by the spaCy library (see https://spacy.io for more info) was replaced by the “NAME/SURNAME” string;
10% of the available words were randomly picked and erased.
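The anonymization steps listed above can be sketched as below. This is not the project's actual code, which is not published in this description. The regular expressions, the step order, and the choice to drop 10% of the words per tweet (rather than across the whole corpus) are assumptions. Name detection is passed in as a hypothetical callable, `find_names`, so that the sketch runs without spaCy; in the original pipeline that role was played by spaCy's named-entity recognition.

```python
import random
import re
from typing import Callable, Iterable, Optional

# Assumed patterns; the project's real matching rules are not documented.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def anonymize(
    text: str,
    find_names: Callable[[str], Iterable[str]],
    drop_fraction: float = 0.10,
    rng: Optional[random.Random] = None,
) -> str:
    """Apply the five anonymization steps to a single tweet text."""
    rng = rng or random.Random()

    # Steps 1-3: mask links, user mentions and hashtags.
    text = URL_RE.sub("WEBLINK", text)
    text = MENTION_RE.sub("OTHERUSER", text)
    text = HASHTAG_RE.sub("HASHTAG", text)

    # Step 4: mask every personal name reported by the detector.
    for name in find_names(text):
        text = text.replace(name, "NAME/SURNAME")

    # Step 5: erase a random fraction of the whitespace-separated words.
    words = text.split()
    n_drop = round(len(words) * drop_fraction)
    dropped = set(rng.sample(range(len(words)), n_drop))
    return " ".join(w for i, w in enumerate(words) if i not in dropped)


def toy_name_finder(text: str) -> list:
    """Stand-in for a real NER model: recognizes a single hard-coded name."""
    return [n for n in ("Josephine",) if n in text]


masked = anonymize(
    "Josephine said hi to @joedoe at https://example.com #wedding",
    toy_name_finder,
    drop_fraction=0.0,  # keep every word so the masking is easy to inspect
)
```

With `drop_fraction=0.0` the call above only masks tokens, so `masked` is `"NAME/SURNAME said hi to OTHERUSER at WEBLINK HASHTAG"`. With the default of 0.10, a seeded `random.Random` makes the word erasure reproducible.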
Created:
2022-01-27



