
The RepLab 2014 Dataset

Source: Research Data Australia (indexed 2024-12-14)
Download link:
https://researchdata.edu.au/replab-2014-dataset/1307635
Description:
RepLab 2014 used Twitter data in English and Spanish. For the reputation dimensions task, the data set is the same as in RepLab 2013. This corpus consists of a collection of tweets referring to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. RepLab 2014 will use only the automotive and banking subsets. Crawling was performed from 1 June 2012 to 31 December 2012, using each entity’s canonical name as the query. For each entity, at least 2,200 tweets were collected: at least the first 700 tweets in the timeline are used as the training set, and at least the last 1,500 tweets are reserved for the test set. The corpus also comprises additional background tweets for each entity (up to 50,000, with large variability across entities).

Note that the final number of available tweets in these sets may be lower, since some posts may have been deleted or made private by their authors. To respect Twitter’s terms of service, the organizers do not provide the contents of the tweets; the tweet identifiers can be used to retrieve the texts of the posts, similar to the mechanism used in the TREC Microblog Track in 2011 and 2012. Each tweet is categorized into one of the following reputation dimensions: Products/Services, Innovation, Workplace, Citizenship, Governance, Leadership, Performance and Undefined.

For the author profiling task, the data set consists of over 8,000 Twitter profiles (all with at least 1,000 followers) related to the automotive and banking domains. Each profile consists of (i) author name; (ii) profile URL; and (iii) the last 600 tweets published by the author at crawling time. Reputation experts will manually identify the opinion makers (i.e. authors with reputational influence) and annotate them as “Influencer”. All profiles that are not considered opinion makers will be assigned the “Non-Influencer” label. If a profile cannot be classified into one of these categories, it will be labelled “Undecidable”. Each opinion maker will be categorized as journalist, professional, authority, activist, investor, company, or celebrity. The data set will be split into training and test sets. The estimated proportion is 30% and 70% respectively, although the exact splits will be given later.
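The chronological split described above for the reputation dimensions task (earliest tweets for training, latest for test) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Tweet` record, its `timestamp` field, and the `split_timeline` helper are hypothetical names, not part of the distributed corpus, which ships its own split and only tweet identifiers.

```python
from typing import NamedTuple

# The eight reputation dimensions listed in the description.
REPUTATION_DIMENSIONS = (
    "Products/Services", "Innovation", "Workplace", "Citizenship",
    "Governance", "Leadership", "Performance", "Undefined",
)


class Tweet(NamedTuple):
    tweet_id: str   # identifier used to retrieve the tweet text
    timestamp: int  # hypothetical field: posting time as Unix seconds


def split_timeline(tweets, n_train=700, n_test=1500):
    """Return (train, test): the earliest n_train and the latest n_test tweets."""
    ordered = sorted(tweets, key=lambda t: t.timestamp)
    if len(ordered) < n_train + n_test:
        # Refuse to build overlapping sets from a timeline that is too short.
        raise ValueError("timeline too short for a disjoint split")
    return ordered[:n_train], ordered[-n_test:]
```

With exactly 2,200 tweets per entity the two sets cover the whole timeline. With more, the tweets in the middle are left unassigned by this sketch, which may differ from how the organizers handled them.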
Provider:
RMIT University, Australia