RepLab 2013
OpenDataLab | Updated 2026-05-17 | Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/RepLab_2013
Description:
The RepLab 2013 dataset consists of English and Spanish Twitter data (more than 142,000 tweets). The balance between the two languages depends on the data available for each entity in the dataset. The corpus is a collection of tweets that refer to 61 selected entities from four domains: automotive, banking, universities, and music/artists. The domains were chosen to offer a variety of scenarios for reputation research. The tweets were crawled between June 1, 2012 and December 31, 2012, using each entity's canonical name as the query. At least 2,200 tweets were collected per entity: at least 700 tweets from the beginning of the timeline form the training set, and at least 1,500 tweets from the end of the timeline are reserved for the test set. The corpus also contains additional background tweets for each entity (up to 50,000 tweets, with large variation across entities). This distribution was chosen to leave a temporal gap (ideally several months) between the training and test data. Note that the number of tweets actually available in these sets may be lower, because some posts may have been deleted by their authors. To comply with Twitter's Terms of Service, the content of the tweets is not distributed; the tweet identifiers can be used to retrieve the text of the posts. A download tool is provided, similar to the mechanism used in the 2011 and 2012 TREC Microblog Tracks. For more information, see the RepLab 2013 Overview paper.
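The timeline-based split described above can be illustrated with a minimal Python sketch. It is not the organizers' tooling: the `Tweet` record, the `temporal_split` function, and the synthetic timestamps are hypothetical, and the sketch assumes each tweet identifier comes with a creation time and that exactly 700 and 1,500 tweets are taken (the description only gives these as minimums).

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Tweet(NamedTuple):
    # Hypothetical record: the real corpus distributes tweet identifiers only.
    tweet_id: int
    created_at: datetime


def temporal_split(tweets, n_train=700, n_test=1500):
    """Split one entity's tweets into (train, middle, test) by creation time.

    train  = the n_train earliest tweets
    test   = the n_test latest tweets
    middle = whatever lies between them on the timeline
    """
    if len(tweets) < n_train + n_test:
        raise ValueError("not enough tweets for the requested split")
    ordered = sorted(tweets, key=lambda t: t.created_at)
    cut = len(ordered) - n_test  # index where the test portion starts
    return ordered[:n_train], ordered[n_train:cut], ordered[cut:]


# Synthetic example: 2,500 tweets, one per hour, starting on 2012-06-01.
start = datetime(2012, 6, 1)
tweets = [Tweet(i, start + timedelta(hours=i)) for i in range(2500)]
train, middle, test = temporal_split(tweets)
print(len(train), len(middle), len(test))
```

Sorting by creation time before slicing guarantees that every training tweet predates every test tweet, which is the property the temporal gap is meant to provide.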
Provider:
OpenDataLab
Created:
2022-05-23
Collected Summary
Dataset Introduction

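Background and Challenges
Background Overview
RepLab 2013 is a multilingual Twitter dataset for reputation research, containing tweets about 61 entities from four different domains. The dataset is divided into a training set and a test set and also provides background tweets for each entity; the tweets span June to December 2012.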
The content above was collected and summarized by 遇见数据集.



