wwydmanski/blog-feedback
Hugging Face. Updated 2023-02-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/wwydmanski/blog-feedback
Resource description:
---
task_categories:
- tabular-regression
- tabular-classification
tags:
- tabular
size_categories:
- 10K<n<100K
---
## Source
Source: [UCI](https://archive.ics.uci.edu/ml/datasets/BlogFeedback)
## Data Set Information:
This data originates from blog posts. The raw HTML-documents
of the blog posts were crawled and processed.
The prediction task associated with the data is the prediction
of the number of comments in the upcoming 24 hours. In order
to simulate this situation, we choose a basetime (in the past)
and select the blog posts that were published at most
72 hours before the selected base date/time. Then, we calculate
all the features of the selected blog posts from the information
that was available at the basetime, therefore each instance
corresponds to a blog post. The target is the number of
comments that the blog post received in the next 24 hours
relative to the basetime.
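
The selection rule and target described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used to build the dataset: `build_instance` and its feature names are hypothetical, the interval boundaries (inclusive or exclusive) are a guess because the description does not specify them, and expressing attribute 61 in hours is likewise an assumption.

```python
from datetime import datetime, timedelta

DAY = timedelta(hours=24)


def build_instance(published, comment_times, basetime):
    """Return (features, target) for one post, or None if it is not selected.

    Hypothetical helper for illustration; boundary handling is an assumption.
    """
    age = basetime - published
    # Keep only posts published at most 72 hours before the basetime.
    if age < timedelta(0) or age > timedelta(hours=72):
        return None
    # Features may only use information available at the basetime.
    before = [t for t in comment_times if t <= basetime]
    features = {
        "total_before": len(before),                          # attribute 51
        "last_24h": sum(t > basetime - DAY for t in before),  # attribute 52
        "hours_since_publication": age / timedelta(hours=1),  # attribute 61 (unit assumed)
    }
    # Target: comments received in the 24 hours after the basetime.
    target = sum(basetime < t <= basetime + DAY for t in comment_times)
    return features, target


# Made-up post with comments 1, 5, 30, 50 and 60 hours after publication.
published = datetime(2011, 3, 1, 8, 0)
comments = [published + timedelta(hours=h) for h in (1, 5, 30, 50, 60)]
basetime = published + timedelta(hours=48)
features, target = build_instance(published, comments, basetime)
```

With this made-up post, three comments fall before the basetime, and the two comments at 50 and 60 hours form the target.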
In the train data, the basetimes were in the years
2010 and 2011. In the test data, the basetimes were
in February and March 2012. This simulates the real-world
situation in which training data from the past is available
to predict events in the future.
The train data was generated from different basetimes that may
temporally overlap. If you simply split the train data
into disjoint partitions, the underlying time intervals may
therefore still overlap. You should use the provided, temporally
disjoint train and test splits in order to ensure that the
evaluation is fair.
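
To see why a plain split of the train data can leak, the small sketch below counts how many basetimes select the same post. The dates are made up and `selected` is a hypothetical helper; the only rule taken from the description is the 72-hour window.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)


def selected(published, basetime):
    """True if a post published at `published` is selected for `basetime`."""
    return timedelta(0) <= basetime - published <= WINDOW


# One made-up post and four made-up daily basetimes.
published = datetime(2011, 5, 10, 12, 0)
basetimes = [datetime(2011, 5, day, 0, 0) for day in (11, 12, 13, 14)]

hits = [b for b in basetimes if selected(published, b)]
print(len(hits))  # number of instances derived from this single post
```

Here the post is selected for the first three basetimes, so it contributes three instances. A random partition of the train data could place them in different folds, which is the overlap the paragraph above warns about.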
## Attribute Information:
1...50: Average, standard deviation, min, max and median of the attributes 51...60 for the source of the current blog post. By source we mean the blog on which the post appeared.
For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
51: Total number of comments before basetime
52: Number of comments in the last 24 hours before the
basetime
53: Let T1 denote the datetime 48 hours before basetime,
Let T2 denote the datetime 24 hours before basetime.
This attribute is the number of comments in the time period
between T1 and T2
54: Number of comments in the first 24 hours after the
publication of the blog post, but before basetime
55: The difference of Attribute 52 and Attribute 53
56...60:
The same features as the attributes 51...55, but
features 56...60 refer to the number of links (trackbacks),
while features 51...55 refer to the number of comments.
61: The length of time between the publication of the blog post
and basetime
62: The length of the blog post
63...262:
The 200 bag of words features for 200 frequent words of the
text of the blog post
263...269: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the basetime
270...276: binary indicator features (0 or 1) for the weekday
(Monday...Sunday) of the date of publication of the blog
post
277: Number of parent pages: we consider a blog post P as a
parent of blog post B, if B is a reply (trackback) to
blog post P.
278...280:
Minimum, maximum, average number of comments that the
parents received
281: The target: the number of comments in the next 24 hours
(relative to basetime)
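
The attribute ranges listed above can be turned into 0-based column slices, for example to pick out the bag-of-words block. This is a sketch under assumptions: the group names are hypothetical labels, and it assumes a row holds the 281 attributes in the listed order with no extra header or index column, which should be checked against the actual files.

```python
# 1-based inclusive attribute ranges, copied from the attribute list.
ATTRIBUTE_GROUPS = {
    "source_stats": (1, 50),
    "comment_counts": (51, 55),
    "link_counts": (56, 60),
    "time_since_publication": (61, 61),
    "post_length": (62, 62),
    "bag_of_words": (63, 262),
    "basetime_weekday": (263, 269),
    "publication_weekday": (270, 276),
    "parent_count": (277, 277),
    "parent_comment_stats": (278, 280),
    "target": (281, 281),
}


def columns(group):
    """Return a 0-based slice covering the given attribute group."""
    first, last = ATTRIBUTE_GROUPS[group]
    return slice(first - 1, last)


# Dummy row whose values equal their own 1-based attribute numbers.
row = list(range(1, 282))

bag_of_words = row[columns("bag_of_words")]
target = row[columns("target")][0]
total = sum(last - first + 1 for first, last in ATTRIBUTE_GROUPS.values())
```

The group sizes add up to 281, which matches the attribute list, and the bag-of-words slice contains 200 columns.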
Provider:
wwydmanski
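Summary of the original information
Dataset overview
Task categories
- Tabular regression (tabular-regression)
- Tabular classification (tabular-classification)
Tags
- Tabular (tabular)
Dataset size
- Between 10,000 and 100,000 instances
Data source
- Source: UCI
Dataset information
- The data originates from blog posts; the raw HTML documents were crawled and processed.
- The prediction task is to predict the number of comments a blog post receives in the upcoming 24 hours.
- The basetimes of the train data are in 2010 and 2011; the basetimes of the test data are in February and March 2012.
Attribute information
- 1 to 50: Average, standard deviation, minimum, maximum and median of attributes 51 to 60 for the source (the blog on which the post appeared).
- 51: Total number of comments before basetime.
- 52: Number of comments in the last 24 hours before basetime.
- 53: Number of comments between 48 hours and 24 hours before basetime.
- 54: Number of comments in the first 24 hours after publication of the blog post, but before basetime.
- 55: The difference of attribute 52 and attribute 53.
- 56 to 60: The same as attributes 51 to 55, but for links (trackbacks) instead of comments.
- 61: The length of time between the publication of the blog post and basetime.
- 62: The length of the blog post.
- 63 to 262: Bag-of-words features for 200 frequent words of the text of the blog post.
- 263 to 276: Binary indicators (0 or 1) for the weekday of the basetime and the weekday of the publication date.
- 277: Number of parent pages (blog posts to which this post is a reply or trackback).
- 278 to 280: Minimum, maximum and average number of comments that the parents received.
- 281: The target, the number of comments in the 24 hours after basetime.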



