Bike Sharing Dataset
Bike Sharing Data Science Project
About the Bike Sharing Dataset
Overview
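Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, is automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently more than 500 bike sharing programs around the world, comprising over 500,000 bicycles. These systems attract great interest because of their important role in traffic, environmental, and health issues.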
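Apart from their interesting real-world applications, the characteristics of the data these systems generate make them attractive for research. Unlike other transport services such as bus or subway, bike sharing systems explicitly record the duration of each trip and its departure and arrival positions. This feature turns a bike sharing system into a virtual sensor network that can be used to sense mobility in a city. Hence, most important events in the city are expected to be detectable by monitoring these data.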
Attribute Information
Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv:
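- instant: record index
- dteday: date
- season: season (1: spring, 2: summer, 3: fall, 4: winter)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour (0 to 23)
- holiday: whether the day is a holiday
- weekday: day of the week
- workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit: weather situation
  - 1: clear, few clouds, partly cloudy
  - 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist
  - 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
  - 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
- temp: normalized temperature in Celsius, computed as (t - t_min)/(t_max - t_min) with t_min = -8 and t_max = +39 (hourly scale only)
- atemp: normalized feels-like temperature in Celsius, computed as (t - t_min)/(t_max - t_min) with t_min = -16 and t_max = +50 (hourly scale only)
- hum: normalized humidity; values are divided by 100 (the maximum)
- windspeed: normalized wind speed; values are divided by 67 (the maximum)
- casual: number of casual users
- registered: number of registered users
- cnt: total number of rented bikes, including both casual and registered users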
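The normalization formulas in the list above can be inverted to recover values on the original scale. The following is a minimal sketch of that inversion; the helper names (`denormalize`, `temp_to_celsius`, and so on) are hypothetical and are not part of the dataset or the project code.

```python
# Hypothetical helpers that invert the normalization described in the
# attribute list. They are illustrative only, not part of the project.

def denormalize(value, v_min, v_max):
    """Invert (t - t_min) / (t_max - t_min) to recover t."""
    return value * (v_max - v_min) + v_min


def temp_to_celsius(temp):
    # temp uses t_min = -8 and t_max = +39
    return denormalize(temp, -8.0, 39.0)


def atemp_to_celsius(atemp):
    # atemp uses t_min = -16 and t_max = +50
    return denormalize(atemp, -16.0, 50.0)


def humidity_to_percent(hum):
    # hum was divided by 100 (the maximum)
    return hum * 100.0


def windspeed_to_raw(windspeed):
    # windspeed was divided by 67 (the maximum)
    return windspeed * 67.0
```

For example, a normalized `temp` of 0.0 maps back to -8 °C and a value of 1.0 maps back to +39 °C.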
Descriptive Analysis
The dataset is split into training, validation, and test sets:

    # Dataloader is the project's own helper class; its import is not shown here.
    dataloader = Dataloader('Bike-Sharing-Dataset/hour.csv')
    train, val, test = dataloader.getData()
    fullData = dataloader.getFullData()

    # Column groups used throughout the analysis.
    category_features = ['season', 'holiday', 'mnth', 'hr', 'weekday', 'workingday', 'weathersit']
    number_features = ['temp', 'atemp', 'hum', 'windspeed']

    features = category_features + number_features
    target = ['cnt']

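The `Dataloader` class is not shown in this document, so how it actually splits the data is unknown. Purely as an illustration, a shuffled three-way split with pandas might look like the sketch below. The function name `split_frame`, the 70/15/15 proportions, and the seed are assumptions, not the project's logic.

```python
import numpy as np
import pandas as pd


def split_frame(df, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows and cut them into train, validation, and test frames."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test


# Tiny synthetic frame standing in for hour.csv (made-up values).
demo = pd.DataFrame({"hr": np.arange(100) % 24, "cnt": np.arange(100)})
demo_train, demo_val, demo_test = split_frame(demo)
print(len(demo_train), len(demo_val), len(demo_test))
```

Because the records are hourly and ordered in time, a chronological split may be a better fit than a shuffled one for some evaluation goals.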
Get the column names of the data frame:

    print(list(fullData.columns))

Print the first two rows to explore the data:

    print(fullData.head(2))

Get summary statistics for each column:

    print(fullData[number_features].describe())
    print(fullData[category_features].astype('category').describe())

Missing Value Analysis
Check whether the data contains any NULL values:

    print(fullData.isnull().any())

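`isnull().any()` only reports whether a column has at least one missing entry. When the number of missing entries matters, `isnull().sum()` gives a per-column count. A small sketch on a made-up frame (the `demo` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing temperature reading.
demo = pd.DataFrame({"temp": [0.2, np.nan, 0.4], "cnt": [10, 20, 30]})

has_null = demo.isnull().any()     # per column: is anything missing?
null_counts = demo.isnull().sum()  # per column: how many entries are missing?

print(has_null)
print(null_counts)
```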
Outlier Analysis
Box Plots

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One box plot of cnt overall, then cnt grouped by five features.
    fig, axes = plt.subplots(nrows=3, ncols=2)
    fig.set_size_inches(15, 15)
    sns.boxplot(data=train, y="cnt", orient="v", ax=axes[0][0])
    sns.boxplot(data=train, y="cnt", x="mnth", orient="v", ax=axes[0][1])
    sns.boxplot(data=train, y="cnt", x="weathersit", orient="v", ax=axes[1][0])
    sns.boxplot(data=train, y="cnt", x="workingday", orient="v", ax=axes[1][1])
    sns.boxplot(data=train, y="cnt", x="hr", orient="v", ax=axes[2][0])
    sns.boxplot(data=train, y="cnt", x="temp", orient="v", ax=axes[2][1])

    # Label each panel.
    axes[0][0].set(ylabel='Count', title="Box Plot On Count")
    axes[0][1].set(xlabel='Month', ylabel='Count', title="Box Plot On Count Across Months")
    axes[1][0].set(xlabel='Weather Situation', ylabel='Count', title="Box Plot On Count Across Weather Situations")
    axes[1][1].set(xlabel='Working Day', ylabel='Count', title="Box Plot On Count Across Working Day")
    axes[2][0].set(xlabel='Hour Of The Day', ylabel='Count', title="Box Plot On Count Across Hour Of The Day")
    axes[2][1].set(xlabel='Temperature', ylabel='Count', title="Box Plot On Count Across Temperature")

Removing Outliers

    # sns.histplot(..., kde=True) stands in for sns.distplot,
    # which has been deprecated since seaborn 0.11.
    sns.histplot(train[target[-1]], kde=True)

    print("Samples in train set with outliers: {}".format(len(train)))
    # Keep only the rows whose cnt lies within 1.5 * IQR of the quartiles.
    q1 = train.cnt.quantile(0.25)
    q3 = train.cnt.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    train_preprocessed = train.loc[(train.cnt >= lower_bound) & (train.cnt <= upper_bound)]
    print("Samples in train set without outliers: {}".format(len(train_preprocessed)))
    sns.histplot(train_preprocessed.cnt, kde=True)

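The interquartile-range rule used above can be wrapped in a reusable function and checked on a small example. This is a minimal sketch; the function name `remove_iqr_outliers`, the parameter `k`, and the toy data are assumptions for illustration.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df.loc[(df[column] >= lower) & (df[column] <= upper)]


# Made-up counts with one obvious outlier (500).
demo = pd.DataFrame({"cnt": [10, 12, 11, 13, 12, 11, 10, 500]})
cleaned = remove_iqr_outliers(demo, "cnt")
print(len(demo), len(cleaned))
```

Here Q1 is 10.75 and Q3 is 12.25, so the accepted range is [8.5, 14.5] and only the value 500 is dropped.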
Correlation Analysis
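
    import numpy as np

    matrix = train[number_features + target].corr()
    # Build the heatmap mask from a copy of the matrix: zeroing the lower
    # triangle (diagonal included) keeps it visible, while the remaining
    # non-zero entries are truthy and are therefore hidden.
    heat = np.array(matrix)
    heat[np.tril_indices_from(heat)] = False
    fig, ax = plt.subplots()
    fig.set_size_inches(20, 10)
    sns.heatmap(matrix, mask=heat, vmax=1.0, vmin=0.0, square=True, annot=True, cmap="Reds")
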
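The heatmap call above passes a numeric array as `mask` and relies on non-zero entries being treated as "hide this cell". An explicit boolean mask expresses the same intent more directly. The sketch below builds one on made-up data (the `demo` frame and its column values are assumptions) and uses `DataFrame.where` to show which cells would remain visible, without plotting anything.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the numeric features and the target.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((50, 3)), columns=["temp", "hum", "cnt"])
corr = demo.corr()

# True marks a cell to hide: everything strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Cells that a call such as sns.heatmap(corr, mask=mask) would still draw;
# hidden cells become NaN here.
visible = corr.where(~mask)
print(visible)
```

A boolean mask also avoids the edge case in which a correlation of exactly zero in the upper triangle would be read as "show".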
Model Selection
Metrics Overview
- Mean squared error (MSE)
- Root mean squared logarithmic error (RMSLE)
- R² score
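To make the three metrics concrete, the sketch below computes each of them with NumPy using their standard definitions. The function names and the sample values are assumptions for illustration; the project code itself relies on scikit-learn for these numbers.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)


def rmsle(y_true, y_pred):
    """Root of the mean squared difference of log(1 + y); needs y >= 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred), rmsle(y_true, y_pred), r2(y_true, y_pred))
```

RMSLE penalizes relative rather than absolute errors, which can suit count targets such as `cnt` that range from a handful of rentals to several hundred.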
Model Evaluation
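
    from prettytable import PrettyTable
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR, NuSVR

    x_train = train_preprocessed[features].values
    y_train = train_preprocessed[target].values.ravel()
    val = val.sort_values(by=target)
    x_val = val[features].values
    y_val = val[target].values.ravel()
    x_test = test[features].values

    table = PrettyTable()
    table.field_names = ["Model", "Mean Squared Error", "R² score"]

    models = [
        SGDRegressor(max_iter=1000, tol=1e-3),
        Lasso(alpha=0.1),
        ElasticNet(random_state=0),
        Ridge(alpha=.5),
        SVR(gamma='auto', kernel='linear'),
        SVR(gamma='auto', kernel='rbf'),
        BaggingRegressor(),
        # KNeighborsRegressor replaces the original KNeighborsClassifier,
        # which is a classifier and not suited as a regression base estimator.
        BaggingRegressor(KNeighborsRegressor(), max_samples=0.5, max_features=0.5),
        NuSVR(gamma='auto'),
        RandomForestRegressor(random_state=0, n_estimators=300)
    ]

    # Fit every candidate on the training set and score it on the validation set.
    for model in models:
        model.fit(x_train, y_train)
        y_res = model.predict(x_val)

        mse = mean_squared_error(y_val, y_res)
        score = model.score(x_val, y_val)

        table.add_row([type(model).__name__, format(mse, '.2f'), format(score, '.2f')])

    print(table)
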
Random Forest
Random Forest Model

    import numpy as np
    from sklearn.metrics import mean_squared_log_error

    table = PrettyTable()
    table.field_names = ["Model", "Dataset", "MSE", 'RMSLE', "R² score"]
    model = RandomForestRegressor(random_state=0, n_estimators=100)
    model.fit(x_train, y_train)

    def evaluate(x, y, dataset):
        """Add one table row with MSE, RMSLE, and R² for the given split."""
        pred = model.predict(x)

        mse = mean_squared_error(y, pred)
        score = model.score(x, y)
        rmsle = np.sqrt(mean_squared_log_error(y, pred))

        table.add_row([type(model).__name__, dataset, format(mse, '.2f'), format(rmsle, '.2f'), format(score, '.2f')])

    evaluate(x_train, y_train, 'training')
    evaluate(x_val, y_val, 'validation')

    print(table)

Feature Importance

    importances = model.feature_importances_
    # Spread of each feature's importance across the individual trees.
    std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]  # most important first

    print("Feature ranking:")

    for f in range(x_val.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))

    # Bar chart of the importances, with the per-tree spread as error bars.
    plt.figure(figsize=(18, 5))
    plt.title("Feature importances")
    plt.bar(range(x_val.shape[1]), importances[indices], color="cornflowerblue", yerr=std[indices], align="center")
    plt.xticks(range(x_val.shape[1]), [features[i] for i in indices])
    plt.xlim([-1, x_val.shape[1]])
    plt.show()

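The ranking logic above can be followed on a tiny example without fitting a model. In the sketch below, `per_tree` holds made-up importances for two imaginary trees (an assumption for illustration). Averaging over trees mirrors how a random forest aggregates per-tree importances, and `np.argsort(...)[::-1]` orders the features from most to least important.

```python
import numpy as np

features = ["temp", "hr", "hum"]

# Made-up per-tree importances: one row per tree, one column per feature.
per_tree = np.array([
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
])

importances = per_tree.mean(axis=0)      # average over the trees
std = per_tree.std(axis=0)               # spread across the trees
indices = np.argsort(importances)[::-1]  # indices sorted by descending importance

ranked = [features[i] for i in indices]
for rank, i in enumerate(indices, start=1):
    print("%d. feature %s (%f +/- %f)" % (rank, features[i], importances[i], std[i]))
```

A large spread relative to the mean suggests that the trees disagree about a feature, so its position in the ranking should be read with caution.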




