Bike Sharing Dataset

GitHub | Updated 2023-12-30 | Indexed 2024-05-31

Download link:

https://github.com/pgebert/bike-sharing-dataset

Resource overview:

Bike-sharing systems are a new generation of traditional bicycle rentals in which the whole process, from membership to rental and return, is automated. Users can rent a bicycle at one location and return it at another. There are currently more than 500 bike-sharing programs worldwide, with a combined fleet of over 500,000 bicycles. Because these systems play an important role in transportation, environmental, and public health issues, they have attracted considerable interest.

Created:

2019-02-22

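Summary of Original Information

Bike Sharing Data Science Project

About the Bike Sharing Dataset

Overview

Bike-sharing systems are a new generation of traditional bicycle rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, users can easily rent a bike at one position and return it at another. There are currently more than 500 bike-sharing programs around the world, comprising over 500,000 bicycles. These systems attract great interest because of their important role in traffic, environmental, and health issues.

Beyond their real-world applications, the characteristics of the data these systems generate make them attractive for research. Unlike other transport services such as buses or subways, bike-sharing systems explicitly record the duration of travel and the departure and arrival positions. This turns a bike-sharing system into a virtual sensor network that can be used to sense mobility in a city. Monitoring this data is therefore expected to reveal most of the important events in the city.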

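Attribute Information

Both hour.csv and day.csv contain the following fields, except hr, which is not available in day.csv:

instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
hr: hour (0 to 23)
holiday: whether the day is a holiday
weekday: day of the week
workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit: weather situation
- 1: clear, few clouds, partly cloudy
- 2: mist + cloudy, mist + broken clouds, mist + few clouds, mist
- 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
- 4: heavy rain + ice pellets + thunderstorm + mist, snow + fog
temp: normalized temperature in Celsius, computed as (t-t_min)/(t_max-t_min) with t_min=-8, t_max=+39 (hourly scale only)
atemp: normalized feels-like temperature in Celsius, computed as (t-t_min)/(t_max-t_min) with t_min=-16, t_max=+50 (hourly scale only)
hum: normalized humidity; values are divided by 100 (the maximum)
windspeed: normalized wind speed; values are divided by 67 (the maximum)
casual: count of casual users
registered: count of registered users
cnt: count of total rental bikes, including both casual and registered users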
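
The normalization described above can be inverted to recover raw readings, which is often easier to interpret in plots. The following is a minimal sketch that uses only the constants listed in the attribute descriptions; the function names `denormalize_min_max` and `denormalize_row` are hypothetical helpers, not part of the project, and the temperature constants are stated for the hourly data only.

```python
# Constants taken from the attribute descriptions (hourly scale).
TEMP_MIN, TEMP_MAX = -8.0, 39.0      # temp range in degrees Celsius
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0   # atemp range in degrees Celsius
HUM_MAX = 100.0                      # hum was divided by this value
WINDSPEED_MAX = 67.0                 # windspeed was divided by this value


def denormalize_min_max(value, lower, upper):
    """Invert the min-max scaling (t - t_min) / (t_max - t_min)."""
    return value * (upper - lower) + lower


def denormalize_row(row):
    """Hypothetical helper: map one record's normalized readings back to raw values."""
    return {
        "temp_c": denormalize_min_max(row["temp"], TEMP_MIN, TEMP_MAX),
        "atemp_c": denormalize_min_max(row["atemp"], ATEMP_MIN, ATEMP_MAX),
        "hum_percent": row["hum"] * HUM_MAX,
        "windspeed": row["windspeed"] * WINDSPEED_MAX,
    }


# Illustrative record, not taken from the dataset.
example = {"temp": 0.5, "atemp": 0.5, "hum": 0.8, "windspeed": 0.25}
raw = denormalize_row(example)
print(raw)
```

A normalized temp of 0.5 sits halfway between -8 and 39, so it maps back to 15.5 degrees Celsius.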

Descriptive Analysis

The dataset is split into training, validation, and test sets:

    # Dataloader is the project's own helper class; its import is not shown in the source.
    dataloader = Dataloader('Bike-Sharing-Dataset/hour.csv')
    train, val, test = dataloader.getData()
    fullData = dataloader.getFullData()

    # Column groups used throughout the analysis
    category_features = ['season', 'holiday', 'mnth', 'hr', 'weekday', 'workingday', 'weathersit']
    number_features = ['temp', 'atemp', 'hum', 'windspeed']

    features = category_features + number_features
    target = ['cnt']
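
The source does not show how the project's Dataloader produces its three subsets. As a rough illustration of the idea only, the sketch below shuffles records with a fixed seed and cuts them into train, validation, and test lists. The function `split_rows`, the 60/20/20 proportions, and the seed are all assumptions made for this example, not the project's actual behavior.

```python
import random


def split_rows(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into train, validation and test lists."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Synthetic records standing in for rows of hour.csv.
rows = [{"instant": i, "cnt": i % 7} for i in range(100)]
train_rows, val_rows, test_rows = split_rows(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

For time-ordered data such as hourly rentals, a chronological split (earliest records for training, latest for testing) may be a better fit than a random shuffle, since it avoids evaluating on hours that are interleaved with the training data.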

Get the column names of the data frame:

    print(list(fullData.columns))

Print the first two rows of the dataset to explore the data:

    print(fullData.head(2))

Get summary statistics for each column:

    print(fullData[number_features].describe())
    print(fullData[category_features].astype('category').describe())

Missing Value Analysis

Check whether the data contains any NULL values:

    print(fullData.isnull().any())

Outlier Analysis

Box plots

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One box plot per grouping variable, arranged on a 3x2 grid
    fig, axes = plt.subplots(nrows=3, ncols=2)
    fig.set_size_inches(15, 15)
    sns.boxplot(data=train, y="cnt", orient="v", ax=axes[0][0])
    sns.boxplot(data=train, y="cnt", x="mnth", orient="v", ax=axes[0][1])
    sns.boxplot(data=train, y="cnt", x="weathersit", orient="v", ax=axes[1][0])
    sns.boxplot(data=train, y="cnt", x="workingday", orient="v", ax=axes[1][1])
    sns.boxplot(data=train, y="cnt", x="hr", orient="v", ax=axes[2][0])
    sns.boxplot(data=train, y="cnt", x="temp", orient="v", ax=axes[2][1])

    axes[0][0].set(ylabel='Count', title="Box Plot On Count")
    axes[0][1].set(xlabel='Month', ylabel='Count', title="Box Plot On Count Across Months")
    axes[1][0].set(xlabel='Weather Situation', ylabel='Count', title="Box Plot On Count Across Weather Situations")
    axes[1][1].set(xlabel='Working Day', ylabel='Count', title="Box Plot On Count Across Working Day")
    axes[2][0].set(xlabel='Hour Of The Day', ylabel='Count', title="Box Plot On Count Across Hour Of The Day")
    axes[2][1].set(xlabel='Temperature', ylabel='Count', title="Box Plot On Count Across Temperature")

Removing outliers

    # histplot(kde=True) is the replacement for seaborn's deprecated distplot
    sns.histplot(train[target[-1]], kde=True)

    print("Samples in train set with outliers: {}".format(len(train)))
    # Keep only rows whose count lies within 1.5 * IQR of the quartiles
    q1 = train.cnt.quantile(0.25)
    q3 = train.cnt.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    train_preprocessed = train.loc[(train.cnt >= lower_bound) & (train.cnt <= upper_bound)]
    print("Samples in train set without outliers: {}".format(len(train_preprocessed)))
    sns.histplot(train_preprocessed.cnt, kde=True)
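
To make the interquartile-range rule concrete, here is a small standard-library sketch of the same filter applied to a handful of made-up counts. The helper names `iqr_bounds` and `drop_outliers` are hypothetical, and the sample values are illustrative rather than taken from the dataset. The `inclusive` quantile method interpolates linearly between data points, which should correspond to the default behavior of pandas' `quantile`.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR limits."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Illustrative hourly counts with one extreme value.
counts = [5, 12, 18, 20, 22, 25, 30, 34, 40, 977]
lower, upper = iqr_bounds(counts)
kept = drop_outliers(counts)
print("bounds:", lower, upper)
print("kept {} of {} samples".format(len(kept), len(counts)))
```

Because rental counts cannot be negative, only the upper limit has any effect in practice; the rule mainly trims unusually busy hours.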

Model Selection

Metrics overview

Mean squared error (MSE)
Root mean squared logarithmic error (RMSLE)
R² score
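
For readers who want to see what these three metrics compute, the following standard-library sketch implements their usual textbook definitions. The function names `mse`, `rmsle`, and `r2` are chosen for this example, and the sample numbers are made up; in practice the scikit-learn functions used elsewhere in this project would normally be preferred.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    """Root of the mean squared difference of log(1 + value); inputs must be non-negative."""
    squared_logs = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_logs) / len(y_true))


def r2(y_true, y_pred):
    """One minus the ratio of residual variation to total variation around the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative rental counts and predictions.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
print("MSE:", mse(y_true, y_pred))
print("RMSLE:", rmsle(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```

RMSLE penalizes relative rather than absolute errors, which suits count data where an error of 10 bikes matters more at a quiet hour than at a busy one.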

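Model Evaluation

    from prettytable import PrettyTable
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVR, NuSVR

    # Build feature matrices and target vectors for each split
    x_train = train_preprocessed[features].values
    y_train = train_preprocessed[target].values.ravel()
    val = val.sort_values(by=target)
    x_val = val[features].values
    y_val = val[target].values.ravel()
    x_test = test[features].values

    table = PrettyTable()
    table.field_names = ["Model", "Mean Squared Error", "R² score"]

    models = [
        SGDRegressor(max_iter=1000, tol=1e-3),
        Lasso(alpha=0.1),
        ElasticNet(random_state=0),
        Ridge(alpha=.5),
        SVR(gamma='auto', kernel='linear'),
        SVR(gamma='auto', kernel='rbf'),
        BaggingRegressor(),
        # k-NN classifier as base estimator, as listed in the source;
        # KNeighborsRegressor is the usual choice for a numeric target
        BaggingRegressor(KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
        NuSVR(gamma='auto'),
        RandomForestRegressor(random_state=0, n_estimators=300)
    ]

    # Fit every candidate on the training set and score it on the validation set
    for model in models:
        model.fit(x_train, y_train)
        y_res = model.predict(x_val)

        mse = mean_squared_error(y_val, y_res)
        score = model.score(x_val, y_val)

        table.add_row([type(model).__name__, format(mse, '.2f'), format(score, '.2f')])

    print(table)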

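Random Forest

Random forest model

    import numpy as np
    from prettytable import PrettyTable
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, mean_squared_log_error

    table = PrettyTable()
    table.field_names = ["Model", "Dataset", "MSE", "RMSLE", "R² score"]
    model = RandomForestRegressor(random_state=0, n_estimators=100)
    model.fit(x_train, y_train)

    def evaluate(x, y, dataset):
        # Score the fitted model on one split and append the result to the table
        pred = model.predict(x)

        mse = mean_squared_error(y, pred)
        score = model.score(x, y)
        rmsle = np.sqrt(mean_squared_log_error(y, pred))

        table.add_row([type(model).__name__, dataset, format(mse, '.2f'), format(rmsle, '.2f'), format(score, '.2f')])

    evaluate(x_train, y_train, 'training')
    evaluate(x_val, y_val, 'validation')

    print(table)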

Feature Importance

    import matplotlib.pyplot as plt
    import numpy as np

    # Mean importance per feature, with the spread across individual trees
    importances = model.feature_importances_
    std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]

    print("Feature ranking:")

    for f in range(x_val.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))

    # Bar chart of importances, sorted from most to least important
    plt.figure(figsize=(18, 5))
    plt.title("Feature importances")
    plt.bar(range(x_val.shape[1]), importances[indices], color="cornflowerblue", yerr=std[indices], align="center")
    plt.xticks(range(x_val.shape[1]), [features[i] for i in indices])
    plt.xlim([-1, x_val.shape[1]])
    plt.show()

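Curated Summary

Dataset Introduction

Construction

The Bike Sharing Dataset contains hourly and daily bike rental counts for 2011 and 2012, together with environmental factors such as season, weather, and temperature. The counts come from the automated rental and return logs of a single bike-sharing system, Capital Bikeshare in Washington, D.C., rather than from the more than 500 programs that exist worldwide. Because every rental and return is logged automatically, the data can act as a virtual sensor network and provides valuable support for urban traffic research.

Usage

Working with the Bike Sharing Dataset typically involves data loading, descriptive analysis, and model training and evaluation. The project's Dataloader class loads the dataset and supports initial exploration. Machine learning libraries such as scikit-learn can then be used for feature selection, model training, and performance evaluation. The dataset is particularly suited to regression analysis; commonly used models include random forest regression and support vector regression. Analyzing feature importance helps identify the key factors that influence rental counts and can guide improvements to the prediction model.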

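Background and Challenges

Background Overview

The Bike Sharing Dataset is a dataset about bike-sharing systems intended for research on urban traffic, environmental, and health issues. It records bike-share usage in Washington, D.C., over 2011 and 2012. The dataset contains hourly and daily rental records and covers features such as season, weather, and temperature. Its core research question is how to predict future rental demand from usage patterns in order to improve resource allocation and urban planning. The dataset is widely used in traffic research, urban planning, and machine learning, and has supported work on making bike-sharing systems more intelligent.

Current Challenges

Using the Bike Sharing Dataset for demand prediction raises several challenges. First, some features are highly correlated, such as temperature and feels-like temperature, which complicates model selection. Second, the data distribution is imbalanced: rentals drop sharply under extreme weather conditions, and such conditions are rare, so prediction performance tends to degrade in those cases. Third, the dataset contains outliers, such as rental fluctuations caused by extreme weather or special events, which may hurt a model's ability to generalize. Researchers also need to handle missing values and noise to ensure the data is complete and accurate. Finally, because bike-sharing systems are dynamic and urban environments are complex, the accuracy and robustness of prediction models still have room to improve.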
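
One way to spot highly correlated features such as temp and atemp is to compute pairwise Pearson correlation coefficients and flag pairs above a threshold. The sketch below does this with the standard library only. The helper names `pearson` and `highly_correlated_pairs`, the 0.9 threshold, and the sample values are all assumptions chosen for illustration; the numbers are made up and not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up normalized readings for three numeric features.
columns = {
    "temp": [0.24, 0.22, 0.30, 0.46, 0.62, 0.70],
    "atemp": [0.29, 0.27, 0.32, 0.45, 0.60, 0.65],
    "hum": [0.81, 0.45, 0.75, 0.40, 0.70, 0.50],
}
for a, b, r in highly_correlated_pairs(columns):
    print("{} and {} are highly correlated (r = {:.3f})".format(a, b, r))
```

When a pair is flagged, a common response is to drop one of the two features or to use a model that tolerates collinearity, such as a tree ensemble or a regularized linear model.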

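Common Scenarios

Classic Use Cases

The Bike Sharing Dataset is well established in traffic and urban planning research and is commonly used to analyze usage patterns of bike-sharing systems. With it, researchers can study how time of day, weather, and season affect rental counts, and use the results to improve bike dispatching and allocation strategies.

Academic Problems Addressed

The dataset supports work on key problems in urban traffic flow prediction, shared-resource allocation, and environmental sustainability. By analyzing bike-share usage data, researchers can predict rental demand at peak hours, reduce wasted resources, and provide data to support urban traffic planning.

Practical Applications

In practice, the Bike Sharing Dataset is widely used in decision support systems for bike-sharing operators. By analyzing historical data, operators can optimize where and how many bikes are deployed, improving the user experience while lowering operating costs. The dataset also provides data support for urban traffic authorities, helping them set more effective traffic policies.

Recent Research on the Dataset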