
Tmall Repeat Purchase Prediction dd

Alibaba Cloud Tianchi | Updated 2026-05-30 | Indexed 2024-03-07
Download link:
https://tianchi.aliyun.com/dataset/143958
Resource description:
DSA Tmall Repeat Purchase Prediction Dataset

### 2.3 Drawbacks of XGBoost

Before LightGBM was proposed, the best-known GBDT tool was XGBoost, a decision-tree algorithm based on pre-sorting. The basic idea of this tree-construction approach is as follows. First, pre-sort all features by their values. Second, when traversing split points, find the best split point on a feature at a cost of O(#data). Finally, once the best split point for a feature is found, split the data into left and right child nodes.

The advantage of the pre-sorting algorithm is that it finds split points exactly. Its drawbacks are also clear. First, it consumes a lot of memory: the algorithm has to store both the feature values and the result of sorting them (for example, the sorted indices, kept so that split points can be computed quickly later), which takes twice the memory of the training data. Second, it is expensive in time: the split gain has to be computed at every split point that is traversed. Finally, it is not cache-friendly. After pre-sorting, each feature accesses the gradients in an effectively random order, and that order differs from feature to feature, so the access pattern cannot be optimized for the cache. In addition, when growing the tree level by level, an array mapping row indices to leaf indices must be accessed randomly, again in a different order for each feature, which causes further cache misses.
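The pre-sorted scan described above can be illustrated with a minimal Python sketch for a single feature. This is not XGBoost's actual implementation: the function name `best_split_presorted`, the parameter `reg_lambda`, and the toy data are assumptions made for illustration, and the gain is a simplified form of the usual second-order gain (no constant factor and no complexity penalty).

```python
import numpy as np

def best_split_presorted(feature, grad, hess, reg_lambda=1.0):
    """Scan one pre-sorted feature and return (threshold, gain, sorted_indices).

    Hypothetical helper for illustration only, not an XGBoost API.
    """
    # Pre-sort step: the sorted index array is kept in addition to the
    # feature values, which is the extra memory cost noted in the text.
    order = np.argsort(feature, kind="stable")
    # Gradients are gathered in this feature's sorted order; every feature
    # has a different order, so these reads are effectively random access.
    x, g, h = feature[order], grad[order], hess[order]

    G, H = g.sum(), h.sum()
    parent_score = G * G / (H + reg_lambda)

    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    # O(#data) scan: one gain evaluation per candidate split point.
    for i in range(len(x) - 1):
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:
            continue  # no valid threshold between two equal values
        GR, HR = G - GL, H - HL
        gain = (GL * GL / (HL + reg_lambda)
                + GR * GR / (HR + reg_lambda)
                - parent_score)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain, order

# Toy data (assumed): squared loss with an initial prediction of 0,
# so the gradient is -y and the hessian is 1 for every row.
feature = np.array([10.0, 1.0, 12.0, 2.0, 11.0, 3.0])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
threshold, gain, order = best_split_presorted(feature, -y, np.ones_like(y))
print(threshold, gain)
```

In this sketch the `order` array is exactly the stored sorting result that the text counts as extra memory, and the loop body is the per-split-point gain computation that makes the scan costly.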
Provider:
Alibaba Cloud Tianchi
Created:
2023-01-04
Collected summary
Dataset introduction
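Background and challenges
Background overview
This dataset, named 天猫复购预测dd (Tmall Repeat Purchase Prediction dd), is intended mainly for repeat-purchase prediction tasks. The data files comprise several CSV files (training, test, and user information) plus a compressed archive of user logs.
The above content was collected and summarized by 遇见数据集.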