Tmall Repurchase Prediction Dataset
Alibaba Cloud Tianchi, updated 2026-06-09, indexed 2024-03-07
Download link:
https://tianchi.aliyun.com/dataset/142832
Resource description:
Tmall Repurchase Prediction Dataset
For personal use in Tmall repurchase prediction.
2.3 Drawbacks of XGBoost
Before LightGBM was proposed, the best-known GBDT tool was XGBoost, whose exact split-finding mode is a decision tree algorithm based on pre-sorting. The algorithm builds a tree in three steps. First, it pre-sorts every feature by its numerical values. Second, it scans the sorted values of each feature to find that feature's best split point, at a cost of O(#data) per feature. Third, once the best split point is found, it partitions the data into left and right child nodes.
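The three steps above can be sketched for a single feature as follows. This is a minimal illustration under stated assumptions, not XGBoost's actual implementation: the function names are hypothetical, the gain uses the usual second-order form with an assumed L2 term `reg_lambda`, and the per-leaf complexity penalty is omitted.

```python
from typing import List, Optional, Tuple


def presort_feature(values: List[float]) -> List[int]:
    """Step 1: return row indices ordered by feature value (done once up front)."""
    return sorted(range(len(values)), key=values.__getitem__)


def best_split_for_feature(
    values: List[float],
    grad: List[float],
    hess: List[float],
    sorted_idx: List[int],
    reg_lambda: float = 1.0,  # assumed L2 regularization strength
) -> Tuple[float, Optional[float]]:
    """Step 2: one O(#data) pass over the pre-sorted rows.

    Returns (best_gain, threshold); threshold is None if no split helps.
    """
    total_g, total_h = sum(grad), sum(hess)
    parent_score = total_g * total_g / (total_h + reg_lambda)

    best_gain, best_threshold = 0.0, None
    left_g = left_h = 0.0
    for pos in range(len(sorted_idx) - 1):
        row, next_row = sorted_idx[pos], sorted_idx[pos + 1]
        # Note: grad/hess are read in feature-sorted order, not row order.
        left_g += grad[row]
        left_h += hess[row]
        if values[row] == values[next_row]:
            continue  # no valid threshold between equal values
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = 0.5 * (
            left_g * left_g / (left_h + reg_lambda)
            + right_g * right_g / (right_h + reg_lambda)
            - parent_score
        )
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[row] + values[next_row]) / 2
    return best_gain, best_threshold


def partition(
    values: List[float], rows: List[int], threshold: float
) -> Tuple[List[int], List[int]]:
    """Step 3: send each row to the left or right child."""
    left = [r for r in rows if values[r] <= threshold]
    right = [r for r in rows if values[r] > threshold]
    return left, right
```

In a full learner, step 2 would be repeated for every feature at every node, and the feature with the largest gain would be chosen before step 3 runs.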
The advantage of pre-sorting is that it finds split points exactly. Its drawbacks are equally clear. First, memory consumption is high: the algorithm stores the feature values and also the sorting results (for example, the sorted indices kept so that later split points can be computed quickly), which takes roughly twice the memory of the training data. Second, the time cost is high: a split gain must be computed for every candidate split point visited. Third, it is unfriendly to cache optimization. After pre-sorting, each feature accesses the gradients in an effectively random order, and that order differs from feature to feature, so the accesses cannot be optimized for the cache. In addition, each time a level of the tree is grown, an array mapping row indices to leaf indices must be accessed at random, again in a different order for each feature, which causes many more cache misses.
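The memory and access-pattern points can be made concrete with a small sketch. It assumes each stored index takes about as much space as a feature value, and it uses the average distance between consecutively visited gradient slots as a rough proxy for cache unfriendliness; the helper names and the synthetic random data are illustrative only.

```python
import random


def build_sorted_indices(columns):
    """Per-feature row indices in sorted order, kept alongside the raw values."""
    return [sorted(range(len(col)), key=col.__getitem__) for col in columns]


def stored_element_counts(columns, sorted_indices):
    """Count stored feature values and stored sorted indices."""
    n_values = sum(len(col) for col in columns)
    n_indices = sum(len(idx) for idx in sorted_indices)
    return n_values, n_indices


def mean_access_stride(access_order):
    """Average distance between consecutively accessed gradient slots."""
    jumps = [abs(b - a) for a, b in zip(access_order, access_order[1:])]
    return sum(jumps) / len(jumps)


if __name__ == "__main__":
    rng = random.Random(0)
    n_rows = 10_000
    columns = [[rng.random() for _ in range(n_rows)] for _ in range(3)]
    sorted_indices = build_sorted_indices(columns)

    n_values, n_indices = stored_element_counts(columns, sorted_indices)
    print("feature values stored:", n_values)
    print("sorted indices stored:", n_indices)

    print("stride of a sequential scan:", mean_access_stride(list(range(n_rows))))
    for f, idx in enumerate(sorted_indices):
        print(f"stride when scanning feature {f}:", mean_access_stride(idx))
```

One index is stored per feature value, which is where the roughly doubled footprint comes from. Each feature also visits the gradient array in its own scattered order, so no single memory layout of the gradients suits all features.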
Provider:
Alibaba Cloud Tianchi
Created:
2022-12-10
Collected summary
Dataset introduction

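Background and challenges
Background overview
The Tmall Repurchase Prediction Dataset contains three CSV files: user information, training data, and test data, with file sizes ranging from 4.34 MB to 733.05 MB. The dataset is intended mainly for repurchase prediction, but its detailed contents and application scenarios are not specified in the listing.
The summary above was collected and generated by 遇见数据集.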



