Criteo live traffic data (Sample of 30 days of Criteo live traffic data)
Source: OpenDataLab. Updated: 2026-05-31. Indexed: 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/Criteo_live_traffic_data
Resource description:
This dataset includes the following files:
1. README.md
2. criteo_attribution_dataset.tsv.gz: the dataset itself (623 MB when compressed)
3. Experiments.ipynb: an IPython notebook with code and utilities to reproduce the results from the accompanying paper. It can also serve as a starting point for further research on this dataset. It requires Python 3.* and standard scientific libraries such as pandas, numpy, and sklearn.
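Because the archive is large, it can help to stream it row by row rather than load it in one go. The following is a minimal sketch using only the Python standard library, not the approach taken in Experiments.ipynb. It assumes the file starts with a header row naming the columns; the helper name `iter_impressions` and the tiny synthetic file are hypothetical and exist only for illustration.

```python
import csv
import gzip
import os
import tempfile
from typing import Dict, Iterator


def iter_impressions(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per impression from a gzipped, tab-separated file.

    Assumption: the first line of the file is a header row with column names.
    Values are returned as strings; convert them where needed.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Usage on a tiny synthetic file (made-up values, not real Criteo data).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
    with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["timestamp", "uid", "campaign", "click"])
        writer.writerow(["0", "u1", "c1", "0"])
        writer.writerow(["5", "u2", "c1", "1"])
    sample_rows = list(iter_impressions(sample_path))

print(len(sample_rows), sample_rows[1]["click"])
```

For the real file, the same function would be called with the path to criteo_attribution_dataset.tsv.gz. Streaming keeps memory use flat, at the cost of giving up the column-wise operations that pandas offers.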
## Dataset Description
This dataset is a sample of 30 days of Criteo live traffic data. Each row corresponds to a single impression (a banner displayed to a user). For each impression, we provide detailed contextual information, whether it was clicked, whether it led to a conversion within 30 days (regardless of whether this impression was the last click), and whether that conversion was attributed to Criteo. The data has been subsampled and anonymized to avoid disclosing proprietary elements. Below is a detailed explanation of each field; fields are tab-separated in the file:
- timestamp: Timestamp of the impression, with 0 representing the first impression. The dataset is sorted by timestamp.
- uid: Unique user identifier
- campaign: Unique identifier of the advertising campaign
- conversion: 1 if a conversion occurred within 30 days after the display, regardless of whether this impression was the last click; 0 otherwise
- conversion_timestamp: Timestamp of the conversion, or -1 if no conversion was observed
- conversion_id: Unique identifier for each conversion, allowing timeline reconstruction if needed; -1 if no conversion occurred
- attributed: 1 if the conversion is attributed to Criteo, 0 otherwise
- click: 1 if the impression was clicked, 0 otherwise
- click_pos: Position of the click prior to the conversion, with 0 indicating the first click
- click_nb: Number of clicks; greater than 1 if there were several clicks before the conversion
- cost: The price Criteo paid for this display (disclaimer: this is not the actual price, only a transformed version thereof)
- cpo: Cost per order for attributed conversions (disclaimer: this is not the actual price, only a transformed version thereof)
- time_since_last_click: Time in seconds since the last click associated with the given impression
- cat[1-9]: Contextual features that can be used to train click or conversion models. The meaning of these features is not disclosed, as it is not relevant to this study. Each column is a categorical variable. In the experiments, these variables are mapped to a fixed-dimensional space using the hashing trick (see the paper for details).
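To illustrate what mapping the `cat1` to `cat9` columns to a fixed-dimensional space with the hashing trick can look like, here is a minimal sketch. It is not the paper's implementation: the hash function (`zlib.crc32`), the default dimension, and the helper name `hash_categorical` are all assumptions chosen for illustration.

```python
import zlib
from typing import Dict, List

CAT_COLUMNS = [f"cat{i}" for i in range(1, 10)]  # cat1 .. cat9


def hash_categorical(row: Dict[str, str], dim: int = 2 ** 16) -> List[int]:
    """Map each categorical column of a row to a bucket index in [0, dim).

    The column name is included in the hashed token so that the same raw
    value in two different columns does not automatically share a bucket.
    Distinct tokens may still collide; that is inherent to the hashing trick.
    """
    indices = []
    for column in CAT_COLUMNS:
        token = f"{column}={row[column]}".encode("utf-8")
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        indices.append(zlib.crc32(token) % dim)
    return indices


# Usage with made-up category values.
example_row = {column: str(1000 + i) for i, column in enumerate(CAT_COLUMNS)}
example_indices = hash_categorical(example_row, dim=1024)
print(example_indices)
```

Each returned index marks a non-zero position in a sparse vector of length `dim`. A smaller `dim` saves memory but increases the chance that unrelated values share a bucket.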
## Key Statistics
- 2.4 GB when uncompressed
- 16.5 million impressions
- 45,000 conversions
- 700 advertising campaigns
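Figures like these could be recomputed in a single pass over the rows. The sketch below shows one plausible way to do it. It assumes that conversions are counted as distinct `conversion_id` values among rows with `conversion` equal to 1, which may not be how the numbers above were derived. The function name `summarize` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable


def summarize(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count impressions, clicks, distinct conversions and distinct campaigns.

    Rows are dicts of strings, for example as produced by csv.DictReader.
    """
    impressions = 0
    clicks = 0
    campaigns = set()
    conversion_ids = set()
    for row in rows:
        impressions += 1
        campaigns.add(row["campaign"])
        if row["click"] == "1":
            clicks += 1
        # Several impressions can share one conversion, so count unique ids.
        if row["conversion"] == "1":
            conversion_ids.add(row["conversion_id"])
    return {
        "impressions": impressions,
        "clicks": clicks,
        "conversions": len(conversion_ids),
        "campaigns": len(campaigns),
    }


# Usage with made-up rows.
demo_rows = [
    {"campaign": "c1", "click": "0", "conversion": "0", "conversion_id": "-1"},
    {"campaign": "c1", "click": "1", "conversion": "1", "conversion_id": "k1"},
    {"campaign": "c1", "click": "1", "conversion": "1", "conversion_id": "k1"},
    {"campaign": "c2", "click": "0", "conversion": "1", "conversion_id": "k2"},
]
demo_summary = summarize(demo_rows)
print(demo_summary)
```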
## Applicable Tasks
This dataset can be used for a wide range of applications related to real-time bidding, including but not limited to:
1. Attribution modeling: rule-based, model-based, etc.
2. Conversion modeling in display advertising: the dataset includes costs and values for calculating utility metrics
3. Offline metrics for real-time bidding
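As a small illustration of rule-based attribution, the sketch below distributes one unit of credit per conversion over the clicked impressions that share its `conversion_id`, using a last-click, first-click, or linear (equal split) rule. These are generic textbook rules, not the models from the paper, and they do not reproduce how the `attributed` column was computed. The function name `attribute_credit` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def attribute_credit(rows: Sequence[Dict[str, str]], rule: str = "last_click") -> List[float]:
    """Return one credit value per row; credits for each conversion sum to 1.

    Only clicked impressions linked to a conversion are eligible for credit.
    """
    paths = defaultdict(list)  # conversion_id -> [(timestamp, row index), ...]
    for index, row in enumerate(rows):
        if row["click"] == "1" and row["conversion_id"] != "-1":
            paths[row["conversion_id"]].append((float(row["timestamp"]), index))

    credit = [0.0] * len(rows)
    for touches in paths.values():
        touches.sort()  # chronological order within the conversion path
        if rule == "last_click":
            credit[touches[-1][1]] += 1.0
        elif rule == "first_click":
            credit[touches[0][1]] += 1.0
        elif rule == "linear":
            for _, index in touches:
                credit[index] += 1.0 / len(touches)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return credit


# Usage with made-up rows: two clicks and one unclicked display for conversion k1.
path_rows = [
    {"timestamp": "1", "click": "1", "conversion_id": "k1"},
    {"timestamp": "3", "click": "0", "conversion_id": "k1"},
    {"timestamp": "5", "click": "1", "conversion_id": "k1"},
    {"timestamp": "7", "click": "1", "conversion_id": "-1"},
]
for rule_name in ("last_click", "first_click", "linear"):
    print(rule_name, attribute_credit(path_rows, rule_name))
```

Comparing the per-row credits from such rules with the `attributed` and `cost` columns is one way to start building the utility and offline metrics mentioned above.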
Provider: OpenDataLab
Created: 2022-05-23
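## Background Overview
This dataset is a 30-day sample of Criteo live traffic data. It contains anonymized fields covering impressions, clicks, and conversions, and is intended for applications such as ad click-through-rate prediction and attribution modeling. Uncompressed, it is 2.4 GB and covers 16.5 million impressions and 45,000 conversions.
The summary above was collected and generated by 遇见数据集.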



