kimhz/TencentGR-10M
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/kimhz/TencentGR-10M
---
configs:
- config_name: candidate
  data_files:
  - split: train
    path: candidate/**/*.parquet
- config_name: item_feat
  data_files:
  - split: train
    path: item_feat/**/*.parquet
- config_name: seq
  data_files:
  - split: train
    path: seq/**/*.parquet
- config_name: user_feat
  data_files:
  - split: train
    path: user_feat/**/*.parquet
- config_name: mm_emb_81_32
  data_files:
  - split: train
    path: mm_emb/emb_81_32_parquet/**/*.parquet
- config_name: mm_emb_82_1024
  data_files:
  - split: train
    path: mm_emb/emb_82_1024_parquet/**/*.parquet
- config_name: mm_emb_83_3584
  data_files:
  - split: train
    path: mm_emb/emb_83_3584_parquet/**/*.parquet
- config_name: mm_emb_84_32
  data_files:
  - split: train
    path: mm_emb/emb_84_32_parquet/**/*.parquet
- config_name: mm_emb_85_3584
  data_files:
  - split: train
    path: mm_emb/emb_85_3584_1210_parquet/**/*.parquet
- config_name: mm_emb_86_3584
  data_files:
  - split: train
    path: mm_emb/emb_86_3584_1210_parquet/**/*.parquet
license: cc-by-4.0
---
# TencentGR-10M Dataset
TencentGR-10M is the second-round dataset of TAAC2025 (the 2025 Tencent Advertising Algorithm Competition). It is a large-scale, all-modality dataset designed specifically for generative recommendation (GR) in industrial advertising. Like [TencentGR-1M](https://huggingface.co/datasets/TAAC2025/TencentGR-1M), it is constructed from real, de-identified Tencent Ads logs and aims to address the lack of realistic, public multi-modal datasets in the GR field.
The main differences between TencentGR-10M and TencentGR-1M are:
- Dataset Size: Provides **10 million** user sequences, with each user sequence containing up to 100 interacted items.
- Labels: Each interaction within the sequence is explicitly labeled as **exposure (0)**, **click (1)**, or **conversion (2)**.
## Dataset Structure
### Overview
| Config Name | Path | Approx. Size | Description |
|---|---|---|---|
| `candidate` | `candidate/` | ~97 MB | Candidate item set |
| `item_feat` | `item_feat/` | ~348 MB | Item features |
| `seq` | `seq/` | ~9.8 GB | User behavior sequences |
| `user_feat` | `user_feat/` | ~88 MB | User features |
| `mm_emb_81_32` | `mm_emb/emb_81_32_parquet/` | ~5.0 GB | Multimodal embedding (dim=32) |
| `mm_emb_82_1024` | `mm_emb/emb_82_1024_parquet/` | ~36 GB | Multimodal embedding (dim=1024) |
| `mm_emb_83_3584` | `mm_emb/emb_83_3584_parquet/` | ~116 GB | Multimodal embedding (dim=3584) |
| `mm_emb_84_32` | `mm_emb/emb_84_32_parquet/` | ~3.7 GB | Multimodal embedding (dim=32) |
| `mm_emb_85_3584` | `mm_emb/emb_85_3584_1210_parquet/` | ~116 GB | Multimodal embedding (dim=3584) |
| `mm_emb_86_3584` | `mm_emb/emb_86_3584_1210_parquet/` | ~100 GB | Multimodal embedding (dim=3584) |
### Additional Files
| File | Size | Description |
|---|---|---|
| `indexer.pkl` | ~503 MB | Index mapping file (from original IDs to remapped IDs) |
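
Because the embedding configs dominate the total size, it can help to estimate the download footprint before choosing what to fetch. The following is a minimal sketch that sums the approximate sizes from the tables above; the `light` selection and the helper name `total_size_gb` are illustrative assumptions, not part of the dataset.

```python
# Approximate sizes in GB, copied from the tables above (MB converted to GB).
SIZES_GB = {
    "candidate": 0.097,
    "item_feat": 0.348,
    "seq": 9.8,
    "user_feat": 0.088,
    "mm_emb_81_32": 5.0,
    "mm_emb_82_1024": 36.0,
    "mm_emb_83_3584": 116.0,
    "mm_emb_84_32": 3.7,
    "mm_emb_85_3584": 116.0,
    "mm_emb_86_3584": 100.0,
    "indexer.pkl": 0.503,
}


def total_size_gb(names):
    """Sum the approximate sizes of the selected configs/files."""
    return round(sum(SIZES_GB[name] for name in names), 3)


# Hypothetical "lightweight" selection: core tables, the indexer,
# and the two 32-dimensional embeddings.
light = ["candidate", "item_feat", "seq", "user_feat",
         "indexer.pkl", "mm_emb_81_32", "mm_emb_84_32"]

print(total_size_gb(light))     # roughly 19.5 GB
print(total_size_gb(SIZES_GB))  # roughly 387.5 GB for everything
```

These figures are only as precise as the rounded sizes in the tables.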
### Data Format
All data files (except `indexer.pkl`) are stored in **Snappy-compressed Parquet** format.
### Schema
For clarity and brevity, we provide detailed schema descriptions for each table below (**Same as TencentGR-1M**).
Note that the dataset uses two types of IDs: original IDs and remapped IDs, denoted OID and RID for short. OIDs are used in `mm_emb`, while RIDs are used in all the training data and can be used for building models. The mapping between OIDs and RIDs is stored in the `indexer.pkl` file.
#### `item_feat`
The `item_feat` table contains the features of each item that appears in the `seq` set.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | RID for each item. | 4783154 |
| `100` | int64 | An encrypted feature. | 4779045 |
| `101` | int64 | An encrypted feature. | 4779045 |
| `102` | int64 | An encrypted feature. | 4735917 |
| `112` | int64 | An encrypted feature. | 4701740 |
| `114` | int64 | An encrypted feature. | 4778327 |
| `115` | int64 | An encrypted feature. | 1531415 |
| `116` | int64 | An encrypted feature. | 4778146 |
| `117` | int64 | An encrypted feature. | 4701740 |
| `118` | int64 | An encrypted feature. | 4700703 |
| `119` | int64 | An encrypted feature. | 4699894 |
| `120` | int64 | An encrypted feature. | 4694982 |
| `121` | int64 | An encrypted feature. | 4783154 |
| `122` | int64 | An encrypted feature. | 4779045 |
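
The non-None counts above can be turned into coverage ratios to see which features are sparse. This is a small arithmetic sketch using a few counts copied from the table; the helper name `coverage` is an assumption made for illustration.

```python
# Counts copied from the `item_feat` table above.
TOTAL_ITEMS = 4783154  # non-None count of `item_id`
NON_NONE = {
    "100": 4779045,
    "102": 4735917,
    "115": 1531415,
    "120": 4694982,
    "121": 4783154,
}


def coverage(non_none, total=TOTAL_ITEMS):
    """Fraction of items that have a value for a feature."""
    return non_none / total


for field, count in NON_NONE.items():
    print(f"{field}: {coverage(count):.1%}")
```

Feature `115` stands out: it is present for only about a third of the items, so a model probably needs an explicit strategy for its missing values.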
#### `user_feat`
The `user_feat` table contains the features of each user that appears in the dataset.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `103` | int64 | An encrypted feature. | 1000964 |
| `104` | int64 | An encrypted feature. | 998043 |
| `105` | int64 | An encrypted feature. | 859602 |
| `106` | List\[int64\] | An encrypted feature. | 880754 |
| `107` | List\[int64\] | An encrypted feature. | 387686 |
| `108` | List\[int64\] | An encrypted feature. | 170678 |
| `109` | int64 | An encrypted feature. | 1001467 |
| `110` | List\[int64\] | An encrypted feature. | 430598 |
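
The `List[int64]` features vary in length and are often missing, so batching usually requires a fixed length. Below is a minimal padding sketch. It assumes that `0` is free to use as a padding value (the `indexer.pkl` example in this card shows remapped values starting at 1, but that is not guaranteed for every feature), and the helper name `pad_feature` is hypothetical.

```python
def pad_feature(values, length, pad_value=0):
    """Return a fixed-length list: truncate long lists, pad short or missing ones.

    Assumption: `pad_value` does not collide with a real remapped feature value.
    """
    values = list(values) if values is not None else []
    return values[:length] + [pad_value] * max(0, length - len(values))


# Toy rows shaped like `user_feat` (values are made up).
print(pad_feature(None, 3))          # missing feature
print(pad_feature([4, 5], 3))        # shorter than the target length
print(pad_feature([1, 2, 3, 4], 3))  # longer than the target length
```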
#### `seq`
The `seq` table contains the behavior sequence for each user.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `seq` | List\[Dict\] | The behavior sequence for each user, each dict contains 3 keys: `item_id`(RID), `action_type`, and `timestamp`, where the values are all integers | 1001845 |
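
As an illustration of the `seq` layout, the sketch below orders one user's events by `timestamp` and counts the action types. The toy row is made up, the label names follow the codes given at the top of this card, and sorting is a defensive assumption since the card does not state whether sequences are already ordered.

```python
from collections import Counter

# Label codes as stated in the dataset description.
ACTION_NAMES = {0: "exposure", 1: "click", 2: "conversion"}


def summarize_sequence(row):
    """Sort one `seq` row by timestamp and count its action types."""
    events = sorted(row["seq"], key=lambda e: e["timestamp"])
    counts = Counter(ACTION_NAMES.get(e["action_type"], "unknown") for e in events)
    return [e["item_id"] for e in events], dict(counts)


# Toy row that mimics the documented schema (values are made up).
toy_row = {
    "user_id": 7,
    "seq": [
        {"item_id": 12, "action_type": 1, "timestamp": 1700000300},
        {"item_id": 5, "action_type": 0, "timestamp": 1700000100},
        {"item_id": 9, "action_type": 2, "timestamp": 1700000200},
    ],
}

item_ids, action_counts = summarize_sequence(toy_row)
print(item_ids)  # [5, 9, 12]
print(action_counts)
```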
#### `candidate`
The `candidate` table contains the candidate items for the competition.
Note:
- This `candidate` set is the one used in the competition, but the ground-truth labels are not provided. You can use its format as a reference when building your own candidate set.
- The `candidate` set contains some items that do not appear in `seq`.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | OID for each item. | 660000 |
| `retrieval_id` | int64 | The remapped ID for Faiss retrieval (starts from 0). | 660000 |
| `100` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
| `101` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
| `102` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 653852 |
| `112` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654893 |
| `114` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659093 |
| `115` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 195552 |
| `116` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659090 |
| `117` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654893 |
| `118` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654886 |
| `119` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654882 |
| `120` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654870 |
| `121` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 660000 |
| `122` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
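
A retrieval index typically returns positions (`retrieval_id`) rather than item IDs, so the results need to be mapped back. The sketch below shows one way to do this with plain dictionaries; the toy rows and the helper names are assumptions, and no actual Faiss call is made here.

```python
def build_retrieval_lookup(candidate_rows):
    """Map retrieval_id -> item_id for rows shaped like the `candidate` table."""
    return {row["retrieval_id"]: row["item_id"] for row in candidate_rows}


def to_item_ids(retrieval_ids, lookup):
    """Translate retrieval results back to the `item_id` column."""
    return [lookup[r] for r in retrieval_ids]


# Toy candidates (values are made up; feature columns omitted).
toy_candidates = [
    {"item_id": 90001, "retrieval_id": 0},
    {"item_id": 90002, "retrieval_id": 1},
    {"item_id": 90003, "retrieval_id": 2},
]

lookup = build_retrieval_lookup(toy_candidates)
top_k = [2, 0]  # e.g. positions returned by a nearest-neighbour search
print(to_item_ids(top_k, lookup))  # [90003, 90001]
```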
#### `mm_emb`
The `mm_emb` tables contain the multimodal embeddings for each item. There are 6 embeddings, identified as `[81, 82, 83, 84, 85, 86]`, with dimensions `[32, 1024, 3584, 32, 3584, 3584]` respectively (matching the config names and the overview table), stored in 6 separate directories.
Take the `81` embedding as an example:
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `anonymous_cid` | string | OID for each item. | 4742961 |
| `emb` | List\[ double \] | Embedding for each item. | 4742961 |
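
Since `mm_emb` is keyed by OID while the training tables use RIDs, embeddings have to be re-keyed before they can be joined with `seq` or `item_feat`. The following is a minimal sketch under two assumptions: the item mapping is the `"i"` entry of `indexer.pkl`, and its keys may need to be converted to strings to match `anonymous_cid`. The helper name and toy values are hypothetical.

```python
def embeddings_by_rid(emb_rows, item_oid_to_rid):
    """Re-key embedding rows from OID (`anonymous_cid`) to RID.

    Returns the RID -> embedding dict and the number of rows without a mapping.
    """
    # `anonymous_cid` is documented as a string; the key type inside the
    # indexer is an assumption, so normalize both sides to str.
    oid_to_rid = {str(oid): rid for oid, rid in item_oid_to_rid.items()}
    by_rid, missing = {}, 0
    for row in emb_rows:
        rid = oid_to_rid.get(str(row["anonymous_cid"]))
        if rid is None:
            missing += 1
            continue
        by_rid[rid] = row["emb"]
    return by_rid, missing


# Toy data (made up): two mapped items and one embedding without a mapping.
toy_item_map = {20001: 1, 20002: 2}
toy_emb_rows = [
    {"anonymous_cid": "20001", "emb": [0.1, 0.2]},
    {"anonymous_cid": "20002", "emb": [0.3, 0.4]},
    {"anonymous_cid": "29999", "emb": [0.5, 0.6]},
]

by_rid, missing = embeddings_by_rid(toy_emb_rows, toy_item_map)
print(sorted(by_rid), missing)  # [1, 2] 1
```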
#### `indexer.pkl`
This is a remapping file that maps the original IDs/Values to the remapped IDs/Values. The format is a dictionary with the following structure:
```json
{
    "u":
    {
        OID: RID,
        ...
    },
    "i":
    {
        OID: RID,
        ...
    },
    "f":
    {
        101:
        {
            10100000000: 1, // original value: remapped value
            10100000001: 2,
            ...
        },
        102:
        {
            1020000000: 1,
            1020000001: 2,
            ...
        },
        ...
    }
}
```
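The structure above can be loaded with the standard `pickle` module and inverted to go from RID back to OID. This sketch writes a toy indexer to a temporary file so that it runs offline; reading `"u"`, `"i"`, and `"f"` as users, items, and features is an assumption based on the key names, and the key types in the real file may differ. Only unpickle files from a source you trust, since `pickle` can execute arbitrary code on load.

```python
import os
import pickle
import tempfile


def load_indexer(path):
    """Load the mapping dictionary from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def invert(mapping):
    """Turn an OID -> RID mapping into RID -> OID."""
    return {rid: oid for oid, rid in mapping.items()}


# Toy indexer with the documented top-level layout (values are made up).
toy_indexer = {
    "u": {"user_a": 1},
    "i": {20001: 1, 20002: 2},
    "f": {101: {10100000000: 1, 10100000001: 2}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indexer.pkl")  # stand-in for the real file
    with open(path, "wb") as f:
        pickle.dump(toy_indexer, f)
    indexer = load_indexer(path)

item_rid_to_oid = invert(indexer["i"])
print(item_rid_to_oid)  # {1: 20001, 2: 20002}
```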
## Usage
```python
from datasets import load_dataset
# Load a specific config
ds = load_dataset("TAAC2025/TencentGR-10M", name="candidate", split="train")
# Load item features
ds_item = load_dataset("TAAC2025/TencentGR-10M", name="item_feat", split="train")
# Load user behavior sequences
ds_seq = load_dataset("TAAC2025/TencentGR-10M", name="seq", split="train")
# Load user features
ds_user = load_dataset("TAAC2025/TencentGR-10M", name="user_feat", split="train")
# Load multimodal embeddings
ds_emb = load_dataset("TAAC2025/TencentGR-10M", name="mm_emb_81_32", split="train")
```
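The larger embedding configs exceed 100 GB, so it may be preferable to stream them rather than download them in full. The sketch below uses the `streaming=True` option of `load_dataset`; the helper names `stream_config` and `peek` are hypothetical, and the demonstration runs on a stand-in generator so that it works without network access.

```python
from itertools import islice


def stream_config(name, repo_id="TAAC2025/TencentGR-10M"):
    """Open one config lazily; needs the `datasets` package and network access."""
    # Imported here so the offline demonstration below runs without `datasets`.
    from datasets import load_dataset
    return load_dataset(repo_id, name=name, split="train", streaming=True)


def peek(rows, n=3):
    """Return the first n rows of any iterable without reading the rest."""
    return list(islice(rows, n))


# Offline demonstration with a stand-in generator instead of a real stream.
fake_stream = ({"anonymous_cid": str(i), "emb": [0.0] * 4} for i in range(1000))
first = peek(fake_stream, 2)
print([row["anonymous_cid"] for row in first])  # ['0', '1']

# With network access, the same helper can inspect a real config:
# peek(stream_config("mm_emb_83_3584"), 2)
```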
Provider: kimhz



