TAAC2025/TencentGR-1M
Source: Hugging Face, updated 2026-03-30, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TAAC2025/TencentGR-1M
Resource description:
---
configs:
- config_name: candidate
  data_files:
  - split: train
    path: candidate/**/*.parquet
- config_name: item_feat
  data_files:
  - split: train
    path: item_feat/**/*.parquet
- config_name: seq
  data_files:
  - split: train
    path: seq/**/*.parquet
- config_name: user_feat
  data_files:
  - split: train
    path: user_feat/**/*.parquet
- config_name: mm_emb_81_32
  data_files:
  - split: train
    path: mm_emb/emb_81_32_parquet/**/*.parquet
- config_name: mm_emb_82_1024
  data_files:
  - split: train
    path: mm_emb/emb_82_1024_parquet/**/*.parquet
- config_name: mm_emb_83_3584
  data_files:
  - split: train
    path: mm_emb/emb_83_3584_parquet/**/*.parquet
- config_name: mm_emb_84_4096
  data_files:
  - split: train
    path: mm_emb/emb_84_4096_parquet/**/*.parquet
- config_name: mm_emb_85_3584
  data_files:
  - split: train
    path: mm_emb/emb_85_3584_parquet/**/*.parquet
- config_name: mm_emb_86_3584
  data_files:
  - split: train
    path: mm_emb/emb_86_3584_parquet/**/*.parquet
license: cc-by-4.0
---
# TencentGR-1M Dataset
TAAC2025 Preliminary Round Dataset (the dataset for the preliminary round of the 2025 Tencent Advertising Algorithm Competition).
TencentGR-1M is a large-scale, all-modality dataset designed specifically for generative recommendation (GR) in industrial advertising. Constructed from real, de-identified Tencent Ads logs, it aims to address the lack of realistic, public multi-modal datasets in the GR field.
- Data Features: Contains rich collaborative IDs and multi-modal representations (text and vision) extracted using state-of-the-art embedding models.
- Dataset Size: Provides 1 million user sequences, with each user sequence containing up to 100 interacted items.
- Labels: Each interaction in a sequence is explicitly labeled as either **exposure (0)** or **click (1)**.
## Dataset Structure
### Overview
| Config Name | Path | Approx. Size | Description |
|---|---|---|---|
| `candidate` | `candidate/` | ~22 MB | Candidate item set |
| `item_feat` | `item_feat/` | ~104 MB | Item features |
| `seq` | `seq/` | ~881 MB | User behavior sequences |
| `user_feat` | `user_feat/` | ~8.4 MB | User features |
| `mm_emb_81_32` | `mm_emb/emb_81_32_parquet/` | ~901 MB | Multimodal embedding (dim=32) |
| `mm_emb_82_1024` | `mm_emb/emb_82_1024_parquet/` | ~9.4 GB | Multimodal embedding (dim=1024) |
| `mm_emb_83_3584` | `mm_emb/emb_83_3584_parquet/` | ~31 GB | Multimodal embedding (dim=3584) |
| `mm_emb_84_4096` | `mm_emb/emb_84_4096_parquet/` | ~30 GB | Multimodal embedding (dim=4096) |
| `mm_emb_85_3584` | `mm_emb/emb_85_3584_parquet/` | ~31 GB | Multimodal embedding (dim=3584) |
| `mm_emb_86_3584` | `mm_emb/emb_86_3584_parquet/` | ~26 GB | Multimodal embedding (dim=3584) |
### Additional Files
| File | Size | Description |
|---|---|---|
| `indexer.pkl` | ~142 MB | Index mapping file (original ID to remapped ID) |
### Data Format
All data files are stored in **Snappy-compressed Parquet** format.
### Schema
The tables below describe the schema of each table in detail.
The dataset uses two types of IDs: original IDs (OID) and remapped IDs (RID). OIDs are used in `mm_emb` (and, per its schema below, in the `item_id` column of `candidate`). RIDs are used in all the training data and can be used directly for building models. The mapping between OIDs and RIDs is stored in the `indexer.pkl` file.
#### `item_feat`
The `item_feat` table contains the features of each item that appears in the `seq` table.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | RID for each item. | 4783154 |
| `100` | int64 | An encrypted feature. | 4779045 |
| `101` | int64 | An encrypted feature. | 4779045 |
| `102` | int64 | An encrypted feature. | 4735917 |
| `112` | int64 | An encrypted feature. | 4701740 |
| `114` | int64 | An encrypted feature. | 4778327 |
| `115` | int64 | An encrypted feature. | 1531415 |
| `116` | int64 | An encrypted feature. | 4778146 |
| `117` | int64 | An encrypted feature. | 4701740 |
| `118` | int64 | An encrypted feature. | 4700703 |
| `119` | int64 | An encrypted feature. | 4699894 |
| `120` | int64 | An encrypted feature. | 4694982 |
| `121` | int64 | An encrypted feature. | 4783154 |
| `122` | int64 | An encrypted feature. | 4779045 |
#### `user_feat`
The `user_feat` table contains the features of each user that appears in the dataset.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `103` | int64 | An encrypted feature. | 1000964 |
| `104` | int64 | An encrypted feature. | 998043 |
| `105` | int64 | An encrypted feature. | 859602 |
| `106` | List\[int64\] | An encrypted feature. | 880754 |
| `107` | List\[int64\] | An encrypted feature. | 387686 |
| `108` | List\[int64\] | An encrypted feature. | 170678 |
| `109` | int64 | An encrypted feature. | 1001467 |
| `110` | List\[int64\] | An encrypted feature. | 430598 |
#### `seq`
The `seq` table contains the behavior sequence for each user.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `seq` | List\[Dict\] | The user's behavior sequence. Each dict has three integer-valued keys: `item_id` (RID), `action_type`, and `timestamp`. | 1001845 |
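As an illustration of working with this nested layout, here is a minimal sketch that sorts one user's events by time and splits them into parallel lists. It assumes that `action_type` carries the exposure (0) / click (1) label and that `timestamp` values sort chronologically; the card does not state either explicitly. The helper name `unpack_sequence` and the `row` literal are hypothetical toy material, not part of the dataset.

```python
from typing import Dict, List, Tuple


def unpack_sequence(row: Dict) -> Tuple[List[int], List[int], List[int]]:
    """Sort one user's events by timestamp and split them into parallel lists."""
    events = sorted(row["seq"], key=lambda e: e["timestamp"])
    item_ids = [e["item_id"] for e in events]
    actions = [e["action_type"] for e in events]
    timestamps = [e["timestamp"] for e in events]
    return item_ids, actions, timestamps


# Toy row shaped like the `seq` table (all values invented for illustration).
row = {
    "user_id": 7,
    "seq": [
        {"item_id": 12, "action_type": 1, "timestamp": 1700000300},
        {"item_id": 5, "action_type": 0, "timestamp": 1700000100},
        {"item_id": 9, "action_type": 0, "timestamp": 1700000200},
    ],
}

item_ids, actions, timestamps = unpack_sequence(row)
# Assumption: action_type == 1 means "click".
clicked_items = [i for i, a in zip(item_ids, actions) if a == 1]
print(item_ids, actions, clicked_items)
```

The same function can be applied to each record when iterating over the `seq` config.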
#### `candidate`
The `candidate` table contains the candidate items for the competition.
Note:
- This `candidate` set was built for the competition, and its ground-truth labels are not provided. You can use its format as a reference when building your own candidate set.
- The `candidate` table contains some items that do not appear in `seq`.
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | OID for each item. | 660000 |
| `retrieval_id` | int64 | Remapped ID used for Faiss retrieval (starts from 0). | 660000 |
| `100` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
| `101` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
| `102` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 653852 |
| `112` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654893 |
| `114` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659093 |
| `115` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 195552 |
| `116` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659090 |
| `117` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654893 |
| `118` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654886 |
| `119` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654882 |
| `120` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 654870 |
| `121` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 660000 |
| `122` | Dict\[ "cold\_start": int64, "feature_value": string \] | An encrypted feature. | 659206 |
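For readers building their own candidate set in a similar format, the sketch below assigns a contiguous, 0-based `retrieval_id` to a list of item IDs. This is only one plausible construction, not the organizers' procedure. The function name `build_candidate_index` and the toy IDs are hypothetical. Contiguous IDs starting from 0 are convenient because they can double as row positions in a vector index.

```python
from typing import Dict, Iterable, List


def build_candidate_index(item_ids: Iterable[int]) -> List[Dict[str, int]]:
    """Deduplicate item IDs (keeping first-seen order) and number them from 0."""
    unique_ids = list(dict.fromkeys(item_ids))
    return [
        {"item_id": item_id, "retrieval_id": retrieval_id}
        for retrieval_id, item_id in enumerate(unique_ids)
    ]


# Toy item IDs, invented for illustration; 1003 appears twice on purpose.
candidates = build_candidate_index([1003, 1001, 1003, 1002])

# Reverse lookup: map a retrieved position back to its item ID.
retrieval_to_item = {c["retrieval_id"]: c["item_id"] for c in candidates}
print(candidates)
```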
#### `mm_emb`
The `mm_emb` tables contain the multimodal embeddings for each item. Six embeddings (IDs `[81, 82, 83, 84, 85, 86]`) are provided, with dimensions `[32, 1024, 3584, 4096, 3584, 3584]` respectively, each stored in its own config.
Take the `81` embedding as an example:
| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `anonymous_cid` | string | OID for each item. | 4742961 |
| `emb` | List\[ double \] | Embedding for each item. | 4742961 |
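Because `mm_emb` is keyed by OID while the training tables use RIDs, embeddings usually need to be re-keyed through the item mapping in `indexer.pkl`. The sketch below shows one way to do this on toy data. It assumes the item mapping is the `"i"` entry of the indexer, and it normalizes both sides to strings because the card does not say whether the indexer keys are strings or integers. The function name `embeddings_by_rid` and all values are hypothetical.

```python
from typing import Dict, Iterable, List


def embeddings_by_rid(emb_rows: Iterable[Dict], oid_to_rid: Dict) -> Dict[int, List[float]]:
    """Re-key embedding rows from OID (`anonymous_cid`) to RID."""
    # Assumption: key types may differ between the two sources, so compare as str.
    lookup = {str(oid): rid for oid, rid in oid_to_rid.items()}
    rid_to_emb = {}
    for row in emb_rows:
        rid = lookup.get(str(row["anonymous_cid"]))
        if rid is not None:  # skip items that have no RID in the mapping
            rid_to_emb[rid] = row["emb"]
    return rid_to_emb


# Toy rows shaped like an `mm_emb` table (values invented for illustration).
emb_rows = [
    {"anonymous_cid": "20001", "emb": [0.1, 0.2]},
    {"anonymous_cid": "20002", "emb": [0.3, 0.4]},
    {"anonymous_cid": "29999", "emb": [0.5, 0.6]},  # absent from the toy mapping
]
oid_to_rid = {20001: 1, 20002: 2}  # stand-in for indexer["i"]

rid_to_emb = embeddings_by_rid(emb_rows, oid_to_rid)
print(sorted(rid_to_emb))
```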
#### `indexer.pkl`
This remapping file maps original IDs and feature values to their remapped counterparts. The format is a dictionary with the following structure:
```json
{
    "u":
    {
        OID: RID,
        ...
    },
    "i":
    {
        OID: RID,
        ...
    },
    "f":
    {
        101:
        {
            10100000000: 1, // original value: remapped value
            10100000001: 2,
            ...
        },
        102:
        {
            1020000000: 1,
            1020000001: 2,
            ...
        },
        ...
    }
}
```
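A minimal sketch of loading this file and inverting its mappings (RID back to OID) follows. It runs against a small toy dictionary written to a temporary file, so the keys and values shown are invented; only the `"u"` / `"i"` / `"f"` layout comes from the description above. The helper names `load_indexer` and `invert` are hypothetical. Since `pickle.load` can execute arbitrary code, only load an `indexer.pkl` obtained from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_indexer(path):
    """Load the pickled indexer dictionary from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def invert(mapping):
    """Turn an {original: remapped} mapping into {remapped: original}."""
    return {remapped: original for original, remapped in mapping.items()}


# Toy stand-in with the same top-level layout as indexer.pkl (values invented).
toy_indexer = {
    "u": {"user_a": 1, "user_b": 2},
    "i": {20001: 1, 20002: 2},
    "f": {101: {10100000000: 1, 10100000001: 2}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "indexer.pkl"
    path.write_bytes(pickle.dumps(toy_indexer))
    indexer = load_indexer(path)

rid_to_item_oid = invert(indexer["i"])       # item RID -> item OID
rid_to_feat_101 = invert(indexer["f"][101])  # remapped value -> original value
print(rid_to_item_oid, rid_to_feat_101)
```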
## Usage
```python
from datasets import load_dataset
# Load a specific config
ds = load_dataset("TAAC2025/TencentGR-1M", name="candidate", split="train")
# Load item features
ds_item = load_dataset("TAAC2025/TencentGR-1M", name="item_feat", split="train")
# Load user behavior sequences
ds_seq = load_dataset("TAAC2025/TencentGR-1M", name="seq", split="train")
# Load user features
ds_user = load_dataset("TAAC2025/TencentGR-1M", name="user_feat", split="train")
# Load multimodal embeddings
ds_emb = load_dataset("TAAC2025/TencentGR-1M", name="mm_emb_81_32", split="train")
```
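Several embedding configs are tens of gigabytes, so it can be useful to inspect a few rows without downloading everything. The sketch below derives a config name from an embedding ID and reads the first rows using the `streaming=True` option of `load_dataset`. The helper names `emb_config_name` and `peek_embeddings` are hypothetical, and the `datasets` import is deferred so the name helper works without the library installed.

```python
from itertools import islice

# Embedding ID -> dimension, taken from the config names listed in the card.
EMB_DIMS = {81: 32, 82: 1024, 83: 3584, 84: 4096, 85: 3584, 86: 3584}


def emb_config_name(emb_id: int) -> str:
    """Build the config name for an embedding ID, e.g. 81 -> 'mm_emb_81_32'."""
    return f"mm_emb_{emb_id}_{EMB_DIMS[emb_id]}"


def peek_embeddings(emb_id: int, limit: int = 5):
    """Stream the first `limit` rows of an embedding config (needs network access)."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    ds = load_dataset(
        "TAAC2025/TencentGR-1M",
        name=emb_config_name(emb_id),
        split="train",
        streaming=True,  # iterate over remote shards instead of downloading them all
    )
    return list(islice(ds, limit))


print(emb_config_name(81))
```

Calling `peek_embeddings(81, limit=3)` would then return three row dictionaries, assuming the repository is reachable.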
Provider:
TAAC2025



