kimhz/TencentGR-10M

Name: kimhz/TencentGR-10M
Creator: kimhz
Published: 2026-04-10 17:08:43
License: No description available

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/kimhz/TencentGR-10M

Resource description:

---
configs:
- config_name: candidate
  data_files:
  - split: train
    path: candidate/**/*.parquet
- config_name: item_feat
  data_files:
  - split: train
    path: item_feat/**/*.parquet
- config_name: seq
  data_files:
  - split: train
    path: seq/**/*.parquet
- config_name: user_feat
  data_files:
  - split: train
    path: user_feat/**/*.parquet
- config_name: mm_emb_81_32
  data_files:
  - split: train
    path: mm_emb/emb_81_32_parquet/**/*.parquet
- config_name: mm_emb_82_1024
  data_files:
  - split: train
    path: mm_emb/emb_82_1024_parquet/**/*.parquet
- config_name: mm_emb_83_3584
  data_files:
  - split: train
    path: mm_emb/emb_83_3584_parquet/**/*.parquet
- config_name: mm_emb_84_32
  data_files:
  - split: train
    path: mm_emb/emb_84_32_parquet/**/*.parquet
- config_name: mm_emb_85_3584
  data_files:
  - split: train
    path: mm_emb/emb_85_3584_1210_parquet/**/*.parquet
- config_name: mm_emb_86_3584
  data_files:
  - split: train
    path: mm_emb/emb_86_3584_1210_parquet/**/*.parquet
license: cc-by-4.0
---

# TencentGR-10M Dataset

TAAC2025 Second Round Dataset (the second-round dataset of the 2025 Tencent Advertising Algorithm Competition)

TencentGR-10M is a large-scale, all-modality dataset designed specifically for generative recommendation (GR) in industrial advertising. Like [TencentGR-1M](https://huggingface.co/datasets/TAAC2025/TencentGR-1M), it is constructed from real, de-identified Tencent Ads logs and aims to address the lack of realistic, public multimodal datasets in the GR field.

The main differences between TencentGR-10M and TencentGR-1M are:

- Dataset Size: Provides **10 million** user sequences, each containing up to 100 interacted items.
- Labels: Each interaction within a sequence is explicitly labeled as **exposure (0)**, **click (1)**, or **conversion (2)**.

## Dataset Structure

### Overview

| Config Name | Path | Approx. Size | Description |
|---|---|---|---|
| `candidate` | `candidate/` | ~97 MB | Candidate item set |
| `item_feat` | `item_feat/` | ~348 MB | Item features |
| `seq` | `seq/` | ~9.8 GB | User behavior sequences |
| `user_feat` | `user_feat/` | ~88 MB | User features |
| `mm_emb_81_32` | `mm_emb/emb_81_32_parquet/` | ~5.0 GB | Multimodal embedding (dim=32) |
| `mm_emb_82_1024` | `mm_emb/emb_82_1024_parquet/` | ~36 GB | Multimodal embedding (dim=1024) |
| `mm_emb_83_3584` | `mm_emb/emb_83_3584_parquet/` | ~116 GB | Multimodal embedding (dim=3584) |
| `mm_emb_84_32` | `mm_emb/emb_84_32_parquet/` | ~3.7 GB | Multimodal embedding (dim=32) |
| `mm_emb_85_3584` | `mm_emb/emb_85_3584_1210_parquet/` | ~116 GB | Multimodal embedding (dim=3584) |
| `mm_emb_86_3584` | `mm_emb/emb_86_3584_1210_parquet/` | ~100 GB | Multimodal embedding (dim=3584) |

### Additional Files

| File | Size | Description |
|---|---|---|
| `indexer.pkl` | ~503 MB | Index mapping file (from original ID to remapped ID) |

### Data Format

All data files (except `indexer.pkl`) are stored in **Snappy-compressed Parquet** format.

### Schema

For clarity and brevity, we provide detailed schema descriptions for each table below (**same as TencentGR-1M**).

Note that the dataset uses two types of IDs: the original IDs and the remapped IDs, denoted OID and RID for short. OIDs are used in `mm_emb`; RIDs are used in all the training data and can be used directly for building models. The mapping between OIDs and RIDs can be found in the `indexer.pkl` file.

#### `item_feat`

The `item_feat` table contains the features of each item that appears in the `seq` set.

| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | RID for each item. | 4783154 |
| `100` | int64 | An encrypted feature. | 4779045 |
| `101` | int64 | An encrypted feature. | 4779045 |
| `102` | int64 | An encrypted feature. | 4735917 |
| `112` | int64 | An encrypted feature. | 4701740 |
| `114` | int64 | An encrypted feature. | 4778327 |
| `115` | int64 | An encrypted feature. | 1531415 |
| `116` | int64 | An encrypted feature. | 4778146 |
| `117` | int64 | An encrypted feature. | 4701740 |
| `118` | int64 | An encrypted feature. | 4700703 |
| `119` | int64 | An encrypted feature. | 4699894 |
| `120` | int64 | An encrypted feature. | 4694982 |
| `121` | int64 | An encrypted feature. | 4783154 |
| `122` | int64 | An encrypted feature. | 4779045 |

#### `user_feat`

The `user_feat` table contains the features of each user that appears in the dataset.

| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `103` | int64 | An encrypted feature. | 1000964 |
| `104` | int64 | An encrypted feature. | 998043 |
| `105` | int64 | An encrypted feature. | 859602 |
| `106` | List\[int64\] | An encrypted feature. | 880754 |
| `107` | List\[int64\] | An encrypted feature. | 387686 |
| `108` | List\[int64\] | An encrypted feature. | 170678 |
| `109` | int64 | An encrypted feature. | 1001467 |
| `110` | List\[int64\] | An encrypted feature. | 430598 |

#### `seq`

The `seq` table contains the behavior sequence for each user.

| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `seq` | List\[Dict\] | The behavior sequence for each user. Each dict contains 3 keys: `item_id` (RID), `action_type`, and `timestamp`; all values are integers. | 1001845 |

#### `candidate`

The `candidate` table contains the candidate items for the competition.

Note:

- This `candidate` set is the one used for the competition, but ground-truth labels are not provided. You may use its format as a reference for building your own candidate set.
- The `candidate` set contains some items that do not appear in `seq`.

| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | OID for each item. | 660000 |
| `retrieval_id` | int64 | The remapped ID for faiss retrieval (starting from 0). | 660000 |
| `100` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 659206 |
| `101` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 659206 |
| `102` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 653852 |
| `112` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 654893 |
| `114` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 659093 |
| `115` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 195552 |
| `116` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 659090 |
| `117` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 654893 |
| `118` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 654886 |
| `119` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 654882 |
| `120` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 654870 |
| `121` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 660000 |
| `122` | Dict\[ "cold\_start": int64, "feature\_value": string \] | An encrypted feature. | 659206 |

#### `mm_emb`

The `mm_emb` tables contain the multimodal embeddings for each item. There are 6 embeddings (`[81, 82, 83, 84, 85, 86]`) with dimensions `[32, 1024, 3584, 32, 3584, 3584]` respectively, matching the config names in the Overview table; each is stored as a separate config. Taking the `81` embedding as an example:

| **Field** | **Type** | **Description** | \# Non-None Values |
|:---:|:---:|:---:|:---:|
| `anonymous_cid` | string | OID for each item. | 4742961 |
| `emb` | List\[ double \] | Embedding for each item. | 4742961 |

#### `indexer.pkl`

This is a remapping file that maps the original IDs/values to the remapped IDs/values. It is a dictionary with the following structure:

```json
{
    "u": { OID: RID, ... },
    "i": { OID: RID, ... },
    "f": {
        101: {
            10100000000: 1,  // original value: remapped value
            10100000001: 2,
            ...
        },
        102: {
            1020000000: 1,
            1020000001: 2,
            ...
        },
        ...
    }
}
```

## Usage

```python
from datasets import load_dataset

# Load a specific config
ds = load_dataset("TAAC2025/TencentGR-10M", name="candidate", split="train")

# Load item features
ds_item = load_dataset("TAAC2025/TencentGR-10M", name="item_feat", split="train")

# Load user behavior sequences
ds_seq = load_dataset("TAAC2025/TencentGR-10M", name="seq", split="train")

# Load user features
ds_user = load_dataset("TAAC2025/TencentGR-10M", name="user_feat", split="train")

# Load multimodal embeddings
ds_emb = load_dataset("TAAC2025/TencentGR-10M", name="mm_emb_81_32", split="train")
```
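
The OID/RID convention and the `action_type` labels described in the Schema section can be illustrated with a small sketch. It runs on toy data rather than the real files; the helper names (`invert_mapping`, `count_actions`, `embedding_keys`) are hypothetical, and converting an OID to a string to match `anonymous_cid` is an assumption about how the two key types line up, not something the dataset card states.

```python
from collections import Counter

# Label values taken from the dataset description:
# exposure(0), click(1), conversion(2).
ACTION_NAMES = {0: "exposure", 1: "click", 2: "conversion"}


def invert_mapping(oid_to_rid):
    """Turn an OID -> RID dict (e.g. indexer["i"]) into RID -> OID."""
    return {rid: oid for oid, rid in oid_to_rid.items()}


def count_actions(seq):
    """Count interactions per label in one user's `seq` list."""
    counts = Counter(
        ACTION_NAMES.get(event["action_type"], "unknown") for event in seq
    )
    return dict(counts)


def embedding_keys(seq, rid_to_oid):
    """Map the item RIDs of a sequence to OID strings.

    Assumption: `anonymous_cid` in `mm_emb` is the string form of the OID.
    Items missing from the mapping are skipped.
    """
    return [
        str(rid_to_oid[event["item_id"]])
        for event in seq
        if event["item_id"] in rid_to_oid
    ]


# Toy stand-ins shaped like the documented structures (not real data).
toy_indexer = {"u": {9001: 1}, "i": {700: 1, 701: 2, 702: 3}, "f": {}}
toy_seq = [
    {"item_id": 1, "action_type": 0, "timestamp": 1700000000},
    {"item_id": 2, "action_type": 1, "timestamp": 1700000060},
    {"item_id": 3, "action_type": 2, "timestamp": 1700000120},
    {"item_id": 1, "action_type": 0, "timestamp": 1700000180},
]

rid_to_oid = invert_mapping(toy_indexer["i"])
print(count_actions(toy_seq))
print(embedding_keys(toy_seq, rid_to_oid))
```

With the real files, `toy_indexer` would be replaced by the dictionary unpickled from `indexer.pkl` (for example with `pickle.load`), and `toy_seq` by the `seq` field of a row from the `seq` config. Since unpickling can execute arbitrary code, the file should only be loaded from a source you trust.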


Provider:

kimhz
