TAAC2025/TencentGR-1M

Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/TAAC2025/TencentGR-1M
Description:
---
configs:
- config_name: candidate
  data_files:
  - split: train
    path: candidate/**/*.parquet
- config_name: item_feat
  data_files:
  - split: train
    path: item_feat/**/*.parquet
- config_name: seq
  data_files:
  - split: train
    path: seq/**/*.parquet
- config_name: user_feat
  data_files:
  - split: train
    path: user_feat/**/*.parquet
- config_name: mm_emb_81_32
  data_files:
  - split: train
    path: mm_emb/emb_81_32_parquet/**/*.parquet
- config_name: mm_emb_82_1024
  data_files:
  - split: train
    path: mm_emb/emb_82_1024_parquet/**/*.parquet
- config_name: mm_emb_83_3584
  data_files:
  - split: train
    path: mm_emb/emb_83_3584_parquet/**/*.parquet
- config_name: mm_emb_84_4096
  data_files:
  - split: train
    path: mm_emb/emb_84_4096_parquet/**/*.parquet
- config_name: mm_emb_85_3584
  data_files:
  - split: train
    path: mm_emb/emb_85_3584_parquet/**/*.parquet
- config_name: mm_emb_86_3584
  data_files:
  - split: train
    path: mm_emb/emb_86_3584_parquet/**/*.parquet
license: cc-by-4.0
---

# TencentGR-1M Dataset

TAAC2025 Preliminary Round Dataset (the preliminary-round dataset of the 2025 Tencent Advertising Algorithm Competition)

The TencentGR-1M Dataset is a large-scale, all-modality dataset designed specifically for generative recommendation (GR) in industrial advertising. Constructed from real, de-identified Tencent Ads logs, it aims to address the lack of realistic, public multi-modal datasets in the GR field.

- Data Features: Contains rich collaborative IDs and multi-modal representations (text and vision) extracted using state-of-the-art embedding models.
- Dataset Size: Provides 1 million user sequences, with each user sequence containing up to 100 interacted items.
- Labels: Each interaction within the sequence is explicitly labeled with **exposure(0)** and **click(1)** signals.

## Dataset Structure

### Overview

| Config Name | Path | Approx. Size | Description |
|---|---|---|---|
| `candidate` | `candidate/` | ~22 MB | Candidate item set |
| `item_feat` | `item_feat/` | ~104 MB | Item features |
| `seq` | `seq/` | ~881 MB | User behavior sequences |
| `user_feat` | `user_feat/` | ~8.4 MB | User features |
| `mm_emb_81_32` | `mm_emb/emb_81_32_parquet/` | ~901 MB | Multimodal embedding (dim=32) |
| `mm_emb_82_1024` | `mm_emb/emb_82_1024_parquet/` | ~9.4 GB | Multimodal embedding (dim=1024) |
| `mm_emb_83_3584` | `mm_emb/emb_83_3584_parquet/` | ~31 GB | Multimodal embedding (dim=3584) |
| `mm_emb_84_4096` | `mm_emb/emb_84_4096_parquet/` | ~30 GB | Multimodal embedding (dim=4096) |
| `mm_emb_85_3584` | `mm_emb/emb_85_3584_parquet/` | ~31 GB | Multimodal embedding (dim=3584) |
| `mm_emb_86_3584` | `mm_emb/emb_86_3584_parquet/` | ~26 GB | Multimodal embedding (dim=3584) |

### Additional Files

| File | Size | Description |
|---|---|---|
| `indexer.pkl` | ~142 MB | Index mapping file (from original ID to remapped ID) |

### Data Format

All data files are stored in **Snappy-compressed Parquet** format.

### Schema

The schema of each table is described below. Note that the dataset uses two types of IDs: original IDs and remapped IDs, denoted OID and RID for short. OIDs are used in `mm_emb`; RIDs are used in all the training data and can be used for building models. The mapping between OIDs and RIDs can be found in the `indexer.pkl` file.

#### `item_feat`

The `item_feat` table contains the features of each item that appears in the `seq` set.

| **Field** | **Type** | **Description** | **\# Non-None Values** |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | RID for each item. | 4783154 |
| `100` | int64 | An encrypted feature. | 4779045 |
| `101` | int64 | An encrypted feature. | 4779045 |
| `102` | int64 | An encrypted feature. | 4735917 |
| `112` | int64 | An encrypted feature. | 4701740 |
| `114` | int64 | An encrypted feature. | 4778327 |
| `115` | int64 | An encrypted feature. | 1531415 |
| `116` | int64 | An encrypted feature. | 4778146 |
| `117` | int64 | An encrypted feature. | 4701740 |
| `118` | int64 | An encrypted feature. | 4700703 |
| `119` | int64 | An encrypted feature. | 4699894 |
| `120` | int64 | An encrypted feature. | 4694982 |
| `121` | int64 | An encrypted feature. | 4783154 |
| `122` | int64 | An encrypted feature. | 4779045 |

#### `user_feat`

The `user_feat` table contains the features of each user that appears in the dataset.

| **Field** | **Type** | **Description** | **\# Non-None Values** |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `103` | int64 | An encrypted feature. | 1000964 |
| `104` | int64 | An encrypted feature. | 998043 |
| `105` | int64 | An encrypted feature. | 859602 |
| `106` | List\[int64\] | An encrypted feature. | 880754 |
| `107` | List\[int64\] | An encrypted feature. | 387686 |
| `108` | List\[int64\] | An encrypted feature. | 170678 |
| `109` | int64 | An encrypted feature. | 1001467 |
| `110` | List\[int64\] | An encrypted feature. | 430598 |

#### `seq`

The `seq` table contains the behavior sequence for each user.

| **Field** | **Type** | **Description** | **\# Non-None Values** |
|:---:|:---:|:---:|:---:|
| `user_id` | int64 | RID for each user. | 1001845 |
| `seq` | List\[Dict\] | The behavior sequence for each user. Each dict contains 3 keys: `item_id` (RID), `action_type`, and `timestamp`; all values are integers. | 1001845 |

#### `candidate`

The `candidate` table contains the candidate items for the competition.

Note:

- This `candidate` set is the one used in the competition, but the ground truth labels are not provided. You may use this format as a reference for building your own candidate set.
- The `candidate` set contains some items that are not in `seq`.

| **Field** | **Type** | **Description** | **\# Non-None Values** |
|:---:|:---:|:---:|:---:|
| `item_id` | int64 | OID for each item. | 660000 |
| `retrieval_id` | int64 | The remapped ID for faiss retrieval (starts from 0). | 660000 |
| `100` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 659206 |
| `101` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 659206 |
| `102` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 653852 |
| `112` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 654893 |
| `114` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 659093 |
| `115` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 195552 |
| `116` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 659090 |
| `117` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 654893 |
| `118` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 654886 |
| `119` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 654882 |
| `120` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 654870 |
| `121` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 660000 |
| `122` | Dict\["cold\_start": int64, "feature\_value": string\] | An encrypted feature. | 659206 |

#### `mm_emb`

The `mm_emb` tables contain the multimodal embeddings for each item. There are 6 embeddings (`[81, 82, 83, 84, 85, 86]`) with dimensions `[32, 1024, 3584, 4096, 3584, 3584]` respectively, placed in 6 files. Take the `81` embedding as an example:

| **Field** | **Type** | **Description** | **\# Non-None Values** |
|:---:|:---:|:---:|:---:|
| `anonymous_cid` | string | OID for each item. | 4742961 |
| `emb` | List\[double\] | Embedding for each item. | 4742961 |

#### `indexer.pkl`

This is a remapping file that maps the original IDs/values to the remapped IDs/values. The format is a dictionary with the following structure:

```json
{
  "u": { OID: RID, ... },
  "i": { OID: RID, ... },
  "f": {
    101: {
      10100000000: 1, // original value: remapped value
      10100000001: 2,
      ...
    },
    102: {
      1020000000: 1,
      1020000001: 2,
      ...
    },
    ...
  }
}
```

## Usage

```python
from datasets import load_dataset

# Load a specific config
ds = load_dataset("TAAC2025/TencentGR-1M", name="candidate", split="train")

# Load item features
ds_item = load_dataset("TAAC2025/TencentGR-1M", name="item_feat", split="train")

# Load user behavior sequences
ds_seq = load_dataset("TAAC2025/TencentGR-1M", name="seq", split="train")

# Load user features
ds_user = load_dataset("TAAC2025/TencentGR-1M", name="user_feat", split="train")

# Load multimodal embeddings
ds_emb = load_dataset("TAAC2025/TencentGR-1M", name="mm_emb_81_32", split="train")
```
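
Because `seq` uses RIDs while `mm_emb` is keyed by OIDs, joining the two requires inverting the `"i"` mapping from `indexer.pkl`. The following is a minimal sketch on toy in-memory data, not the official pipeline. The helper names (`invert_item_index`, `clicked_items`, `lookup_embeddings`) and all toy values are hypothetical. It also assumes that `action_type` carries the exposure(0)/click(1) label and that an `anonymous_cid` string equals `str(OID)`; neither is confirmed by the schema above, so verify both against the real files.

```python
from typing import Any, Dict, List, Optional


def invert_item_index(indexer: Dict[str, Any]) -> Dict[int, Any]:
    """Build an RID -> OID lookup from indexer["i"], which is stored as OID -> RID."""
    return {rid: oid for oid, rid in indexer["i"].items()}


def clicked_items(seq: List[Dict[str, int]]) -> List[int]:
    """Return the item RIDs whose action_type is 1, ordered by timestamp.

    Assumption: action_type == 1 means click and 0 means exposure.
    """
    ordered = sorted(seq, key=lambda event: event["timestamp"])
    return [event["item_id"] for event in ordered if event["action_type"] == 1]


def lookup_embeddings(
    rids: List[int],
    rid_to_oid: Dict[int, Any],
    emb_by_cid: Dict[str, List[float]],
) -> List[Optional[List[float]]]:
    """Map each RID to its embedding, or None when no embedding is found.

    Assumption: the embedding key (anonymous_cid) is the OID rendered as a string.
    """
    result: List[Optional[List[float]]] = []
    for rid in rids:
        oid = rid_to_oid.get(rid)
        result.append(emb_by_cid.get(str(oid)) if oid is not None else None)
    return result


# Toy stand-ins for indexer.pkl, one `seq` row, and an mm_emb table (all values invented).
toy_indexer = {"u": {9001: 1}, "i": {555: 1, 777: 2, 888: 3}, "f": {}}
toy_seq = [
    {"item_id": 2, "action_type": 1, "timestamp": 1700000200},
    {"item_id": 1, "action_type": 0, "timestamp": 1700000100},
    {"item_id": 3, "action_type": 1, "timestamp": 1700000300},
]
toy_emb = {"777": [0.1, 0.2], "555": [0.3, 0.4]}

rid_to_oid = invert_item_index(toy_indexer)
clicks = clicked_items(toy_seq)
embs = lookup_embeddings(clicks, rid_to_oid, toy_emb)
print(clicks)  # [2, 3]
print(embs)    # [[0.1, 0.2], None]
```

Returning `None` for a missing embedding keeps the output aligned with the input sequence, which matters because the `mm_emb` tables list fewer items (4742961) than `item_feat` (4783154). With the real data, `indexer.pkl` would typically be read with `pickle.load`, and the larger `mm_emb` configs are probably better indexed once into a dictionary or an on-disk store than scanned per lookup.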
Provider:
TAAC2025