
T-ECD

ModelScope community. Updated 2025-12-05; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/t-tech/T-ECD
Resource description:
# T-ECD: T-Tech E-commerce Cross-Domain Dataset

![image (2)](https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/t8w2QTQbnH0DhEIbJYCnT.png)

⭐️ **T-ECD** is a large-scale synthetic cross-domain dataset for recommender systems research, created by T-Bank's RecSys R&D team. It captures real-world e-commerce interaction patterns across multiple domains while ensuring complete anonymity through synthetic generation.

### 🎯 Overview

T-ECD represents user interactions across five different e-commerce domains within a banking ecosystem:

- **Marketplace**: browsing and interacting with items in an e-commerce marketplace.
- **Retail**: interactions within a retail delivery service, including cart additions and completed orders.
- **Payments**: online and offline financial transactions between users and brands.
- **Offers**: responses to promotional content such as impressions, clicks, and partner transitions.
- **Reviews**: explicit user feedback in the form of ratings and embeddings of textual comments.

**Scale:**

- **~135B** interactions
- ~44M users
- ~30M items
- **1300+ days of temporal coverage**

Additionally, we provide **T-ECD Small**, a compact version containing 1B interactions that excludes the Payments domain.

<div style="font-size: 1.1em;">

| Metric | T-ECD Small | T-ECD Full |
|--------|-------------|------------|
| 🔄 **Interactions** | ~1B | **~135B** |
| 👥 **Users** | ~3.5M | **~44M** |
| 📦 **Items** | ~2.6M | **~30M** |
| 🏪 **Brands** | ~29K | **~1M** |
| 📅 **Temporal Coverage** | 200+ days | **1300+ days** |
| 🌐 **Domains** | 4 (excl. Payments) | **5 (all domains)** |

</div>

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Y3hHv_cipdq2p4A9jiQoz.png" style="max-width: 80%; height: auto;">

Cross-domain consistency is achieved by aligning identifiers across all domains:

- the same `user_id` always refers to the same individual user, and
- the same `brand_id` always refers to the same brand entity.

This alignment allows researchers to seamlessly link interactions from different services, enabling studies in transfer learning, cross-domain personalization, and multi-task modeling.

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/QG0DavvcvccN1GcN_gRL6.png" style="max-width: 80%; height: auto;">

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/s8a8iC4RmUjsD_hzOVPvD.png" style="max-width: 80%; height: auto;">

---

### 📂 Data Schema

The dataset is stored in **Parquet** format with daily partitions (`{day}`). The directory structure is as follows:

```
t-ecd/
├── users.pq
├── brands.pq
├── marketplace/
│   ├── events/{day}.pq
│   └── items.pq
├── retail/
│   ├── events/{day}.pq
│   └── items.pq
├── payments/
│   ├── events/{day}.pq
│   └── receipts/{day}.pq
├── offers/
│   ├── events/{day}.pq
│   └── items.pq
└── reviews/{day}.pq
```

#### Data availability

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/c2Clc9bNxL9i7jgGBfBq2.png" style="max-width: 80%; height: auto;" alt="Temporal distribution of events over domains">

*Temporal distribution of events over domains*

In line with real-world industrial environments, domain-specific data availability varies in historical depth. This reflects practical constraints including data retention policies and product lifecycle stages: newer e-commerce services naturally have shorter histories compared to established banking domains like payments and transactions.
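As a minimal sketch of how the directory layout above maps to file paths, the helper below lists the expected daily partition files for one table. The function name `daily_partition_paths`, the `DAILY_TABLES` tuple, the inclusive day range, and the assumption that `{day}` is a plain non-padded integer are all illustrative and not part of the dataset's tooling.

```python
from pathlib import Path

# Daily-partitioned tables, taken from the directory tree above.
# (reviews/ holds {day}.pq files directly, with no events/ sub-directory.)
DAILY_TABLES = (
    "marketplace/events",
    "retail/events",
    "payments/events",
    "payments/receipts",
    "offers/events",
    "reviews",
)


def daily_partition_paths(root, table, day_begin, day_end):
    """Return the expected {day}.pq paths for day_begin..day_end.

    Assumptions (hypothetical, not confirmed by the dataset card):
    {day} is a non-padded integer index and day_end is inclusive.
    """
    if table not in DAILY_TABLES:
        raise ValueError(f"unknown daily table: {table!r}")
    return [Path(root) / table / f"{day}.pq" for day in range(day_begin, day_end + 1)]


if __name__ == "__main__":
    for path in daily_partition_paths("t-ecd", "retail/events", 1300, 1302):
        print(path.as_posix())
```

Each returned path could then be passed to a Parquet reader such as `pandas.read_parquet`, provided the file was actually downloaded for that day.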
### ⚙️ Events and Catalogs

- **Events**: Each domain provides logs of user interactions with the following possible columns:
  - `action_type`: interaction type (e.g., view, click, add-to-cart, order, transaction).
  - `subdomain`: surface where the interaction occurred (recommendations, catalog, search, checkout, campaign); available in Marketplace and Retail.
  - `item_id`: present in Marketplace, Retail, and Offers; identifies a specific product or offer.
  - `brand_id`: present in all domains; denotes the seller, store, or partner associated with an item, offer, or transaction.
  - `price`: the monetary value of the interaction.
  - `count`: the number of items in a single interaction.
  - `os`: user operating system; available in Marketplace and Retail.

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Q7aeb_I-Yf-rcqyPDTOLa.png" style="max-width: 80%; height: auto;" >

- **Item catalogs (`items.pq`)**: Available for Marketplace, Retail, and Offers. Each entry includes:
  - `item_id`
  - `brand_id`
  - category information (if available)
  - pretrained embedding (if available)
- **User catalog (`users.pq`)**: Contains anonymized user attributes such as region and socio-demographic cluster.
- **Brand catalog (`brands.pq`)**: Contains `brand_id`, brand-level metadata, and embeddings.

#### 🧾 Special Structures

- **Receipts (`payments/receipts/{day}.pq`)**: Some transactions include detailed receipts with purchased items, their quantities, and prices. Items are aligned with Marketplace and Retail catalogs, enabling fine-grained cross-domain linkage at the product level.
- **Reviews (`reviews/{day}.pq`)**: Provide explicit ratings per brand. Raw text reviews are not included; instead, we release pretrained text embeddings to preserve privacy while enabling multimodal research.

---

### 🛠️ Data Collection

T-ECD was generated through a multi-step process:

1. **Sampling of event chains**: sequences of interactions were sampled from real logs of T-Bank ecosystem services.
2. **Anonymization**: user and brand identifiers were pseudonymized, and sensitive attributes were removed.
3. **Synthetic generation**: based on real distributions and event patterns, new synthetic interaction chains were produced, preserving structural properties such as sparsity, heavy tails, cross-domain overlaps, and behavioral contexts.

This process ensures that the dataset is privacy-preserving while remaining representative of industrial recommender system data.

## ⚠️ Important Note on Temporal Data Usage

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/zaPAcuD3CItTzP2PBkErs.png" style="max-width: 80%; height: auto;">

**To prevent data leakage, events from the final 12 hours should not be used for prediction tasks.**

The dataset contains temporal noise that requires maintaining a minimum 12-hour gap between the timestamp of the most recent user event and the prediction timestamp. This constraint applies to both training and testing scenarios to avoid temporal data leakage.
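The 12-hour gap described above could be enforced with a simple cutoff filter, sketched below on toy records. The `timestamp` field name, the use of `datetime` values, and the helper `usable_history` are assumptions made for illustration; the dataset card does not specify how event time is stored.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(hours=12)  # minimum gap required by the note above


def usable_history(events, prediction_time, min_gap=MIN_GAP):
    """Keep only events that are at least `min_gap` older than `prediction_time`.

    `events` is a list of dicts with a hypothetical `timestamp` key.
    """
    cutoff = prediction_time - min_gap
    return [event for event in events if event["timestamp"] <= cutoff]


if __name__ == "__main__":
    # Toy events for a single user; values are invented for the example.
    events = [
        {"user_id": 1, "action_type": "view", "timestamp": datetime(2024, 1, 1, 8, 0)},
        {"user_id": 1, "action_type": "click", "timestamp": datetime(2024, 1, 1, 20, 0)},
        {"user_id": 1, "action_type": "order", "timestamp": datetime(2024, 1, 2, 7, 0)},
    ]
    prediction_time = datetime(2024, 1, 2, 9, 0)
    history = usable_history(events, prediction_time)
    print([event["action_type"] for event in history])  # prints ['view', 'click']
```

Applying the same cutoff when building both training and evaluation examples keeps the two settings consistent with the constraint.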
## Download

#### Basic Download

```python
from huggingface_hub import snapshot_download

# Download the files under dataset/full/ from the Hugging Face Hub.
snapshot_download(
    repo_id="t-tech/T-ECD",
    repo_type="dataset",
    allow_patterns="dataset/full/",
    local_dir="./t_ecd_data",
    token="<your_hf_token>"
)
```

#### Selective Download

For advanced usage, including selection of domains and date ranges, we provide a custom downloader, [tecd_downloader.py](https://huggingface.co/datasets/t-tech/T-ECD/blob/main/tecd_downloader.py).

Example usage:

```python
from tecd_downloader import download_dataset

# Fetch only two domains of T-ECD Small for a limited range of days.
download_dataset(
    token="<your_hf_token>",
    dataset_path="dataset/small",
    local_dir="t_ecd_small_partial",
    domains=["retail", "marketplace"],
    day_begin=1300,
    day_end=1308,
    max_workers=10
)
```

---

### 🔐 License

This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

---

Provider: maas
Created: 2025-10-04