T-ECD
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/t-tech/T-ECD
# T-ECD: T-Tech E-commerce Cross-Domain Dataset

⭐️ **T-ECD** is a large-scale synthetic cross-domain dataset for recommender systems research, created by T-Bank's RecSys R&D team.
It captures real-world e-commerce interaction patterns across multiple domains while ensuring complete anonymity through synthetic generation.
### 🎯 Overview
T-ECD represents user interactions across five different e-commerce domains within a banking ecosystem:
- **Marketplace** — browsing and interacting with items in an e-commerce marketplace.
- **Retail** — interactions within a retail delivery service, including cart additions and completed orders.
- **Payments** — online and offline financial transactions between users and brands.
- **Offers** — responses to promotional content such as impressions, clicks, and partner transitions.
- **Reviews** — explicit user feedback in the form of ratings and embeddings of textual comments.
**Scale:**
- **~135B** interactions
- ~44M users
- ~30M items
- **1300+ days of temporal coverage**
Additionally, we provide **T-ECD Small**, a compact version with ~1B interactions that excludes the Payments domain.
<div style="font-size: 1.1em;">
| Metric | T-ECD Small | T-ECD Full |
|--------|-------------|------------|
| 🔄 **Interactions** | ~1B | **~135B** |
| 👥 **Users** | ~3.5M | **~44M** |
| 📦 **Items** | ~2.6M | **~30M** |
| 🏪 **Brands** | ~29K | **~1M** |
| 📅 **Temporal Coverage** | 200+ days | **1300+ days** |
| 🌐 **Domains** | 4 (excl. Payments) | **5 (all domains)** |
</div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Y3hHv_cipdq2p4A9jiQoz.png" style="max-width: 80%; height: auto;">
Cross-domain consistency is achieved by aligning identifiers across all domains:
- the same `user_id` always refers to the same individual user, and
- the same `brand_id` always refers to the same brand entity.
This alignment allows researchers to seamlessly link interactions from different services, enabling studies in transfer learning, cross-domain personalization, and multi-task modeling.
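As a minimal sketch of what the shared identifiers make possible, the snippet below groups toy event rows from two domains by `user_id` and keeps the users seen in both. The rows and the helper names (`brands_by_user`, `shared_users`) are invented for illustration; only the `user_id` and `brand_id` column names come from the description above.

```python
from collections import defaultdict

# Toy rows standing in for events loaded from two domains (values are made up).
marketplace_events = [
    {"user_id": 1, "brand_id": 10},
    {"user_id": 2, "brand_id": 11},
]
payments_events = [
    {"user_id": 2, "brand_id": 11},
    {"user_id": 3, "brand_id": 12},
]


def brands_by_user(events):
    """Map each user_id to the set of brand_ids that user interacted with."""
    result = defaultdict(set)
    for row in events:
        result[row["user_id"]].add(row["brand_id"])
    return dict(result)


def shared_users(events_a, events_b):
    """For users present in both domains, return the brands seen in both."""
    a, b = brands_by_user(events_a), brands_by_user(events_b)
    return {user: a[user] & b[user] for user in a.keys() & b.keys()}


overlap = shared_users(marketplace_events, payments_events)
```

On real data the same linkage would more likely be a join on `user_id` in a dataframe library; the plain-Python version only shows the idea.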
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/QG0DavvcvccN1GcN_gRL6.png" style="max-width: 80%; height: auto;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/s8a8iC4RmUjsD_hzOVPvD.png" style="max-width: 80%; height: auto;">
---
### 📂 Data Schema
The dataset is stored in **Parquet** format with daily partitions (`{day}`).
The directory structure is as follows:
```
t-ecd/
├── users.pq
├── brands.pq
├── marketplace/
│ ├── events/{day}.pq
│ └── items.pq
├── retail/
│ ├── events/{day}.pq
│ └── items.pq
├── payments/
│ ├── events/{day}.pq
│ └── receipts/{day}.pq
├── offers/
│ ├── events/{day}.pq
│ └── items.pq
└── reviews/{day}.pq
```
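The layout above can be turned into a few small path helpers. This is a sketch under assumptions: `ROOT` is a hypothetical local folder, the helper names are made up, and `{day}` is taken to be the plain integer day index (as in the downloader's `day_begin`/`day_end`) with no zero-padding.

```python
from pathlib import Path

# Hypothetical local root; point it at wherever the dataset was downloaded.
ROOT = Path("t-ecd")


def events_path(domain, day, root=ROOT):
    """Path to one day's event log for a domain, e.g. retail/events/1300.pq."""
    return Path(root) / domain / "events" / f"{day}.pq"


def receipts_path(day, root=ROOT):
    """Path to one day's receipts, which exist only in the payments domain."""
    return Path(root) / "payments" / "receipts" / f"{day}.pq"


def reviews_path(day, root=ROOT):
    """Path to one day's reviews; reviews have no events/ subfolder."""
    return Path(root) / "reviews" / f"{day}.pq"


def catalog_path(domain, root=ROOT):
    """Path to a domain's items.pq (marketplace, retail and offers only)."""
    return Path(root) / domain / "items.pq"
```

Each returned path can then be handed to a Parquet reader such as `pandas.read_parquet`.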
#### Data availability
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/c2Clc9bNxL9i7jgGBfBq2.png" style="max-width: 80%; height: auto;" alt="Temporal distribution of events over domains">
*Temporal distribution of events over domains*
In line with real-world industrial environments, domain-specific data availability varies in historical depth.
This reflects practical constraints such as data retention policies and product lifecycle stages:
newer e-commerce services naturally have shorter histories than established banking domains like payments and transactions.
### ⚙️ Events and Catalogs
- **Events**: Each domain provides logs of user interactions with the following possible columns:
  - `action_type` — interaction type (e.g., view, click, add-to-cart, order, transaction).
  - `subdomain` — surface where the interaction occurred (recommendations, catalog, search, checkout, campaign); available in Marketplace and Retail.
  - `item_id` — present in Marketplace, Retail, and Offers; identifies a specific product or offer.
  - `brand_id` — present in all domains; denotes the seller, store, or partner associated with an item, offer, or transaction.
  - `price` — represents the monetary value of the interaction.
  - `count` — represents the number of items in a single interaction.
  - `os` — user operating system, available in Marketplace and Retail.
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Q7aeb_I-Yf-rcqyPDTOLa.png" style="max-width: 80%; height: auto;" >
- **Item catalogs (`items.pq`)**: Available for Marketplace, Retail, and Offers. Each entry includes:
  - `item_id`
  - `brand_id`
  - category information (if available)
  - pretrained embedding (if available)
- **User catalog (`users.pq`)**: Contains anonymized user attributes such as region and socio-demographic cluster.
- **Brand catalog (`brands.pq`)**: Contains `brand_id`, brand-level metadata, and embeddings.
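To illustrate how events and catalogs fit together, here is a small sketch that attaches catalog fields to event rows by `item_id`. The `category` field name and the `attach_catalog` helper are assumptions made for the example (the catalog is only said to contain "category information"), and the rows are toy data.

```python
def attach_catalog(events, items, fields=("category",)):
    """Left-join catalog fields onto events by item_id.

    Events whose item_id is missing from the catalog get None for each field.
    """
    by_item = {item["item_id"]: item for item in items}
    enriched = []
    for event in events:
        item = by_item.get(event.get("item_id"), {})
        enriched.append({**event, **{f: item.get(f) for f in fields}})
    return enriched


# Toy catalog and events (values are made up).
items = [{"item_id": 100, "brand_id": 10, "category": "books"}]
events = [
    {"user_id": 1, "item_id": 100, "action_type": "view"},
    {"user_id": 1, "item_id": 999, "action_type": "click"},
]

enriched = attach_catalog(events, items)
```

A left join is used so that events referring to items without catalog entries are kept rather than silently dropped.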
#### 🧾 Special Structures
- **Receipts (`payments/receipts/{day}.pq`)**:
Some transactions include detailed receipts with purchased items, their quantities, and prices.
Items are aligned with Marketplace and Retail catalogs, enabling fine-grained cross-domain linkage at the product level.
- **Reviews (`reviews/{day}.pq`)**:
Provide explicit ratings per brand.
Raw text reviews are not included; instead, we release pretrained text embeddings to preserve privacy while enabling multimodal research.
---
### 🛠️ Data Collection
T-ECD was generated through a multi-step process:
1. **Sampling of event chains**: sequences of interactions were sampled from real logs of T-Bank ecosystem services.
2. **Anonymization**: user and brand identifiers were pseudonymized; sensitive attributes removed.
3. **Synthetic generation**: based on real distributions and event patterns, new synthetic interaction chains were produced, preserving structural properties such as sparsity, heavy tails, cross-domain overlaps, and behavioral contexts.
This process ensures that the dataset is privacy-preserving while remaining representative of industrial recommender system data.
## ⚠️ Important Note on Temporal Data Usage
<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/zaPAcuD3CItTzP2PBkErs.png" style="max-width: 80%; height: auto;">
**To prevent data leakage, events from the 12 hours preceding the prediction timestamp should not be used for prediction tasks.**
The dataset contains temporal noise that requires maintaining a minimum 12-hour gap between the timestamp of the most recent user event and the prediction timestamp.
This constraint applies to both training and testing scenarios to avoid temporal data leakage.
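The 12-hour rule can be sketched as a simple cutoff filter. The `timestamp` field name and the use of `datetime` values are assumptions for illustration; the actual column name and type in the event files are not stated here, and `usable_history` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Minimum gap between the latest usable event and the prediction time.
MIN_GAP = timedelta(hours=12)


def usable_history(events, prediction_time, gap=MIN_GAP):
    """Keep only events that are at least `gap` older than `prediction_time`."""
    cutoff = prediction_time - gap
    return [event for event in events if event["timestamp"] <= cutoff]


# Toy events for one user (values are made up).
prediction_time = datetime(2025, 1, 2, 12, 0)
events = [
    {"user_id": 1, "timestamp": datetime(2025, 1, 1, 8, 0)},  # 28 h before: kept
    {"user_id": 1, "timestamp": datetime(2025, 1, 2, 0, 0)},  # exactly 12 h: kept
    {"user_id": 1, "timestamp": datetime(2025, 1, 2, 6, 0)},  # 6 h before: dropped
]

history = usable_history(events, prediction_time)
```

Applying the same filter when building both training and test examples keeps the two consistent with the constraint.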
## Download
#### Basic Download
```python
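from huggingface_hub import snapshot_download

# Download the full variant of the dataset into a local folder.
snapshot_download(
    repo_id="t-tech/T-ECD",
    repo_type="dataset",
    allow_patterns="dataset/full/",  # restrict the download to this folder
    local_dir="./t_ecd_data",
    token="<your_hf_token>",  # replace with your Hugging Face access token
)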
```
#### Selective Download
For advanced usage, including selecting domains and date ranges, we provide a custom downloader: [tecd_downloader.py](https://huggingface.co/datasets/t-tech/T-ECD/blob/main/tecd_downloader.py).
Example usage:
```python
from tecd_downloader import download_dataset

# Fetch a slice of T-ECD Small: two domains over a short day range.
download_dataset(
    token="<your_hf_token>",
    dataset_path="dataset/small",
    local_dir="t_ecd_small_partial",
    domains=["retail", "marketplace"],
    day_begin=1300,
    day_end=1308,
    max_workers=10,
)
```
---
### 🔐 License
This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
---
Provider: maas
Created: 2025-10-04



