
luffy9/sql

Hugging Face, updated 2026-03-26, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/luffy9/sql
Resource description:
---
license: cc-by-nc-sa-4.0
tags:
- recsys
- e-commerce
- retrieval
- dataset
- ranking
- cross-domain
language:
- ru
- en
size_categories:
- 100B<n<1T
pretty_name: T-ECD
---

# T-ECD: T-Tech E-commerce Cross-Domain Dataset

![image (2)](https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/t8w2QTQbnH0DhEIbJYCnT.png)

⭐️ **T-ECD** is a large-scale synthetic cross-domain dataset for recommender systems research, created by T-Bank's RecSys R&D team. It captures real-world e-commerce interaction patterns across multiple domains while ensuring complete anonymity through synthetic generation.

### 🎯 Overview

T-ECD represents user interactions across five different e-commerce domains within a banking ecosystem:

- **Marketplace**: browsing and interacting with items in an e-commerce marketplace.
- **Retail**: interactions within a retail delivery service, including cart additions and completed orders.
- **Payments**: online and offline financial transactions between users and brands.
- **Offers**: responses to promotional content such as impressions, clicks, and partner transitions.
- **Reviews**: explicit user feedback in the form of ratings and embeddings of textual comments.

**Scale:**

- **~135B** interactions
- ~44M users
- ~30M items
- **1300+ days of temporal coverage**

Additionally, we provide **T-ECD Small**, a compact version containing ~1B interactions that excludes the Payments domain.

<div style="font-size: 1.1em;">

| Metric | T-ECD Small | T-ECD Full |
|--------|-------------|------------|
| 🔄 **Interactions** | ~1B | **~135B** |
| 👥 **Users** | ~3.5M | **~44M** |
| 📦 **Items** | ~2.6M | **~30M** |
| 🏪 **Brands** | ~29K | **~1M** |
| 📅 **Temporal Coverage** | 200+ days | **1300+ days** |
| 🌐 **Domains** | 4 (excl. Payments) | **5 (all domains)** |

</div>

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Y3hHv_cipdq2p4A9jiQoz.png" style="max-width: 80%; height: auto;">

Cross-domain consistency is achieved by aligning identifiers across all domains:

- the same `user_id` always refers to the same individual user, and
- the same `brand_id` always refers to the same brand entity.

This alignment allows researchers to link interactions from different services, enabling studies in transfer learning, cross-domain personalization, and multi-task modeling.

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/QG0DavvcvccN1GcN_gRL6.png" style="max-width: 80%; height: auto;">

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/s8a8iC4RmUjsD_hzOVPvD.png" style="max-width: 80%; height: auto;">

---

### 📂 Data Schema

The dataset is stored in **Parquet** format with daily partitions (`{day}`). The directory structure is as follows:

```
t-ecd/
├── users.pq
├── brands.pq
├── marketplace/
│   ├── events/{day}.pq
│   └── items.pq
├── retail/
│   ├── events/{day}.pq
│   └── items.pq
├── payments/
│   ├── events/{day}.pq
│   └── receipts/{day}.pq
├── offers/
│   ├── events/{day}.pq
│   └── items.pq
└── reviews/{day}.pq
```

#### Data availability

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/c2Clc9bNxL9i7jgGBfBq2.png" style="max-width: 80%; height: auto;" alt="Temporal distribution of events over domains">

*Temporal distribution of events over domains*

In line with real-world industrial environments, domain-specific data availability varies in historical depth. This reflects practical constraints including data retention policies and product lifecycle stages: newer e-commerce services naturally have shorter histories than established banking domains such as payments and transactions.
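
The directory layout in the Data Schema section can be turned into file paths with a small helper. This is a minimal sketch, not part of the dataset tooling: the function name `event_paths` and the `EVENT_DIRS` mapping are hypothetical, and it assumes that `{day}` is an integer index and that the end day is inclusive, neither of which the card states explicitly.

```python
from pathlib import Path

# Event folders per domain, copied from the directory tree in the card.
# Reviews are stored directly under reviews/, without an events/ subfolder.
EVENT_DIRS = {
    "marketplace": "marketplace/events",
    "retail": "retail/events",
    "payments": "payments/events",
    "offers": "offers/events",
    "reviews": "reviews",
}


def event_paths(root, domain, day_begin, day_end):
    """Return the expected daily partition paths for one domain.

    Assumption: {day} is an integer and day_end is inclusive.
    """
    base = Path(root) / EVENT_DIRS[domain]
    return [base / f"{day}.pq" for day in range(day_begin, day_end + 1)]


# Example: three daily partitions of the Retail event log.
for path in event_paths("t-ecd", "retail", 1300, 1302):
    print(path.as_posix())
```

Each returned path can then be handed to a Parquet reader of your choice, for example `pandas.read_parquet`, once the corresponding files have been downloaded.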
### ⚙️ Events and Catalogs

- **Events**: Each domain provides logs of user interactions with the following possible columns:
  - `action_type`: interaction type (e.g., view, click, add-to-cart, order, transaction).
  - `subdomain`: surface where the interaction occurred (recommendations, catalog, search, checkout, campaign); available in Marketplace and Retail.
  - `item_id`: present in Marketplace, Retail, and Offers; identifies a specific product or offer.
  - `brand_id`: present in all domains; denotes the seller, store, or partner associated with an item, offer, or transaction.
  - `price`: the monetary value of the interaction.
  - `count`: the number of items in a single interaction.
  - `os`: user operating system; available in Marketplace and Retail.

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/Q7aeb_I-Yf-rcqyPDTOLa.png" style="max-width: 80%; height: auto;" >

- **Item catalogs (`items.pq`)**: Available for Marketplace, Retail, and Offers. Each entry includes:
  - `item_id`
  - `brand_id`
  - category information (if available)
  - pretrained embedding (if available)
- **User catalog (`users.pq`)**: Contains anonymized user attributes such as region and socio-demographic cluster.
- **Brand catalog (`brands.pq`)**: Contains `brand_id`, brand-level metadata, and embeddings.

#### 🧾 Special Structures

- **Receipts (`payments/receipts/{day}.pq`)**: Some transactions include detailed receipts with purchased items, their quantities, and prices. Items are aligned with Marketplace and Retail catalogs, enabling fine-grained cross-domain linkage at the product level.
- **Reviews (`reviews/{day}.pq`)**: Provide explicit ratings per brand. Raw text reviews are not included; instead, we release pretrained text embeddings to preserve privacy while enabling multimodal research.

---

### 🛠️ Data Collection

T-ECD was generated through a multi-step process:

1. **Sampling of event chains**: sequences of interactions were sampled from real logs of T-Bank ecosystem services.
2. **Anonymization**: user and brand identifiers were pseudonymized, and sensitive attributes were removed.
3. **Synthetic generation**: based on real distributions and event patterns, new synthetic interaction chains were produced, preserving structural properties such as sparsity, heavy tails, cross-domain overlaps, and behavioral contexts.

This process ensures that the dataset is privacy-preserving while remaining representative of industrial recommender system data.

## ⚠️ Important Note on Temporal Data Usage

<img src="https://cdn-uploads.huggingface.co/production/uploads/645d4947f5760d1530d55023/zaPAcuD3CItTzP2PBkErs.png" style="max-width: 80%; height: auto;">

**To prevent data leakage, events from the final 12 hours should not be used for prediction tasks.**

The dataset contains temporal noise, so a gap of at least 12 hours must be kept between the timestamp of the most recent user event and the prediction timestamp. This constraint applies to both training and testing scenarios.
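
The 12-hour rule can be applied as a simple cutoff on event timestamps. The sketch below is illustrative only: the function name `usable_history` is hypothetical, and it assumes each event is a dict with a `timestamp` key holding a `datetime`. The actual timestamp column name and type in the Parquet files are not stated in this card.

```python
from datetime import datetime, timedelta

# Minimum gap between the latest usable event and the prediction time.
MIN_GAP = timedelta(hours=12)


def usable_history(events, prediction_time, gap=MIN_GAP):
    """Keep only events that are at least `gap` older than the prediction time.

    Assumption: every event is a dict with a datetime under "timestamp".
    """
    cutoff = prediction_time - gap
    return [event for event in events if event["timestamp"] <= cutoff]


events = [
    {"timestamp": datetime(2024, 1, 1, 0, 0), "action_type": "view"},
    {"timestamp": datetime(2024, 1, 1, 18, 0), "action_type": "click"},
]

# Predicting at midnight on 2 January: the cutoff is noon on 1 January,
# so only the first event may be used as history.
kept = usable_history(events, datetime(2024, 1, 2, 0, 0))
print(len(kept))  # 1
```

The same cutoff should be applied when building both training and evaluation examples, so that neither sees events from inside the 12-hour window.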
## Download

#### Basic Download

```python
from huggingface_hub import snapshot_download

# Download the full dataset into a local directory.
snapshot_download(
    repo_id="t-tech/T-ECD",
    repo_type="dataset",
    allow_patterns="dataset/full/",  # restrict the download to the full version
    local_dir="./t_ecd_data",
    token="<your_hf_token>",
)
```

#### Selective Download

For advanced usage, including selection of domains and date ranges, we provide a custom downloader: [tecd_downloader.py](https://huggingface.co/datasets/t-tech/T-ECD/blob/main/tecd_downloader.py)

Example usage:

```python
# tecd_downloader.py must be available on the Python import path.
from tecd_downloader import download_dataset

download_dataset(
    token="<your_hf_token>",
    dataset_path="dataset/small",
    local_dir="t_ecd_small_partial",
    domains=["retail", "marketplace"],
    day_begin=1300,
    day_end=1308,
    max_workers=10,
)
```

---

### 🔐 License

This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

---
Provider:
luffy9