
oristides/meli-sessions-enriched

Source: Hugging Face. Updated 2026-03-17. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/oristides/meli-sessions-enriched
Resource description:
---
dataset_info:
  features:
  - name: item_ids
    list: int64
  - name: event_types
    list: int64
  - name: domain_idx
    list: int64
  - name: category_idx
    list: int64
  - name: product_idx
    list: int64
  - name: condition_idx
    list: int64
  - name: price
    list: float64
  - name: position
    list: int64
  - name: time_gap
    list: float64
  - name: day_period
    list: int64
  - name: is_repeat
    list: int64
  - name: target_item
    dtype: int64
  - name: first_event_timestamp
    dtype: timestamp[ns, tz=-04:00]
  - name: last_event_timestamp
    dtype: timestamp[ns, tz=-04:00]
  splits:
  - name: train
    num_bytes: 1084618701
    num_examples: 413163
  - name: test
    num_bytes: 468797047
    num_examples: 177070
  download_size: 232331848
  dataset_size: 1553415748
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: apache-2.0
pretty_name: m
---

# Meli Sessions Enriched

Session-based recommendation dataset built from the [Meli Data Challenge 2020](https://www.kaggle.com/datasets/marlesson/meli-data-challenge-2020).

Each row in this dataset is **one user session**, constructed from the original `user_history` field (and, for train, the `item_bought` field). The goal is to provide a clean, time-aware session dataset for sequence modeling and next-item prediction.

---

## Source

- Original data: **Meli Data Challenge 2020** on Kaggle
- Author: `marlesson`
- Files used: `train_dataset.jl`, `test_dataset.jl`, `item_data.jl`
- This dataset: preprocessed version with:
  - One row per session.
  - Numeric encodings for item metadata.
  - Time-based features and repeat-visit flags.

Please refer to the original Kaggle page for the underlying license and competition details.

---

## Splits

This dataset has four splits:

- `train`: sessions with a known purchased item (`target_item`).
- `test`: sessions **without** purchase labels in the original data (`target_item` is set to `0` as a placeholder).
- `train_meta`: per-chunk metadata for the `train` Parquet parts.
- `test_meta`: per-chunk metadata for the `test` Parquet parts.

Typical cardinalities (depending on preprocessing version):

- `train`: ~410k sessions
- `test`: ~177k sessions

Each session can have a variable number of events.

---

## Column description (sessions)

All list-valued columns have **one element per event in the session**, ordered by timestamp.

### Core sequence columns

- **`item_ids`** – `list[int]`
  Item ID for each event.
  - For `view` events: the numeric item ID.
  - For `search` events: `0` (no specific item is attached).
- **`event_types`** – `list[int]`
  Encoded event type per step:
  - `0` = view (item page view)
  - `1` = search (search query event)
- **`domain_idx`** – `list[int]`
  Encoded domain index for the item at each step (derived from `domain_id` in `item_data.jl`).
- **`category_idx`** – `list[int]`
  Encoded category index for the item at each step (derived from `category_id`).
- **`product_idx`** – `list[int]`
  Encoded product index for the item at each step (derived from `product_id`, with missing values mapped to a special category).
- **`condition_idx`** – `list[int]`
  Encoded condition index for the item at each step (derived from textual condition like `new`, `used`).
- **`price`** – `list[float]`
  Log-transformed price per event:
  - `price[t] = log1p(original_price)` for item events.
  - `0.0` for non-item events (e.g. search).
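Because the core sequence columns are parallel lists, per-event filtering works index by index. The following is a minimal sketch, assuming a hand-written toy session (the values are made up for illustration and are not taken from the dataset) and a hypothetical helper named `item_events`. It keeps only view events and inverts the `log1p` price transform with `expm1`:

```python
import math

# Toy session with made-up values; real rows use the same parallel-list layout.
session = {
    "item_ids": [1001, 0, 1002, 1001],
    "event_types": [0, 1, 0, 0],  # 0 = view, 1 = search
    "price": [math.log1p(250.0), 0.0, math.log1p(99.9), math.log1p(250.0)],
}

VIEW = 0  # event type code for item page views, per the column description


def item_events(row):
    """Return (item_id, original_price) pairs for view events only.

    `item_events` is a hypothetical helper, not part of any dataset tooling.
    """
    pairs = []
    for item_id, event_type, log_price in zip(
        row["item_ids"], row["event_types"], row["price"]
    ):
        if event_type != VIEW:
            continue  # search events carry item_id 0 and price 0.0
        # expm1 is the inverse of log1p, so this recovers the raw price.
        pairs.append((item_id, math.expm1(log_price)))
    return pairs


views = item_events(session)
print(views)
```

Filtering on `event_types` rather than on `item_ids != 0` keeps the intent explicit; both conditions should agree if the encoding described above holds.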
### Behavioral / temporal features

- **`position`** – `list[int]`
  Zero-based position of each event within the session:
  - `0, 1, 2, ..., len(session) - 1`
- **`time_gap`** – `list[float]`
  Time in **seconds** since the previous event in the same session:
  - First event: `0`
  - Subsequent events: `(timestamp[t] - timestamp[t-1]).total_seconds()`
- **`day_period`** – `list[int]`
  Discrete time-of-day bucket for each event, based on the event’s hour:
  - `0` = morning (05:00–12:00)
  - `1` = afternoon (12:00–17:00)
  - `2` = evening (17:00–21:00)
  - `3` = night (21:00–05:00)
- **`is_repeat`** – `list[int]`
  Indicates whether the item at a step has been seen **earlier in the same session**:
  - `1` = item already appeared before in this session
  - `0` = first time this item appears in this session

  (Search events or `item_id == 0` are treated as non-items.)

### Target and timestamps

- **`target_item`** – `int`
  - In **train**: the ID of the purchased item (`item_bought` in the original data).
  - In **test**: always `0` (no purchase labels are provided in the original file).
- **`first_event_timestamp`** – `timestamp`
  Datetime of the **first** event in the session (after sorting).
- **`last_event_timestamp`** – `timestamp`
  Datetime of the **last** event in the session. This is useful for:
  - Time-based splits (using earlier sessions for training, later for validation/testing).
  - Popularity or recency-based baselines that must avoid leaking future information.

---

## Chunk metadata (`train_meta`, `test_meta`)

These splits describe the Parquet parts used to build the dataset. Each row corresponds to one `part_*.parquet` file in the original processing pipeline.

Columns:

- **`chunk_idx`** – `int`
  Numeric index of the chunk (matches file name `part_<chunk_idx>.parquet`).
- **`path`** – `string`
  File name of the chunk, e.g. `part_0.parquet`.
- **`n_sessions`** – `int`
  Number of sessions in this chunk.
- **`min_last_event_timestamp`** – `timestamp`
  Earliest `last_event_timestamp` in the chunk.
- **`max_last_event_timestamp`** – `timestamp`
  Latest `last_event_timestamp` in the chunk.

This metadata allows reconstructing approximate temporal ordering of chunks and building time-aware evaluation schemes (e.g. using earlier chunks for training and later chunks for validation).

---

## Example usage

```python
from datasets import load_dataset

ds = load_dataset("oristides/meli-sessions-enriched")

train = ds["train"]
test = ds["test"]

print(train[0]["item_ids"], train[0]["target_item"])

# The card's YAML metadata declares only `train` and `test` for the default
# config, so check that the metadata split is present before indexing it.
if "train_meta" in ds:
    print(ds["train_meta"][0])
```
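The time-based splits mentioned above can be built from `last_event_timestamp` alone. The following is a minimal sketch, assuming toy rows with made-up timestamps and a hypothetical helper named `time_split` that sends the most recent fraction of sessions to validation. It uses plain Python lists instead of the real dataset so that it runs without a download:

```python
from datetime import datetime, timedelta, timezone

# Fixed -04:00 offset, matching the tz shown in the card metadata.
TZ = timezone(timedelta(hours=-4))

# Toy rows with made-up values; only the fields needed for the split are shown.
rows = [
    {"target_item": 11, "last_event_timestamp": datetime(2019, 10, 5, 9, 0, tzinfo=TZ)},
    {"target_item": 12, "last_event_timestamp": datetime(2019, 10, 1, 14, 30, tzinfo=TZ)},
    {"target_item": 13, "last_event_timestamp": datetime(2019, 10, 9, 22, 15, tzinfo=TZ)},
    {"target_item": 14, "last_event_timestamp": datetime(2019, 10, 3, 7, 45, tzinfo=TZ)},
    {"target_item": 15, "last_event_timestamp": datetime(2019, 10, 7, 18, 0, tzinfo=TZ)},
]


def time_split(sessions, valid_fraction=0.2):
    """Split sessions chronologically: earlier ones train, later ones validate.

    `time_split` is a hypothetical helper, not part of any dataset tooling.
    """
    ordered = sorted(sessions, key=lambda r: r["last_event_timestamp"])
    # Always hold out at least one session so validation is never empty.
    n_valid = max(1, int(len(ordered) * valid_fraction))
    cutoff = len(ordered) - n_valid
    return ordered[:cutoff], ordered[cutoff:]


train_part, valid_part = time_split(rows)
print(len(train_part), len(valid_part))
```

Splitting on the session's last event, rather than its first, means no training session ends after a validation session does, which is the property a recency or popularity baseline needs to avoid leaking future information. The same idea should carry over to the loaded dataset through `Dataset.sort` and `Dataset.select`, though that is not shown here.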
Provider:
oristides