oristides/meli-sessions-enriched
Source: Hugging Face. Updated: 2026-03-17. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/oristides/meli-sessions-enriched
Resource description:
---
dataset_info:
  features:
  - name: item_ids
    list: int64
  - name: event_types
    list: int64
  - name: domain_idx
    list: int64
  - name: category_idx
    list: int64
  - name: product_idx
    list: int64
  - name: condition_idx
    list: int64
  - name: price
    list: float64
  - name: position
    list: int64
  - name: time_gap
    list: float64
  - name: day_period
    list: int64
  - name: is_repeat
    list: int64
  - name: target_item
    dtype: int64
  - name: first_event_timestamp
    dtype: timestamp[ns, tz=-04:00]
  - name: last_event_timestamp
    dtype: timestamp[ns, tz=-04:00]
  splits:
  - name: train
    num_bytes: 1084618701
    num_examples: 413163
  - name: test
    num_bytes: 468797047
    num_examples: 177070
  download_size: 232331848
  dataset_size: 1553415748
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: apache-2.0
pretty_name: m
---
# Meli Sessions Enriched
Session-based recommendation dataset built from the
[Meli Data Challenge 2020](https://www.kaggle.com/datasets/marlesson/meli-data-challenge-2020).
Each row in this dataset is **one user session**, constructed from the original `user_history` field
(and, for train, the `item_bought` field). The goal is to provide a clean, time-aware session dataset
for sequence modeling and next-item prediction.
---
## Source
- Original data: **Meli Data Challenge 2020** on Kaggle
- Author: `marlesson`
- Files used: `train_dataset.jl`, `test_dataset.jl`, `item_data.jl`
- This dataset: preprocessed version with:
  - One row per session.
  - Numeric encodings for item metadata.
  - Time-based features and repeat-visit flags.

Please refer to the original Kaggle page for the underlying license and competition details.
---
## Splits
This card describes four splits (the `dataset_info` header lists only `train` and `test`):
- `train`: sessions with a known purchased item (`target_item`).
- `test`: sessions **without** purchase labels in the original data
(`target_item` is set to `0` as a placeholder).
- `train_meta`: per-chunk metadata for the `train` Parquet parts.
- `test_meta`: per-chunk metadata for the `test` Parquet parts.
Typical cardinalities (depending on preprocessing version):
- `train`: ~413k sessions
- `test`: ~177k sessions
Each session can have a variable number of events.
---
## Column description (sessions)
All list-valued columns have **one element per event in the session**, ordered by timestamp.
### Core sequence columns
- **`item_ids`** – `list[int]`
  Item ID for each event.
  - For `view` events: the numeric item ID.
  - For `search` events: `0` (no specific item is attached).
- **`event_types`** – `list[int]`
  Encoded event type per step:
  - `0` = view (item page view)
  - `1` = search (search query event)
- **`domain_idx`** – `list[int]`
  Encoded domain index for the item at each step (derived from `domain_id` in `item_data.jl`).
- **`category_idx`** – `list[int]`
  Encoded category index for the item at each step (derived from `category_id`).
- **`product_idx`** – `list[int]`
  Encoded product index for the item at each step (derived from `product_id`, with missing values
  mapped to a special category).
- **`condition_idx`** – `list[int]`
  Encoded condition index for the item at each step (derived from textual condition like `new`, `used`).
- **`price`** – `list[float]`
  Log-transformed price per event:
  - `price[t] = log1p(original_price)` for item events.
  - `0.0` for non-item events (e.g. search).
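
As an illustration of how the core columns line up, the sketch below walks one session step by step, skips search events, and recovers approximate original prices with `expm1` (the inverse of `log1p`). The helper name `item_events` and the sample session values are hypothetical and are not taken from the dataset.

```python
import math

VIEW, SEARCH = 0, 1  # event type codes as documented above


def item_events(session):
    """Return (item_id, original_price) pairs for the view events of one session.

    `session` is a dict holding the list-valued columns described above.
    Prices are stored as log1p(original_price), so expm1 inverts them.
    """
    pairs = []
    for item_id, event_type, log_price in zip(
        session["item_ids"], session["event_types"], session["price"]
    ):
        if event_type == SEARCH:
            continue  # search steps carry item_id 0 and price 0.0
        pairs.append((item_id, math.expm1(log_price)))
    return pairs


# Hypothetical session: view, search, view.
example = {
    "item_ids": [101, 0, 202],
    "event_types": [VIEW, SEARCH, VIEW],
    "price": [math.log1p(50.0), 0.0, math.log1p(120.0)],
}
decoded = item_events(example)
print(decoded)
```

Because every list column has one element per event, the same `zip` pattern extends to `domain_idx`, `category_idx`, and the other per-step columns.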
### Behavioral / temporal features
- **`position`** – `list[int]`
  Zero-based position of each event within the session:
  - `0, 1, 2, ..., len(session) - 1`
- **`time_gap`** – `list[float]`
  Time in **seconds** since the previous event in the same session:
  - First event: `0`
  - Subsequent events: `(timestamp[t] - timestamp[t-1]).total_seconds()`
- **`day_period`** – `list[int]`
  Discrete time-of-day bucket for each event, based on the event’s hour:
  - `0` = morning (05:00–12:00)
  - `1` = afternoon (12:00–17:00)
  - `2` = evening (17:00–21:00)
  - `3` = night (21:00–05:00)
- **`is_repeat`** – `list[int]`
  Indicates whether the item at a step has been seen **earlier in the same session**:
  - `1` = item already appeared before in this session
  - `0` = first time this item appears in this session
    (search events or `item_id == 0` are treated as non-items.)
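
The definitions above can be sketched as a small function that derives all four features from a sorted list of item IDs and event timestamps. This is a reconstruction from the descriptions, not the original preprocessing code. It assumes the time-of-day buckets are half-open (for example, 12:00 counts as afternoon) and that search steps always get `is_repeat = 0`. The names `day_period_of` and `behavioral_features` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def day_period_of(hour):
    """Map an hour (0-23) to the documented bucket, assuming half-open ranges."""
    if 5 <= hour < 12:
        return 0  # morning
    if 12 <= hour < 17:
        return 1  # afternoon
    if 17 <= hour < 21:
        return 2  # evening
    return 3  # night, which wraps past midnight


def behavioral_features(item_ids, timestamps):
    """Derive position, time_gap, day_period and is_repeat for one sorted session."""
    seen = set()
    out = {"position": [], "time_gap": [], "day_period": [], "is_repeat": []}
    for t, (item_id, ts) in enumerate(zip(item_ids, timestamps)):
        out["position"].append(t)
        gap = 0.0 if t == 0 else (ts - timestamps[t - 1]).total_seconds()
        out["time_gap"].append(gap)
        out["day_period"].append(day_period_of(ts.hour))
        # Assumption: item_id == 0 (search) is a non-item and never a repeat.
        out["is_repeat"].append(1 if item_id != 0 and item_id in seen else 0)
        if item_id != 0:
            seen.add(item_id)
    return out


# Hypothetical session: view item 7, search, view item 7 again, view item 9.
features = behavioral_features(
    [7, 0, 7, 9],
    [
        datetime(2019, 10, 1, 11, 59, 0),
        datetime(2019, 10, 1, 12, 0, 30),
        datetime(2019, 10, 1, 12, 1, 0),
        datetime(2019, 10, 1, 22, 0, 0),
    ],
)
print(features)
```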
### Target and timestamps
- **`target_item`** – `int`
  - In **train**: the ID of the purchased item (`item_bought` in the original data).
  - In **test**: always `0` (no purchase labels are provided in the original file).
- **`first_event_timestamp`** – `timestamp`
  Datetime of the **first** event in the session (after sorting).
- **`last_event_timestamp`** – `timestamp`
  Datetime of the **last** event in the session.
  This is useful for:
  - Time-based splits (using earlier sessions for training, later for validation/testing).
  - Popularity or recency-based baselines that must avoid leaking future information.
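
A minimal sketch of both uses: split sessions at a cutoff on `last_event_timestamp`, then rank items by purchase count using only the earlier part, so the baseline never sees future purchases. The function names, the cutoff, and the sample sessions are hypothetical. The fixed `-04:00` offset mirrors the timestamp type declared in the dataset header.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=-4))  # matches the tz=-04:00 timestamp columns


def temporal_split(sessions, cutoff):
    """Return (earlier, later): sessions ending before cutoff vs. at or after it."""
    earlier = [s for s in sessions if s["last_event_timestamp"] < cutoff]
    later = [s for s in sessions if s["last_event_timestamp"] >= cutoff]
    return earlier, later


def popularity_ranking(sessions):
    """Rank purchased items by frequency, most popular first."""
    counts = Counter(s["target_item"] for s in sessions)
    return [item for item, _ in counts.most_common()]


# Hypothetical sessions: (day of October 2019, purchased item).
sessions = [
    {"last_event_timestamp": datetime(2019, 10, day, 12, 0, tzinfo=TZ), "target_item": item}
    for day, item in [(1, 5), (2, 5), (3, 8), (10, 8), (11, 8)]
]

train_part, valid_part = temporal_split(sessions, datetime(2019, 10, 5, tzinfo=TZ))

# Item 5 leads when only past sessions are counted; counting all sessions
# would put item 8 first, which is the leak the split is meant to prevent.
ranked = popularity_ranking(train_part)
print(len(train_part), len(valid_part), ranked)
```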
---
## Chunk metadata (`train_meta`, `test_meta`)
These splits describe the Parquet parts used to build the dataset. Each row corresponds to one
`part_*.parquet` file in the original processing pipeline.
Columns:
- **`chunk_idx`** – `int`
Numeric index of the chunk (matches file name `part_<chunk_idx>.parquet`).
- **`path`** – `string`
File name of the chunk, e.g. `part_0.parquet`.
- **`n_sessions`** – `int`
Number of sessions in this chunk.
- **`min_last_event_timestamp`** – `timestamp`
Earliest `last_event_timestamp` in the chunk.
- **`max_last_event_timestamp`** – `timestamp`
Latest `last_event_timestamp` in the chunk.
This metadata allows reconstructing approximate temporal ordering of chunks and
building time-aware evaluation schemes (e.g. using earlier chunks for training and later chunks for validation).
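
One way to turn this metadata into a chunk-level split is sketched below: sort chunks by `max_last_event_timestamp` and train only on chunks that end before a cutoff. Since the ordering is only approximate and chunks may overlap in time, the rule used here is conservative and is an assumption, not the author's documented procedure. The function name `split_chunks` and the sample rows and dates are hypothetical.

```python
from datetime import datetime


def split_chunks(meta_rows, cutoff):
    """Return (train_paths, valid_paths) in temporal order.

    A chunk goes to training only if all of its sessions end before `cutoff`,
    i.e. its max_last_event_timestamp is earlier than the cutoff.
    """
    ordered = sorted(meta_rows, key=lambda row: row["max_last_event_timestamp"])
    train_paths = [r["path"] for r in ordered if r["max_last_event_timestamp"] < cutoff]
    valid_paths = [r["path"] for r in ordered if r["max_last_event_timestamp"] >= cutoff]
    return train_paths, valid_paths


# Hypothetical metadata rows, deliberately out of order and overlapping in time.
meta_rows = [
    {"chunk_idx": 2, "path": "part_2.parquet", "n_sessions": 1000,
     "min_last_event_timestamp": datetime(2019, 10, 18),
     "max_last_event_timestamp": datetime(2019, 10, 31)},
    {"chunk_idx": 0, "path": "part_0.parquet", "n_sessions": 1000,
     "min_last_event_timestamp": datetime(2019, 10, 1),
     "max_last_event_timestamp": datetime(2019, 10, 10)},
    {"chunk_idx": 1, "path": "part_1.parquet", "n_sessions": 1000,
     "min_last_event_timestamp": datetime(2019, 10, 8),
     "max_last_event_timestamp": datetime(2019, 10, 20)},
]

train_paths, valid_paths = split_chunks(meta_rows, datetime(2019, 10, 21))
print(train_paths, valid_paths)
```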
---
## Example usage
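```python
from datasets import load_dataset

ds = load_dataset("oristides/meli-sessions-enriched")
train = ds["train"]
test = ds["test"]

# First session: per-event item IDs and the purchased item.
print(train[0]["item_ids"], train[0]["target_item"])

# Chunk metadata is printed only if the `train_meta` split is actually present.
train_meta = ds.get("train_meta")
if train_meta is not None:
    print(train_meta[0])
```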
Provider:
oristides



