erickfm/mimic-melee

Hugging Face, updated 2026-03-18, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/erickfm/mimic-melee
Description:
---
language:
- en
license: cc0-1.0
tags:
- melee
- smash-bros
- slippi
- imitation-learning
- controller-inputs
- fighting-games
- pytorch
pretty_name: MIMIC Melee
size_categories:
- 100K<n<1M
---

# MIMIC Melee

Pretokenized tensor shards for training [MIMIC](https://github.com/erickfm/MIMIC), an imitation-learning bot for Super Smash Bros. Melee. Each shard is a ready-to-train PyTorch file containing normalized game-state features and controller-input targets, so no preprocessing is needed at load time.

## Source

Built from [slippi-public-dataset-v3.7](https://huggingface.co/datasets/erickfm/slippi-public-dataset-v3.7) (~95,102 raw Slippi tournament replays, compiled by **altf4** on the [Slippi Discord](https://discord.gg/slippi), CC0 licensed). Raw `.slp` replays are converted to per-frame parquet files using [slippi-frame-extractor](https://github.com/erickfm/slippi-frame-extractor), then tensorized and uploaded via `tools/upload_dataset.py` in streaming mode with 64 multiprocessing workers.

### Replay to game expansion

Each replay contains two players.
Every replay is tensorized from **both players' perspectives**, doubling the training data:

| Split | Replays | Games (2x perspectives) |
|---|---|---|
| Train | ~84,788 | 169,575 |
| Val | ~9,421 | 18,841 |
| **Total** | **~94,208** | **188,416** |

## Dataset statistics

| Split | Games | Frames | Shards |
|-------|-------|--------|--------|
| Train | 169,575 | 1,631,777,124 | 582 |
| Val | 18,841 | 180,971,668 | 65 |
| **Total** | **188,416** | **1,812,748,792** | **647** |

- **Total size:** 2.59 TB
- **Shard size:** ~4 GB each
- **Val split:** 10% (seed 42)
- **Format:** Per-game concatenated tensors with offset arrays for dynamic windowing

## Shard format

Each `.pt` file contains a dict:

```python
{
    "states": {feature_name: Tensor},   # normalized game-state features
    "targets": {head_name: Tensor},     # controller-input targets
    "offsets": [int, ...],              # game boundary indices along time axis
    "n_games": int,                     # number of games in this shard
}
```

Multiple games are concatenated along the time axis (axis 0). The `offsets` array marks where each game begins, enabling dynamic windowing during training without pre-creating all sliding windows.

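To make the role of `offsets` concrete, here is a minimal sketch of dynamic windowing in plain Python. It is not taken from the MIMIC codebase: the function names are hypothetical, and it assumes `offsets` holds only the start index of each game, with the total frame count read separately from the length of any tensor along axis 0. If the real shards store a trailing end sentinel, `game_spans` would need adjusting.

```python
import bisect


def game_spans(offsets, total_frames):
    """Turn game-start offsets into (start, end) pairs; end is exclusive.

    Assumption: offsets contains starts only, so the last game ends
    at total_frames.
    """
    ends = list(offsets[1:]) + [total_frames]
    return list(zip(offsets, ends))


def count_windows(offsets, total_frames, window):
    """Cumulative number of windows that fit entirely inside each game."""
    cumulative, running = [], 0
    for start, end in game_spans(offsets, total_frames):
        # A game shorter than the window contributes no windows.
        running += max(0, (end - start) - window + 1)
        cumulative.append(running)
    return cumulative


def locate_window(index, offsets, total_frames, window):
    """Map a flat window index to (start, end) frame bounds on axis 0.

    Windows never cross a game boundary, and nothing is materialized
    ahead of time: only the bounds are computed.
    """
    cumulative = count_windows(offsets, total_frames, window)
    if not 0 <= index < cumulative[-1]:
        raise IndexError(index)
    game = bisect.bisect_right(cumulative, index)
    previous = cumulative[game - 1] if game > 0 else 0
    start = offsets[game] + (index - previous)
    return start, start + window


# Toy shard: three games of 5, 7 and 8 frames, window of 4 frames.
toy_offsets, toy_total, toy_window = [0, 5, 12], 20, 4
bounds = locate_window(2, toy_offsets, toy_total, toy_window)
```

With a loaded shard (for example via `torch.load`), the same bounds could slice every feature tensor, along the lines of `{name: t[start:end] for name, t in shard["states"].items()}`. A training loop would typically cache the cumulative counts per shard rather than recompute them on every lookup.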
## Preprocessing

All preprocessing is baked into the shards:

- **Categorical encoding** via `cat_maps.json` (ports, costumes, action states, projectile subtypes)
- **Normalization** via `norm_stats.json` (per-column z-score standardization)
- **Stick discretization** via `stick_clusters.json` (30 K-means clusters for main stick, 4 bins for L/R triggers)
- **C-stick** encoded as 5-way cardinal direction (neutral/up/down/left/right)
- **Self-controller inputs excluded**: the model learns purely from game state, eliminating train/inference distribution shift

## Metadata files

| File | Description |
|------|-------------|
| `tensor_manifest.json` | Shard list, game counts, frame counts, train/val split |
| `norm_stats.json` | Per-column mean and standard deviation |
| `cat_maps.json` | Dynamic categorical mappings |
| `stick_clusters.json` | K-means cluster centers for stick positions and shoulder triggers |

## Usage

```python
from huggingface_hub import snapshot_download

snapshot_download("erickfm/mimic-melee", repo_type="dataset", local_dir="data/full")
```

Or use the one-command setup:

```bash
git clone https://github.com/erickfm/MIMIC && cd MIMIC
bash setup.sh --run
```

For a smaller version for quick experiments, see [erickfm/mimic-melee-subset](https://huggingface.co/datasets/erickfm/mimic-melee-subset).

## Related

- [MIMIC](https://github.com/erickfm/MIMIC): Imitation-learning bot trained on this data
- [slippi-public-dataset-v3.7](https://huggingface.co/datasets/erickfm/slippi-public-dataset-v3.7): Raw source replays
- [slippi-frame-extractor](https://github.com/erickfm/slippi-frame-extractor): .slp to parquet converter
- [Slippi](https://slippi.gg/): Melee netplay client

## License

CC0 1.0 (public domain).
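As a rough illustration of three steps listed under Preprocessing (z-score standardization, nearest-cluster stick discretization, and the 5-way C-stick encoding), the sketch below uses only the standard library. Everything specific in it is an assumption rather than the dataset's actual pipeline: the function names, the toy cluster centers (the real set has 30), the stick coordinates being centered at (0, 0), and the `dead_zone` threshold. The real schemas of `norm_stats.json` and `stick_clusters.json` are not described here, so no file parsing is shown.

```python
import math


def zscore(value, mean, std):
    """Standardize one column value; std is assumed to be non-zero."""
    return (value - mean) / std


def nearest_cluster(point, centers):
    """Index of the center closest to a 2-D stick position (Euclidean)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# Class order follows the card: neutral/up/down/left/right.
C_STICK_CLASSES = ("neutral", "up", "down", "left", "right")


def c_stick_class(x, y, dead_zone=0.3):
    """5-way cardinal class for a C-stick position.

    Assumptions: coordinates are centered at (0, 0), and dead_zone is a
    hypothetical radius below which the stick counts as neutral.
    """
    if math.hypot(x, y) < dead_zone:
        return 0
    if abs(y) >= abs(x):
        return 1 if y > 0 else 2
    return 3 if x < 0 else 4


# Toy centers standing in for the K-means centers of the main stick.
toy_centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
stick_label = nearest_cluster((0.9, 0.1), toy_centers)
```

Because these transforms are already baked into the shards, code like this would only matter at inference time, when live game state and model outputs have to be mapped through the same statistics and cluster centers.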
Provider:
erickfm