siddharthmb/2026.transcoder-adapters.lmsys-chat-1m-splits

Name: siddharthmb/2026.transcoder-adapters.lmsys-chat-1m-splits
Creator: siddharthmb
Published: 2026-03-04 23:24:24
License: No description provided

Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/siddharthmb/2026.transcoder-adapters.lmsys-chat-1m-splits


Description:

---
language:
- en
license: other
source_datasets:
- lmsys/lmsys-chat-1m
tags:
- lmsys
- chat
- train-val-split
---

# LMSYS-Chat Train/Val Split

Derived from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m).

## Methodology

This dataset was created by **excluding** all LMSYS rows that were used in a prior training run, then splitting the remaining rows into train and val sets.

### How training rows were identified

1. **MixedDataset interleaving** (seed=80): The original training mixed [science-of-finetuning/fineweb-1m-sample](https://huggingface.co/datasets/science-of-finetuning/fineweb-1m-sample) and [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) with equal 50/50 weights using `torch.multinomial` + per-dataset `torch.randperm` pools.
2. **Subsample** (seed=81): From the mixed dataset of 1,900,000 total rows, 10,000 were selected via `torch.randperm(total)[:dataset_rows]`.
3. **LMSYS extraction**: Of those 10,000 mixed rows, 5,042 mapped to LMSYS indices. These are excluded from both splits.

### Split construction

The 994,958 unused LMSYS rows were shuffled (seed=42) and partitioned:

- **val**: first 100,000 rows
- **train**: remaining 900,000 rows

## Splits

| Split | Rows |
|-------|------|
| train | 900,000 |
| val | 100,000 |

## Reproduction parameters

```json
{
  "source": "lmsys/lmsys-chat-1m",
  "fineweb_path": "science-of-finetuning/fineweb-1m-sample",
  "mixed_seed": 80,
  "dataloader_seed": 81,
  "dataset_rows": 10000,
  "split_seed": 42,
  "val_size": 100000,
  "n_lmsys_used_in_training": 5042,
  "n_train": 900000,
  "n_val": 100000
}
```
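The exclude-then-split procedure described under "Split construction" can be illustrated with a minimal sketch on toy sizes. The function name `make_splits`, the toy row counts, and the use of Python's `random.Random` for shuffling are assumptions made for illustration; the card does not state which shuffle routine was used, so this sketch will not reproduce the published row order.

```python
import random


def make_splits(n_rows, used_indices, val_size, seed):
    """Drop rows used in a prior training run, shuffle the rest, cut val/train.

    Hypothetical helper: the name and signature are not from the dataset card.
    """
    used = set(used_indices)
    # Keep only source row indices that were NOT seen during training.
    unused = [i for i in range(n_rows) if i not in used]
    # Assumption: a seeded stdlib shuffle stands in for the card's unspecified shuffle.
    rng = random.Random(seed)
    rng.shuffle(unused)
    # The first `val_size` shuffled rows become val; everything after is train.
    return {"val": unused[:val_size], "train": unused[val_size:]}


# Toy example: 1,000 source rows, 3 of which were used in training.
splits = make_splits(n_rows=1000, used_indices=[3, 14, 159], val_size=100, seed=42)
print(len(splits["val"]), len(splits["train"]))
```

Because the used indices are filtered out before shuffling, neither split can contain a row from the prior training run, and slicing a single shuffled list keeps val and train disjoint.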

Provider:

siddharthmb
