siddharthmb/2026.transcoder-adapters.lmsys-chat-1m-splits
Hugging Face | Updated 2026-03-04 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/siddharthmb/2026.transcoder-adapters.lmsys-chat-1m-splits
Description:
---
language:
- en
license: other
source_datasets:
- lmsys/lmsys-chat-1m
tags:
- lmsys
- chat
- train-val-split
---
# LMSYS-Chat Train/Val Split
Derived from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m).
## Methodology
This dataset was created by **excluding** all LMSYS rows that were used in a
prior training run, then splitting the remaining rows into train and val sets.
### How training rows were identified
1. **MixedDataset interleaving** (seed=80): The original training run
mixed [science-of-finetuning/fineweb-1m-sample](https://huggingface.co/datasets/science-of-finetuning/fineweb-1m-sample)
and [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
with equal 50/50 weights, using `torch.multinomial` together with per-dataset
`torch.randperm` pools.
2. **Subsample** (seed=81): From the mixed dataset of
1,900,000 total rows, 10,000 were selected via
`torch.randperm(total)[:dataset_rows]`.
3. **LMSYS extraction**: Of those 10,000 mixed rows,
5,042 mapped to LMSYS indices. These are excluded from both splits.
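
The three steps above can be sketched on toy sizes. This is a rough illustration, not the original training code: the function names `build_mixed_index` and `lmsys_rows_used` are hypothetical, the standard library's `random` stands in for `torch.multinomial` and `torch.randperm` (so it will not reproduce the actual indices), and the fallback when one pool runs out is an assumption.

```python
import random


def build_mixed_index(n_fineweb, n_lmsys, total, seed):
    """Return `total` (source, row_index) pairs; source 0 = FineWeb, 1 = LMSYS.

    Toy stand-in for the MixedDataset: a fair coin plays the role of
    torch.multinomial with 50/50 weights, and a shuffled list per source
    plays the role of a torch.randperm pool.
    """
    if total > n_fineweb + n_lmsys:
        raise ValueError("total exceeds the number of available rows")
    rng = random.Random(seed)
    pools = [list(range(n_fineweb)), list(range(n_lmsys))]
    for pool in pools:
        rng.shuffle(pool)
    cursors = [0, 0]
    mixed = []
    for _ in range(total):
        src = rng.randrange(2)
        if cursors[src] == len(pools[src]):
            src = 1 - src  # assumption: fall back when a pool runs dry
        mixed.append((src, pools[src][cursors[src]]))
        cursors[src] += 1
    return mixed


def lmsys_rows_used(mixed, dataset_rows, seed):
    """Subsample `dataset_rows` mixed rows and return the LMSYS indices hit."""
    rng = random.Random(seed)
    order = list(range(len(mixed)))
    rng.shuffle(order)  # plays the role of torch.randperm(total)
    picked = order[:dataset_rows]
    return sorted(idx for src, idx in (mixed[i] for i in picked) if src == 1)


# Toy sizes; the card's real values are 1,900,000 mixed rows and 10,000 picks.
mixed = build_mixed_index(n_fineweb=1000, n_lmsys=1000, total=1900, seed=80)
used = lmsys_rows_used(mixed, dataset_rows=100, seed=81)
print(len(used), "of 100 subsampled rows came from the LMSYS pool")
```

Because each pool is a permutation consumed once, no LMSYS index can appear twice, so the number of excluded rows equals the number of subsampled rows drawn from LMSYS.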
### Split construction
The 994,958 unused LMSYS rows were shuffled
(seed=42) and partitioned:
- **val**: first 100,000 rows
- **train**: remaining 900,000 rows
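
The partition can be sketched as follows, again on toy sizes. The helper name `split_unused` is hypothetical, and `random.Random(seed).shuffle` is an assumed stand-in for whichever shuffle the original script used, so the resulting order will differ from the published splits.

```python
import random


def split_unused(n_rows, used, val_size, seed=42):
    """Shuffle the rows not in `used`; return (train, val) index lists."""
    excluded = set(used)
    unused = [i for i in range(n_rows) if i not in excluded]
    random.Random(seed).shuffle(unused)
    # First `val_size` shuffled rows go to val, the rest to train.
    return unused[val_size:], unused[:val_size]


# Toy example: 1,000 source rows, 50 of them already used in training.
train_idx, val_idx = split_unused(n_rows=1000, used=range(50), val_size=100)
print(len(train_idx), len(val_idx))  # 850 100
```

The two index lists are disjoint by construction, and neither contains an excluded row.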
## Splits
| Split | Rows |
|-------|------|
| train | 900,000 |
| val | 100,000 |
## Reproduction parameters
```json
{
  "source": "lmsys/lmsys-chat-1m",
  "fineweb_path": "science-of-finetuning/fineweb-1m-sample",
  "mixed_seed": 80,
  "dataloader_seed": 81,
  "dataset_rows": 10000,
  "split_seed": 42,
  "val_size": 100000,
  "n_lmsys_used_in_training": 5042,
  "n_train": 900000,
  "n_val": 100000
}
```
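
A quick arithmetic check on these counts may be useful before relying on them. The sketch below assumes the source dataset has 1,000,000 rows (`SOURCE_ROWS` is an assumption, though it agrees with 994,958 + 5,042) and copies only the count fields from the parameters above.

```python
import json

params = json.loads("""
{
  "val_size": 100000,
  "n_lmsys_used_in_training": 5042,
  "n_train": 900000,
  "n_val": 100000
}
""")

SOURCE_ROWS = 1_000_000  # assumption: row count of lmsys/lmsys-chat-1m

n_unused = SOURCE_ROWS - params["n_lmsys_used_in_training"]
n_after_val = n_unused - params["val_size"]

print("unused rows:", n_unused)              # 994958
print("rows left after val:", n_after_val)   # 894958
print("reported n_train:", params["n_train"])
```

Under that assumption, 894,958 rows remain after the val split, which is 5,042 fewer than the reported `n_train` of 900,000. The card does not explain the difference, so it may be worth confirming the split sizes against the published data.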
Provider:
siddharthmb



