arvindcr4/tinker-rl-bench-wandb
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/arvindcr4/tinker-rl-bench-wandb
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- reinforcement-learning
tags:
- wandb
- rlhf
- grpo
- ppo
- reinforcement-learning-from-human-feedback
- training-logs
language:
- en
pretty_name: TinkerRL-Bench W&B Run Archive
size_categories:
- 1K<n<10K
configs:
  - config_name: runs
    data_files: runs.jsonl
  - config_name: history
    data_files: history.jsonl
---
# TinkerRL-Bench W&B Run Archive
Full export of every Weights & Biases run under the `arvindcr4-pes-university`
entity, covering the experiments reported in our NeurIPS submission
*"A Unified Benchmark for RL Post-Training of Language Models"*
([repo](https://github.com/pes-llm-research/tinker-rl-lab)).
## Contents
| File | Rows | Description |
|------|------|-------------|
| `runs.jsonl` | 334 | One record per run: `project`, `run_id`, `run_name`, `state`, `config`, `summary`, `tags`, `url`, `runtime` |
| `history.jsonl` | 9,255 | Per-step metric history (`_step`, `reward`, loss, accuracy, etc.) joined to `run_id` |
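
The table above suggests that history rows are keyed by `run_id`, so run-level fields such as `project` or `state` would need to be joined in from `runs.jsonl`. A minimal sketch with pandas, assuming `run_id` is unique within `runs.jsonl` (not verified here). The helper name `attach_run_metadata` is hypothetical, and the toy frames only stand in for the real files:

```python
import pandas as pd

def attach_run_metadata(history_df, runs_df, columns=("project", "state")):
    """Left-join selected run-level columns onto per-step history rows."""
    meta = runs_df[["run_id", *columns]]
    # validate="many_to_one" raises MergeError if run_id repeats on the runs side,
    # which would otherwise silently duplicate history rows.
    return history_df.merge(meta, on="run_id", how="left", validate="many_to_one")

# Toy data standing in for runs.jsonl / history.jsonl (values are made up).
runs_df = pd.DataFrame({
    "run_id": ["a1", "b2"],
    "project": ["tinker-rl-scaling", "skyrl-tinker"],
    "state": ["finished", "crashed"],
})
history_df = pd.DataFrame({
    "run_id": ["a1", "a1", "b2"],
    "_step": [0, 1, 0],
    "reward": [0.1, 0.3, 0.2],
})
joined = attach_run_metadata(history_df, runs_df)
```

With the real data, the same call would take `history.to_pandas()` and `runs.to_pandas()` as inputs. If `run_id` turns out not to be unique across projects, the merge fails loudly instead of inflating the row count.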
## Projects covered
| Project | Runs | What it contains |
|---|---|---|
| `tinker-rl-lab-world-class` | 171 | Frontier/architectural GSM8K campaigns (Kimi-K2, GPT-OSS-20B, Qwen3-235B, DeepSeek-V3.1, Nemotron-120B, Llama-8B-Instruct, MoE variants) |
| `tinker-structural-ceiling` | 72 | Structural-ceiling sweep across Qwen3 / Llama / Gemma base + instruct, learning-rate and group-size ablations |
| `tinker-rl-scaling` | 88 | Scaling / seed ablations of Qwen3 {0.6B, 1.7B, 4B, 8B, 14B, 30B-MoE} on GSM8K |
| `skyrl-tinker` | 3 | Qwen3-8B tool-use SkyRL runs |
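
The per-project counts above can be checked against a loaded copy of the runs table. The following is a sketch rather than part of the original card: the helper name `count_mismatches` is hypothetical, and it assumes the `project` column holds exactly the project names listed in the table.

```python
import pandas as pd

# Run counts as listed in the table above.
EXPECTED_RUNS = {
    "tinker-rl-lab-world-class": 171,
    "tinker-structural-ceiling": 72,
    "tinker-rl-scaling": 88,
    "skyrl-tinker": 3,
}

# The four counts add up to the 334 rows reported for runs.jsonl.
total = sum(EXPECTED_RUNS.values())

def count_mismatches(runs_df, expected=EXPECTED_RUNS):
    """Return {project: (expected, actual)} for each project whose count differs."""
    actual = runs_df["project"].value_counts().to_dict()
    projects = sorted(set(expected) | set(actual))
    return {
        p: (expected.get(p, 0), actual.get(p, 0))
        for p in projects
        if expected.get(p, 0) != actual.get(p, 0)
    }
```

Calling `count_mismatches(runs.to_pandas())` on the loaded `runs` config should return an empty dict if the export matches the table.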
## How to load
```python
import pandas as pd
from datasets import load_dataset

runs = load_dataset("arvindcr4/tinker-rl-bench-wandb", "runs", split="train")
history = load_dataset("arvindcr4/tinker-rl-bench-wandb", "history", split="train")

# Mean reward over the last 10 logged steps of each finished run.
# Assumes `state` stores W&B's "finished" value verbatim.
df_r = runs.to_pandas()
finished = df_r.loc[df_r["state"] == "finished", "run_id"]
df_h = history.to_pandas().dropna(subset=["_step", "reward"])
df_h = df_h[df_h["run_id"].isin(finished)]
per_run = (
    df_h.sort_values("_step")
    .groupby("run_id").tail(10)
    .groupby("run_id")["reward"].mean()
)
```
## Citation
```bibtex
@misc{tinkerrlbench2026,
title = {A Unified Benchmark for RL Post-Training of Language Models},
author = {Arvind, C. R. and Jeyaraj, Sandhya},
year = {2026},
note = {NeurIPS submission, see https://github.com/pes-llm-research/tinker-rl-lab}
}
```
## License
Apache 2.0.
Provider:
arvindcr4



