lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache
Resource description:
---
license: apache-2.0
language:
- en
tags:
- kl-divergence
- distillation
- swe-bench
- swe-gym
- imitation-learning
- teacher-rollouts
size_categories:
- n<1K
---
# Qwen3-Coder-30B Teacher Rollouts on SWE-Gym Train-Eval-100 (Diverse-Sampled)
Per-task teacher rollouts from **Qwen3-Coder-30B-A3B-Instruct** on the
100-task `swe_gym_train_eval_100` held-out evaluation split, dumped with
**diverse sampling** (T=0.7, top_p=0.9) so they can be used as a fixed
reference distribution for offline KL / JSD divergence evaluation of
4B-class students that share the Qwen3 tokenizer.
This cache was used to compare a vanilla `Qwen3-4B-Instruct-2507`
checkpoint against several SFT-distilled iterations (`sft-iter0/1/2/3`)
in the Online DAgger paper experiments.
## What's in this repo
```
swe_gym_train_eval_100/
    <instance_id>.pt   # 100 files, one per task, ~200-500 KB each
manifest.json          # listing of all (dataset, instance_id, path)
```
Each `.pt` is a Python dict (loadable with `torch.load(path, map_location="cpu")`):
| Key | Type | Shape | Description |
|---|---|---|---|
| `instance_id` | `str` | — | SWE-Gym instance id (e.g. `getmoto__moto-7168`) |
| `dataset` | `str` | — | always `swe_gym_train_eval_100` |
| `tokens` | `list[int]` | (P+R,) | full sequence: prompt + teacher response |
| `prompt_len` | `int` | — | length of the prompt portion `P` |
| `response_length` | `int` | — | length of the response portion `R` |
| `loss_mask` | `list[int]` | (R,) | 0/1 — which response positions are assistant-generated (the only positions that should contribute to KL) |
| `rollout_log_probs` | `list[float]` | (R,) | teacher's per-token log-prob `log p(y_t \| y_<t, x)` for each response position |
| `reward_score` | `float` | — | task reward (1.0 = solved, else 0.0) |
| `reward_extra` | `dict` | — | auxiliary reward fields if any |
| `metadata` | `dict` | — | `{instance_id, _eval_dataset_name, task_type, _termination_reason, actions_used, _sample_idx?}` |
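The shapes in the table imply a few invariants that can be checked before a record is used. The following is a minimal sketch, assuming the record is a plain dict with the keys listed above; `check_record` and the toy record are hypothetical and not part of the repository.

```python
def check_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record looks consistent."""
    problems = []
    p, r = record["prompt_len"], record["response_length"]

    # tokens holds prompt + response, so its length should be P + R.
    if len(record["tokens"]) != p + r:
        problems.append(f"tokens has {len(record['tokens'])} ids, expected {p + r}")

    # loss_mask and rollout_log_probs are per response position, so length R.
    for key in ("loss_mask", "rollout_log_probs"):
        if len(record[key]) != r:
            problems.append(f"{key} has {len(record[key])} entries, expected {r}")

    # The mask is documented as 0/1.
    if any(m not in (0, 1) for m in record["loss_mask"]):
        problems.append("loss_mask contains values other than 0/1")

    # A log-probability cannot be positive (small tolerance for float noise).
    if any(lp > 1e-6 for lp in record["rollout_log_probs"]):
        problems.append("rollout_log_probs contains positive values")

    return problems


# Toy record with made-up values, standing in for the dict returned by torch.load.
example = {
    "tokens": [1, 2, 3, 4, 5],
    "prompt_len": 2,
    "response_length": 3,
    "loss_mask": [1, 1, 0],
    "rollout_log_probs": [-0.1, -0.5, -2.0],
}
print(check_record(example))
```

Returning a list of messages rather than raising makes it easy to scan all 100 files and report every inconsistent record in one pass.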
## How the rollouts were generated
- **Teacher**: `Qwen/Qwen3-Coder-30B-A3B-Instruct` served via SGLang (TP=4)
- **Sampling**: T=0.7, top_p=0.9 (diverse — distinguishes student/teacher distributions even on tasks where both arrive at similar greedy outputs)
- **Agent loop**: rich-info SWE agent (`swe_agent` from `stacx_eval_rich_info`) with rock-managed SWE-Bench sandboxes
- **Per-token log-probs**: extracted directly from SGLang `/generate` with `return_logprob=True`
- **Tokenizer**: shared with all Qwen3-4B-Instruct variants — these cached `tokens` are valid token ids for any 4B-Instruct student's embedding matrix
Teacher per-task accuracy on this set: **30/100 (30%)**. Outcome
breakdown: 30 solved, 48 finish_unresolved, 19 ctx_overflow, 1 budget,
2 other.
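The tally above could be recomputed from the cached records along these lines. This is a sketch under assumptions: it takes a list of already-loaded record dicts, counts `reward_score == 1.0` as solved, and groups by `metadata["_termination_reason"]`. The reason strings in the toy data are placeholders, since the card does not document the exact values stored in that field, and `summarize_outcomes` is a hypothetical helper.

```python
from collections import Counter


def summarize_outcomes(records: list[dict]) -> tuple[int, float, Counter]:
    """Count solved tasks, compute accuracy, and tally termination reasons."""
    solved = sum(1 for r in records if r["reward_score"] == 1.0)
    accuracy = solved / len(records) if records else 0.0
    reasons = Counter(
        r["metadata"].get("_termination_reason", "unknown") for r in records
    )
    return solved, accuracy, reasons


# Toy records with placeholder reason strings (not taken from the real cache).
toy_records = [
    {"reward_score": 1.0, "metadata": {"_termination_reason": "finish"}},
    {"reward_score": 0.0, "metadata": {"_termination_reason": "finish"}},
    {"reward_score": 0.0, "metadata": {"_termination_reason": "ctx_overflow"}},
]
solved, accuracy, reasons = summarize_outcomes(toy_records)
print(solved, round(accuracy, 3), dict(reasons))
```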
## Using this cache for KL/JSD evaluation
The `rl_engine.rollout.kl_eval_compute` module in
[`stacx_eval_rich_info`](https://github.com/lichangh20/stacx_eval_rich_info)
loads each `.pt` and computes:
- **KL_fwd** = `KL(student || teacher)` — uses `student_logprobs` (force-decoded by the student on `tokens`) vs cached `rollout_log_probs`
- **KL_back** = `KL(teacher || student)` via fresh student rollouts on the same prompt + cached teacher responses
- **JSD** = `0.5·KL_fwd + 0.5·KL_back` after mixing log-probs (Schulman K3 estimator)
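For intuition, the K3 estimator on a single cached record can be sketched in a few lines. This is an illustration only, not the `kl_eval_compute` implementation: `masked_k3` is a hypothetical helper, and the student log-probs are assumed to have been obtained separately by force-decoding the cached `tokens`. With tokens sampled from one model (the "sampler") and scored by another, K3 averages `(r - 1) - log r` with `r = p_other / p_sampler`, which estimates the KL divergence whose expectation is taken under the sampler. How that maps onto the `KL_fwd` / `KL_back` names above depends on the module's convention.

```python
import math


def masked_k3(sampler_logprobs, other_logprobs, loss_mask):
    """Mean K3 estimate over positions where loss_mask is 1.

    Tokens are assumed to be drawn from the sampler distribution, so the
    result estimates an expectation taken under the sampler.
    """
    total, count = 0.0, 0
    for lp_sampler, lp_other, m in zip(sampler_logprobs, other_logprobs, loss_mask):
        if not m:
            continue  # skip positions that were not assistant-generated
        log_ratio = lp_other - lp_sampler        # log r
        total += math.exp(log_ratio) - 1.0 - log_ratio  # (r - 1) - log r >= 0
        count += 1
    return total / count if count else 0.0


# Made-up log-probs for a 3-token response; the last position is masked out.
teacher_lp = [-0.1, -1.0, -2.0]   # would come from record["rollout_log_probs"]
student_lp = [-0.1, -1.5, -0.3]   # hypothetical force-decoded student scores
mask = [1, 1, 0]                  # would come from record["loss_mask"]
print(masked_k3(teacher_lp, student_lp, mask))
```

Each per-token term is non-negative, so the estimate cannot go below zero even on short responses, which is the usual reason for preferring K3 over the plain mean log-ratio.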
Quick-start:
```bash
huggingface-cli download lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache \
    --repo-type dataset \
    --local-dir /path/to/kl_cache
# Then in your eval script:
export KL_EVAL_TEACHER_CACHE_DIR=/path/to/kl_cache
# Eval will pick up the cache and emit kl_fwd_mean / kl_back_mean / jsd_mean
# alongside the usual rollout accuracy metrics.
```
Or load a single `.pt` file directly:
```python
import torch
record = torch.load(
    "kl_cache/swe_gym_train_eval_100/getmoto__moto-7168.pt",
    map_location="cpu",
)
print(record["tokens"][:5], record["rollout_log_probs"][:5])
```
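Once a record is loaded, the prompt and response can be separated with `prompt_len`, and the teacher log-probs restricted to the positions that `loss_mask` marks as assistant-generated. The sketch below assumes the layout described in the schema table; `split_record` is a hypothetical helper and the toy record uses made-up values in place of a real `torch.load` result.

```python
def split_record(record: dict):
    """Split tokens into prompt/response and keep only masked (token, log-prob) pairs."""
    p = record["prompt_len"]
    prompt_ids = record["tokens"][:p]
    response_ids = record["tokens"][p:]
    # Keep positions whose loss_mask is 1, i.e. the ones that should count toward KL.
    scored = [
        (tok, lp)
        for tok, lp, m in zip(
            response_ids, record["rollout_log_probs"], record["loss_mask"]
        )
        if m
    ]
    return prompt_ids, response_ids, scored


# Toy record standing in for the dict returned by torch.load.
toy = {
    "tokens": [11, 12, 13, 14, 15],
    "prompt_len": 2,
    "response_length": 3,
    "loss_mask": [1, 0, 1],
    "rollout_log_probs": [-0.25, -1.0, -0.75],
}
prompt_ids, response_ids, scored = split_record(toy)
mean_teacher_lp = sum(lp for _, lp in scored) / len(scored)
print(prompt_ids, response_ids, scored, mean_teacher_lp)
```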
## Compatibility
- **Tokenizer**: Qwen3 vocab (151k). Compatible students:
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `lichangh20/qwen3-4b-instruct-sft-swegym-iter{0,1,2,3}` (and other Qwen3-4B SFT variants)
- **Dataset**: 100 instance ids drawn from the SWE-Gym training set and held out from training. The prompts are the same as those in `swe_gym_train_eval_100.jsonl` in [`lichangh20/stacx-swe-online-dagger-data`](https://huggingface.co/datasets/lichangh20/stacx-swe-online-dagger-data).
## Citation
If you use this cache, please cite the Online DAgger paper (NeurIPS 2026 submission, in preparation).
Provider:
lichangh20



