lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache

Name: lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache
Creator: lichangh20
Published: 2026-04-28 02:12:20
License: apache-2.0 (per the dataset card metadata; the listing page itself gives no license description)

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache

Description:

---
license: apache-2.0
language:
- en
tags:
- kl-divergence
- distillation
- swe-bench
- swe-gym
- imitation-learning
- teacher-rollouts
size_categories:
- n<1K
---

# Qwen3-Coder-30B Teacher Rollouts on SWE-Gym Train-Eval-100 (Diverse-Sampled)

Per-task teacher rollouts from **Qwen3-Coder-30B-A3B-Instruct** on the 100-task `swe_gym_train_eval_100` held-out evaluation split. The rollouts were dumped with **diverse sampling** (T=0.7, top_p=0.9) so they can serve as a fixed reference distribution for offline KL / JSD divergence evaluation of 4B-class students that share the Qwen3 tokenizer.

This cache was used to compare a vanilla `Qwen3-4B-Instruct-2507` checkpoint against several SFT-distilled iterations (`sft-iter0/1/2/3`) in the Online DAgger paper experiments.

## What's in this repo

```
swe_gym_train_eval_100/
  <instance_id>.pt    # 100 files, one per task, ~200-500 KB each
manifest.json         # listing of all (dataset, instance_id, path)
```

Each `.pt` is a Python dict (loadable with `torch.load(path, map_location="cpu")`):

| Key | Type | Shape | Description |
|---|---|---|---|
| `instance_id` | `str` | - | SWE-Gym instance id (e.g. `getmoto__moto-7168`) |
| `dataset` | `str` | - | always `swe_gym_train_eval_100` |
| `tokens` | `list[int]` | (P+R,) | full sequence: prompt + teacher response |
| `prompt_len` | `int` | - | length of the prompt portion `P` |
| `response_length` | `int` | - | length of the response portion `R` |
| `loss_mask` | `list[int]` | (R,) | 0/1 flags marking which response positions are assistant-generated (the only positions that should contribute to KL) |
| `rollout_log_probs` | `list[float]` | (R,) | teacher's per-token log-prob `log p(y_t \| y_<t, x)` for each response position |
| `reward_score` | `float` | - | task reward (1.0 = solved, else 0.0) |
| `reward_extra` | `dict` | - | auxiliary reward fields, if any |
| `metadata` | `dict` | - | `{instance_id, _eval_dataset_name, task_type, _termination_reason, actions_used, _sample_idx?}` |

## How the rollouts were generated

- **Teacher**: `Qwen/Qwen3-Coder-30B-A3B-Instruct` served via SGLang (TP=4)
- **Sampling**: T=0.7, top_p=0.9 (diverse sampling distinguishes student and teacher distributions even on tasks where both arrive at similar greedy outputs)
- **Agent loop**: rich-info SWE agent (`swe_agent` from `stacx_eval_rich_info`) with rock-managed SWE-Bench sandboxes
- **Per-token log-probs**: extracted directly from SGLang `/generate` with `return_logprob=True`
- **Tokenizer**: shared with all Qwen3-4B-Instruct variants, so the cached `tokens` are valid token ids for any 4B-Instruct student's embedding matrix

Teacher per-task accuracy on this set: **30/100 (30 %)**. Outcome breakdown: 30 solved, 48 finish_unresolved, 19 ctx_overflow, 1 budget, 2 other.
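The record layout described in the table can be sanity-checked before any divergence is computed. The following is a minimal sketch, not part of the repository: the helper names `check_record` and `response_tokens` are hypothetical, and the example record is synthetic with made-up values. It only assumes the key names and shape relations stated in the table (`tokens` has length P+R; `loss_mask` and `rollout_log_probs` have length R).

```python
def check_record(record):
    """Return a list of schema problems for one cached record (empty if none).

    Hypothetical helper: it checks only the shape relations stated in the
    table, i.e. len(tokens) == P + R and len(loss_mask) ==
    len(rollout_log_probs) == R.
    """
    p = record["prompt_len"]
    r = record["response_length"]
    problems = []
    if len(record["tokens"]) != p + r:
        problems.append("tokens length is not prompt_len + response_length")
    if len(record["loss_mask"]) != r:
        problems.append("loss_mask length is not response_length")
    if len(record["rollout_log_probs"]) != r:
        problems.append("rollout_log_probs length is not response_length")
    if any(m not in (0, 1) for m in record["loss_mask"]):
        problems.append("loss_mask contains values other than 0/1")
    return problems


def response_tokens(record):
    """Slice off the prompt so the result aligns with the (R,)-shaped fields."""
    return record["tokens"][record["prompt_len"]:]


# Synthetic record with made-up values, used only to exercise the helpers.
example = {
    "instance_id": "example__repo-1",
    "dataset": "swe_gym_train_eval_100",
    "tokens": [11, 12, 13, 21, 22, 23, 24],
    "prompt_len": 3,
    "response_length": 4,
    "loss_mask": [1, 1, 0, 1],
    "rollout_log_probs": [-0.2, -1.5, 0.0, -0.7],
    "reward_score": 0.0,
    "reward_extra": {},
    "metadata": {},
}

print(check_record(example))
print(response_tokens(example))
```

For a real file, the same helpers could be applied to the dict returned by `torch.load(path, map_location="cpu")`.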
## Using this cache for KL/JSD evaluation

The `rl_engine.rollout.kl_eval_compute` module in [`stacx_eval_rich_info`](https://github.com/lichangh20/stacx_eval_rich_info) loads each `.pt` and computes:

- **KL_fwd** = `KL(student || teacher)`, using `student_logprobs` (force-decoded by the student on `tokens`) vs cached `rollout_log_probs`
- **KL_back** = `KL(teacher || student)` via fresh student rollouts on the same prompt + cached teacher responses
- **JSD** = `0.5·KL_fwd + 0.5·KL_back` after mixing log-probs (Schulman K3 estimator)

Quick-start:

```bash
huggingface-cli download lichangh20/qwen3-coder-30b-swegym-train-eval-100-kl-cache \
  --repo-type dataset \
  --local-dir /path/to/kl_cache

# Then in your eval script:
export KL_EVAL_TEACHER_CACHE_DIR=/path/to/kl_cache
# Eval will pick up the cache and emit kl_fwd_mean / kl_back_mean / jsd_mean
# alongside the usual rollout accuracy metrics.
```

Or load a single `.pt` directly:

```python
import torch

# Each file is a plain Python dict; see the key table for the schema.
record = torch.load("kl_cache/swe_gym_train_eval_100/getmoto__moto-7168.pt", map_location="cpu")
print(record["tokens"][:5], record["rollout_log_probs"][:5])
```

## Compatibility

- **Tokenizer**: Qwen3 vocab (151k). Compatible students:
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `lichangh20/qwen3-4b-instruct-sft-swegym-iter{0,1,2,3}` (and other Qwen3-4B SFT variants)
- **Dataset**: 100 instance_ids drawn from the SWE-Gym training set and held out from training. Same prompts as the `swe_gym_train_eval_100.jsonl` in [`lichangh20/stacx-swe-online-dagger-data`](https://huggingface.co/datasets/lichangh20/stacx-swe-online-dagger-data).

## Citation

If you use this cache, please cite the Online DAgger paper (NeurIPS 2026 submission, in preparation).
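As a supplement to the KL/JSD evaluation section above, the sketch below shows how a masked K3-style estimate could be formed from the cached teacher log-probs and a student's log-probs on the same tokens. It is an illustration under assumptions, not the `kl_eval_compute` implementation: the function names and the toy numbers are hypothetical, and `student_lps` stands in for log-probs obtained by force-decoding the cached response tokens. Because the cached tokens were sampled from the teacher, the average is taken under the teacher's distribution; how that maps onto the `KL_fwd` / `KL_back` labels is defined by the repository's module, which this sketch does not reproduce.

```python
import math


def k3_per_token(teacher_lp, student_lp):
    """K3 estimator term for one token sampled from the teacher.

    With log_ratio = log p_student - log p_teacher, the term is
    exp(log_ratio) - 1 - log_ratio, which is always >= 0.
    """
    log_ratio = student_lp - teacher_lp
    return math.expm1(log_ratio) - log_ratio


def masked_k3_mean(teacher_lps, student_lps, loss_mask):
    """Average the K3 terms over positions where loss_mask == 1."""
    if not (len(teacher_lps) == len(student_lps) == len(loss_mask)):
        raise ValueError("all three sequences must have length R")
    total, count = 0.0, 0
    for t_lp, s_lp, keep in zip(teacher_lps, student_lps, loss_mask):
        if keep:
            total += k3_per_token(t_lp, s_lp)
            count += 1
    # Convention chosen for this sketch: an all-zero mask yields 0.0.
    return total / count if count else 0.0


# Toy values (made up): in practice teacher_lps would be
# record["rollout_log_probs"] and loss_mask would be record["loss_mask"].
teacher_lps = [-0.1, -2.0, -0.5]
student_lps = [-0.1, -1.0, -0.5]
loss_mask = [1, 1, 0]

print(masked_k3_mean(teacher_lps, student_lps, loss_mask))
```

The estimate is zero when the student reproduces the teacher's log-probs on every unmasked position, and positions with `loss_mask == 0` never contribute, matching the table's note that only assistant-generated positions should count toward KL.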
