JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct
Source: Hugging Face | Updated: 2025-11-24 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- zh
tags:
- speculative-decoding
- specforge
- qwen
- regenerated
pretty_name: OPC Regenerated with Qwen3-Coder-30B-A3B-Instruct
size_categories:
- 1M<n<10M
---
# OPC Regenerated Dataset (Qwen3-Coder-30B-A3B-Instruct)
This dataset is a version of the OPC training dataset in which the assistant responses have been regenerated with **Qwen3-Coder-30B-A3B-Instruct**, the target model.
## Purpose
Regenerating the training data with the target model aligns the draft model more closely with the target model's output distribution, which improves acceptance rates and overall speculative decoding performance in [SpecForge](https://github.com/sgl-project/SpecForge).
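
To make the link between distribution alignment and acceptance rate concrete, here is a minimal sketch. It assumes the standard speculative-sampling acceptance rule, in which a token `x` drawn from the draft distribution `q` is accepted with probability `min(1, p(x) / q(x))` against the target distribution `p`. The toy distributions are made up for illustration and are not measurements from this dataset or from SpecForge.

```python
def expected_acceptance(target_probs, draft_probs):
    """Expected probability that one drafted token is accepted.

    Assumes the standard speculative-sampling rule: a token x drawn from the
    draft distribution q is accepted with probability min(1, p(x) / q(x)),
    which gives an expected acceptance of sum_x min(p(x), q(x)).
    """
    if len(target_probs) != len(draft_probs):
        raise ValueError("distributions must cover the same vocabulary")
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))


# Toy 3-token vocabulary (illustrative numbers only).
target = [0.7, 0.2, 0.1]
well_aligned_draft = [0.6, 0.3, 0.1]
poorly_aligned_draft = [0.1, 0.2, 0.7]

print(expected_acceptance(target, well_aligned_draft))
print(expected_acceptance(target, poorly_aligned_draft))
```

Under this rule the expected acceptance equals one minus the total variation distance between the two distributions, so a draft model trained on text that the target model itself produced should tend to score higher.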
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total entries | 1,023,233 |
| File size | 6.4 GB |
| Average response length | 5,596 chars |
| Median response length | 3,928 chars |
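
The card does not state how these figures were computed. A plausible sketch, assuming "response length" means the character count of each assistant message's `content` field (as in the Data Format section), is:

```python
import statistics


def response_lengths(entries):
    """Yield the character length of every assistant message."""
    for entry in entries:
        for turn in entry["conversations"]:
            if turn["role"] == "assistant":
                yield len(turn["content"])


# Tiny hand-written sample standing in for the real JSONL entries.
sample_entries = [
    {"id": "a", "conversations": [
        {"role": "user", "content": "Q1"},
        {"role": "assistant", "content": "x" * 10},
    ]},
    {"id": "b", "conversations": [
        {"role": "user", "content": "Q2"},
        {"role": "assistant", "content": "x" * 30},
    ]},
    {"id": "c", "conversations": [
        {"role": "user", "content": "Q3"},
        {"role": "assistant", "content": "x" * 80},
    ]},
]

lengths = list(response_lengths(sample_entries))
print("average:", statistics.mean(lengths))
print("median:", statistics.median(lengths))
```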
### Response Length Distribution
| Category | Count | Percentage |
|----------|-------|------------|
| Long (>2000 chars) | 787,964 | 77.01% |
| Medium (501-2000 chars) | 202,848 | 19.82% |
| Short (101-500 chars) | 29,481 | 2.88% |
| Very Short (≤100 chars) | 2,940 | 0.29% |
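
The bucket boundaries in the table can be expressed as a small helper. This is a sketch of the categorization only; the function name is hypothetical and the original tooling may differ.

```python
from collections import Counter


def length_category(num_chars):
    """Map a response length in characters to the table's category name."""
    if num_chars <= 100:
        return "Very Short"
    if num_chars <= 500:
        return "Short"
    if num_chars <= 2000:
        return "Medium"
    return "Long"


# Count categories for a few example lengths (illustrative values).
example_lengths = [50, 100, 101, 500, 501, 2000, 2001, 6000]
distribution = Counter(length_category(n) for n in example_lengths)
for category, count in distribution.items():
    print(f"{category}: {count} ({count / len(example_lengths):.2%})")
```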
## Generation Configuration
- **Target Model**: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
- **Max Tokens**: 16,384
- **Temperature**: 0.7
- **Concurrency**: 256
- **Server**: SGLang with TP=8
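
For orientation, the per-request settings above map onto an OpenAI-style chat completion payload roughly as follows. SGLang serves an OpenAI-compatible API, but whether the regeneration script uses that endpoint is an assumption here, and `build_chat_request` is a hypothetical helper. Concurrency and TP are client-side and server-side settings respectively, so they do not appear in the payload.

```python
import json


def build_chat_request(
    user_message,
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=16384,
    temperature=0.7,
):
    """Build a chat completion payload using the card's generation settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_request("Write a function that reverses a string.")
print(json.dumps(payload, indent=2))
```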
## Scripts Used
This dataset was generated using SpecForge's data regeneration pipeline. Below are the scripts used:
### 1. SGLang Server Launch Script (`launch_sglang_tp8.sh`)
```bash
#!/bin/bash
# Launch SINGLE SGLang server using all 8 H200 GPUs with TP=8
SESSION_NAME="sglang_tp8"
# Kill existing session if it exists
tmux kill-session -t $SESSION_NAME 2>/dev/null
# Create new session
tmux new-session -d -s $SESSION_NAME "
echo 'Starting SGLang Server with TP=8 on all 8 GPUs - Port 30000' && \
FLASHINFER_DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m sglang.launch_server \
--model Qwen/Qwen3-Coder-30B-A3B-Instruct \
--tp 8 \
--cuda-graph-bs 1 2 4 8 16 32 64 128 \
--dtype bfloat16 \
--mem-frac 0.8 \
--port 30000
"
echo "Started SGLang server (TP=8) in tmux session: $SESSION_NAME"
echo "Server: localhost:30000"
echo ""
echo "To attach: tmux attach -t $SESSION_NAME"
echo "To detach: Ctrl+B, then D"
```
### 2. Data Regeneration Script (`run_regenerate_tmux.sh`)
```bash
#!/bin/bash
# Regenerate OPC dataset with Qwen3-Coder in tmux session
SESSION_NAME="opc_regen"
# Kill existing session if it exists
tmux kill-session -t $SESSION_NAME 2>/dev/null
# Create new tmux session and run the command
tmux new-session -d -s $SESSION_NAME "
python scripts/regenerate_train_data.py \
--model Qwen/Qwen3-Coder-30B-A3B-Instruct \
--concurrency 256 \
--max-tokens 16384 \
--server-address localhost:30000 \
--temperature 0.7 \
--input-file-path ./cache/dataset/opc_train.jsonl \
--output-file-path ./cache/dataset/opc_regenerated.jsonl
"
echo "Started regeneration in tmux session: $SESSION_NAME"
echo ""
echo "To attach: tmux attach -t $SESSION_NAME"
echo "To detach: Ctrl+B, then D"
echo "To check progress: tmux attach -t $SESSION_NAME"
```
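
The internals of `scripts/regenerate_train_data.py` are not shown in this card. As a rough sketch of what regeneration with bounded concurrency could look like, the example below replaces each assistant turn with a freshly generated one while limiting in-flight requests with a semaphore. The `generate` callable and the `fake_generate` stub are hypothetical stand-ins for a real call to the inference server.

```python
import asyncio
import copy


async def regenerate_entry(entry, generate, semaphore):
    """Return a copy of `entry` whose assistant turns are regenerated."""
    new_entry = copy.deepcopy(entry)
    history = []
    for turn in new_entry["conversations"]:
        if turn["role"] == "assistant":
            # The semaphore caps how many generations run at once.
            async with semaphore:
                turn["content"] = await generate(list(history))
        history.append(turn)
    return new_entry


async def regenerate_all(entries, generate, concurrency=256):
    """Regenerate all entries with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [regenerate_entry(e, generate, semaphore) for e in entries]
    return await asyncio.gather(*tasks)


async def fake_generate(messages):
    """Hypothetical stand-in for a request to the inference server."""
    await asyncio.sleep(0)
    return "regenerated: " + messages[-1]["content"]


entries = [{"id": "demo-1", "conversations": [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "old answer"},
]}]
result = asyncio.run(regenerate_all(entries, fake_generate, concurrency=2))
print(result[0]["conversations"][1]["content"])
```

Working on a deep copy leaves the input entries untouched, which makes it easy to compare the original and regenerated responses afterwards.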
## Data Format
Each line is a JSON object with the following structure:
```json
{
"id": "unique_id",
"conversations": [
{"role": "user", "content": "User message..."},
{"role": "assistant", "content": "Regenerated assistant response..."}
]
}
```
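
A minimal sketch for parsing and sanity-checking one line of the JSONL file follows. The set of allowed roles is an assumption based on the example above; the real data may contain other roles (for example `system`), in which case the `allowed_roles` argument should be widened.

```python
import json


def parse_entry(line, allowed_roles=("user", "assistant")):
    """Parse one JSONL line and check it matches the documented structure."""
    entry = json.loads(line)
    if not isinstance(entry, dict) or "id" not in entry:
        raise ValueError("entry must be an object with an 'id' field")
    conversations = entry.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        raise ValueError("'conversations' must be a non-empty list")
    for turn in conversations:
        if not isinstance(turn, dict):
            raise ValueError("each conversation turn must be an object")
        if turn.get("role") not in allowed_roles:
            raise ValueError(f"unexpected role: {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError("each turn needs a string 'content' field")
    return entry


line = json.dumps({
    "id": "unique_id",
    "conversations": [
        {"role": "user", "content": "User message..."},
        {"role": "assistant", "content": "Regenerated assistant response..."},
    ],
})
entry = parse_entry(line)
print(entry["id"], len(entry["conversations"]))
```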
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct")
```
## Citation
If you use this dataset, please cite SpecForge:
```bibtex
@misc{specforge,
title={SpecForge: Speculative Decoding with Learned Draft Models},
url={https://github.com/sgl-project/SpecForge},
}
```
## License
Apache 2.0
Provided by:
JinnP



