JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct

Source: Hugging Face. Updated 2025-11-24. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- zh
tags:
- speculative-decoding
- specforge
- qwen
- regenerated
pretty_name: OPC Regenerated with Qwen3-Coder-30B-A3B-Instruct
size_categories:
- 1M<n<10M
---

# OPC Regenerated Dataset (Qwen3-Coder-30B-A3B-Instruct)

This dataset is a regenerated version of the OPC training dataset in which the assistant responses have been regenerated using **Qwen3-Coder-30B-A3B-Instruct** as the target model.

## Purpose

In speculative decoding, a small draft model proposes tokens that the target model then verifies. Regenerating the training data with the target model aligns the draft model more closely with the target model's output distribution, which improves acceptance rates and overall speculative decoding performance in [SpecForge](https://github.com/sgl-project/SpecForge).

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total entries | 1,023,233 |
| File size | 6.4 GB |
| Average response length | 5,596 chars |
| Median response length | 3,928 chars |

### Response Length Distribution

| Category | Count | Percentage |
|----------|-------|------------|
| Long (>2000 chars) | 787,964 | 77.01% |
| Medium (501-2000 chars) | 202,848 | 19.82% |
| Short (101-500 chars) | 29,481 | 2.88% |
| Very Short (≤100 chars) | 2,940 | 0.29% |

## Generation Configuration

- **Target Model**: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
- **Max Tokens**: 16,384
- **Temperature**: 0.7
- **Concurrency**: 256
- **Server**: SGLang with TP=8

## Scripts Used

This dataset was generated using SpecForge's data regeneration pipeline. Below are the scripts used:

### 1. SGLang Server Launch Script (`launch_sglang_tp8.sh`)

```bash
#!/bin/bash
# Launch SINGLE SGLang server using all 8 H200 GPUs with TP=8

SESSION_NAME="sglang_tp8"

# Kill existing session if it exists
tmux kill-session -t "$SESSION_NAME" 2>/dev/null

# Create new session
tmux new-session -d -s "$SESSION_NAME" "
echo 'Starting SGLang Server with TP=8 on all 8 GPUs - Port 30000' && \
FLASHINFER_DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m sglang.launch_server \
    --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --tp 8 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
    --dtype bfloat16 \
    --mem-frac 0.8 \
    --port 30000
"

echo "Started SGLang server (TP=8) in tmux session: $SESSION_NAME"
echo "Server: localhost:30000"
echo ""
echo "To attach: tmux attach -t $SESSION_NAME"
echo "To detach: Ctrl+B, then D"
```

### 2. Data Regeneration Script (`run_regenerate_tmux.sh`)

```bash
#!/bin/bash
# Regenerate OPC dataset with Qwen3-Coder in tmux session

SESSION_NAME="opc_regen"

# Kill existing session if it exists
tmux kill-session -t "$SESSION_NAME" 2>/dev/null

# Create new tmux session and run the command
tmux new-session -d -s "$SESSION_NAME" " \
python scripts/regenerate_train_data.py \
    --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --concurrency 256 \
    --max-tokens 16384 \
    --server-address localhost:30000 \
    --temperature 0.7 \
    --input-file-path ./cache/dataset/opc_train.jsonl \
    --output-file-path ./cache/dataset/opc_regenerated.jsonl
"

echo "Started regeneration in tmux session: $SESSION_NAME"
echo ""
echo "To attach: tmux attach -t $SESSION_NAME"
echo "To detach: Ctrl+B, then D"
echo "To check progress: tmux attach -t $SESSION_NAME"
```

## Data Format

Each line is a JSON object with the following structure:

```json
{
  "id": "unique_id",
  "conversations": [
    {"role": "user", "content": "User message..."},
    {"role": "assistant", "content": "Regenerated assistant response..."}
  ]
}
```

## Usage

```python
from datasets import load_dataset

# Repository ID as it appears in this dataset's Hugging Face URL
dataset = load_dataset("JinnP/opc_regen_Qwen3-Coder-30B-A3B-Instruct")
```
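The record layout and the length buckets described in this card can be checked locally with a short script. The following is a minimal sketch, not the script used to produce the published statistics: the helper names (`length_bucket`, `last_assistant_response`, `bucket_counts`) are hypothetical, and it assumes that "response length" means the character count of the last assistant turn in each record.

```python
import json
from collections import Counter
from typing import Iterable


def length_bucket(n_chars: int) -> str:
    """Map a character count to a bucket, using the boundaries from the card's table."""
    if n_chars <= 100:
        return "Very Short"
    if n_chars <= 500:
        return "Short"
    if n_chars <= 2000:
        return "Medium"
    return "Long"


def last_assistant_response(record: dict) -> str:
    """Return the content of the last assistant turn, or "" if there is none.

    Assumption: the published statistics measure this turn; the card does not say.
    """
    for turn in reversed(record["conversations"]):
        if turn["role"] == "assistant":
            return turn["content"]
    return ""


def bucket_counts(lines: Iterable[str]) -> Counter:
    """Count records per length bucket from an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        counts[length_bucket(len(last_assistant_response(record)))] += 1
    return counts


if __name__ == "__main__":
    # Two synthetic records in the documented format (not real dataset rows).
    sample = [
        json.dumps({"id": "a", "conversations": [
            {"role": "user", "content": "hi"},
            {"role": "assistant", "content": "x" * 600},
        ]}),
        json.dumps({"id": "b", "conversations": [
            {"role": "user", "content": "hi"},
            {"role": "assistant", "content": "ok"},
        ]}),
    ]
    print(bucket_counts(sample))
```

To run it against a local copy of the data, pass an open file handle instead of the sample list, for example `with open("./cache/dataset/opc_regenerated.jsonl") as f: print(bucket_counts(f))`. If the assumption about which turn is measured is wrong, the counts will not match the table exactly.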

## Citation

If you use this dataset, please cite SpecForge:

```bibtex
@misc{specforge,
  title={SpecForge: Speculative Decoding with Learned Draft Models},
  url={https://github.com/sgl-project/SpecForge},
}
```

## License

Apache 2.0
Provider:
JinnP