longertime/sn120-dpo-training
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/longertime/sn120-dpo-training
Description:
# SN120 LoRA DPO Training Package
This package runs LoRA DPO training with the goal of outperforming the top miner on Bittensor Subnet 120 (SN120).
## Data
- `data/dpo_train.jsonl`: 5,983 deduplicated DPO pairs built from real validator scores
- Strategy: best-of-all-miners (up to 3 unique chosen x 3 unique rejected responses per task, so at most 9 pairs per task)
- Covers all 6 environments: GAME (1,014), LGC-v2 (983), PRINT (765), LIVEWEB (1,166), NAVWORLD (1,545), SWE-INFINITE (510)
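The per-environment counts above can be checked against the stated total, and the JSONL file can be inspected before training. The sketch below is a minimal example that assumes only that `data/dpo_train.jsonl` holds one JSON object per line. The record field names are not documented here, so the helper (`summarize_jsonl`, a hypothetical name) reports whatever keys it finds instead of assuming a schema.

```python
import json
from collections import Counter
from pathlib import Path

# Pair counts per environment, copied from the list above.
PAIRS_PER_ENV = {
    "GAME": 1014,
    "LGC-v2": 983,
    "PRINT": 765,
    "LIVEWEB": 1166,
    "NAVWORLD": 1545,
    "SWE-INFINITE": 510,
}


def summarize_jsonl(path):
    """Return (record_count, Counter of sorted key tuples) for a JSONL file."""
    count = 0
    key_sets = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)  # assumption: each line is a JSON object
            count += 1
            key_sets[tuple(sorted(record))] += 1
    return count, key_sets


if __name__ == "__main__":
    print("sum of per-environment pairs:", sum(PAIRS_PER_ENV.values()))
    data_file = Path("data/dpo_train.jsonl")
    if data_file.exists():
        n, keys = summarize_jsonl(data_file)
        print("records:", n)
        print("key sets:", dict(keys))
```

The six counts sum to 5,983, matching the stated total. If the record count reported for the file differs, the file and this card are out of sync.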
## Quick Start on GPU VPS (8x H200)
```bash
# 1. Upload this folder to your GPU VPS
scp -r /root/sn120-dpo-training root@GPU_VPS_IP:/root/
# 2. SSH into GPU VPS
ssh root@GPU_VPS_IP
# 3. Get your base model (choose one):
# Option A: Pull top miner
# af pull 103 --model-path /root/BaseModel
# Option B: Download from HF
# huggingface-cli download EdmondMillion/affine-28-5CSriXZUwkoqdKBF4kqgRPBgrRiyPbLEo6TBaR3rW3u5qo4T \
# --local-dir /root/BaseModel --token YOUR_HF_TOKEN
# Option C: Use your own model
# export BASE_MODEL_PATH=/path/to/your/model
# 4. Run training (~50-80 min total: 30-60 min train + 20 min merge)
cd /root/sn120-dpo-training
bash train_dpo.sh
# 5. Upload trained model
huggingface-cli upload YOUR_USER/Affine-SN120-DPO ./model_output_dpo
# 6. Back on your original VPS (the one you uploaded from), deploy:
af chutes_push --repo YOUR_USER/Affine-SN120-DPO --revision SHA
af commit --repo YOUR_USER/Affine-SN120-DPO --revision SHA --chute-id ID
```
## Training Details
| Setting | Value |
|---------|-------|
| Method | LoRA DPO (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Trainable params | ~80M (0.25% of 32B) |
| Learning rate | 5e-5 |
| DPO beta | 0.1 |
| Batch size | 1 per GPU x 8 grad_accum x 8 GPUs = 64 effective |
| Epochs | 2 |
| Max length | 4096 |
| Precision | bfloat16 |
| DeepSpeed | ZeRO-3 |
| Est. time | ~30-60 min train + 20 min merge |
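A few quantities follow directly from the table: the effective batch size, the LoRA scaling factor alpha / rank, and an approximate optimizer step count. The sketch below also writes out the standard per-pair DPO objective so the role of beta is explicit. It is illustrative only: the step count assumes all 5,983 pairs are used for training with no held-out split, and `dpo_pair_loss` is a hypothetical helper, not code taken from `train_dpo.sh`.

```python
import math

# Values copied from the table above.
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
NUM_GPUS = 8
EPOCHS = 2
LORA_RANK = 64
LORA_ALPHA = 128
DPO_BETA = 0.1
NUM_PAIRS = 5983  # assumption: every pair is used for training

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
lora_scale = LORA_ALPHA / LORA_RANK  # multiplier applied to the low-rank update
steps_per_epoch = math.ceil(NUM_PAIRS / effective_batch)  # approximate
total_steps = steps_per_epoch * EPOCHS


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=DPO_BETA):
    """Standard DPO loss for one pair, from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print("effective batch:", effective_batch)
    print("LoRA scale:", lora_scale)
    print("approx. optimizer steps:", total_steps)
```

With these values the effective batch size is 64 and the LoRA scale is 2.0. Before any update the policy equals the reference model, so the margin is zero and the loss is log 2 (about 0.693) per pair. Training lowers it by raising the chosen response's log-ratio relative to the rejected one's, and beta controls how strongly that gap is weighted.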
## DPO Data Per Environment
| Env | Tasks w/ Signal | DPO Pairs | Chosen Threshold |
|-----|----------------|-----------|-----------------|
| GAME (3x) | 159 | 1,014 | >= 0.5 |
| LGC-v2 | 198 | 983 | >= 1.0 |
| PRINT | 122 | 765 | >= 1.0 |
| LIVEWEB | 193 | 1,166 | >= 0.5 |
| NAVWORLD | 187 | 1,545 | >= 0.5 |
| SWE-INFINITE | 126 | 510 | >= 1.0 |
| **Total** | **985** | **5,983** | |
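The thresholds in the table, combined with the best-of-all-miners rule (up to 3 unique chosen x 3 unique rejected per task), suggest a pair-construction step along the lines sketched below. This is a guess at the shape of that step, not the script that produced the dataset. The input format (a list of `(response, score)` tuples per task), the rule that any response scoring below the threshold counts as rejected, the ranking used to pick the top 3 on each side, and the output field names `prompt`, `chosen`, `rejected` are all assumptions.

```python
from itertools import product

# Chosen thresholds copied from the table above.
CHOSEN_THRESHOLD = {
    "GAME": 0.5,
    "LGC-v2": 1.0,
    "PRINT": 1.0,
    "LIVEWEB": 0.5,
    "NAVWORLD": 0.5,
    "SWE-INFINITE": 1.0,
}
MAX_PER_SIDE = 3  # "up to 3 unique chosen x 3 unique rejected per task"


def unique_top(scored, limit, best_first):
    """Keep the first `limit` distinct responses after sorting by score."""
    ordered = sorted(scored, key=lambda item: item[1], reverse=best_first)
    seen, kept = set(), []
    for response, _score in ordered:
        if response not in seen:
            seen.add(response)
            kept.append(response)
        if len(kept) == limit:
            break
    return kept


def build_pairs(env, prompt, scored_responses):
    """Build up to 3 x 3 DPO pairs for one task.

    Assumptions (not from the source): responses below the threshold are
    treated as rejected, and the lowest-scoring ones are picked first.
    """
    threshold = CHOSEN_THRESHOLD[env]
    chosen_pool = [(r, s) for r, s in scored_responses if s >= threshold]
    chosen = unique_top(chosen_pool, MAX_PER_SIDE, best_first=True)
    # Never let the same text appear on both sides of a pair.
    rejected_pool = [
        (r, s) for r, s in scored_responses if s < threshold and r not in chosen
    ]
    rejected = unique_top(rejected_pool, MAX_PER_SIDE, best_first=False)
    # A task with no chosen or no rejected response yields no pairs (no signal).
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]
```

Under this scheme a task contributes at most 9 pairs. The table's totals work out to about 6 pairs per task with signal (5,983 / 985), which is consistent with many tasks having fewer than 3 usable responses on at least one side.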
## Why LoRA Instead of Full Fine-Tuning
- 5,983 pairs against 32B trainable parameters: high overfitting risk with full fine-tuning (FFT)
- 5,983 pairs against ~80M LoRA parameters: a far smaller parameter budget, which acts as a natural regularizer
- Base model knowledge is preserved because the original weights stay frozen
- Faster (30-60 min vs 2-3 hrs) and cheaper (~$8 vs ~$20)
- The merged model has the same architecture as the base model, so there is no inference overhead
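The claim that merging adds no inference overhead can be checked numerically: a LoRA update is a low-rank product added to a frozen weight, so it can be folded into that weight once training is done. The sketch below uses tiny hand-written matrices in plain Python (hypothetical values, rank 1 for readability) and the scale alpha / rank = 128 / 64 from the settings table. It illustrates the algebra only and is not the merge script shipped in this package.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


SCALE = 128 / 64  # lora_alpha / lora_rank from the settings table

# Toy shapes: W is 2x3 (out x in), B is 2x1, A is 1x3, i.e. rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[0.5],
     [-1.0]]
A = [[1.0, 2.0, 0.0]]
x = [[1.0], [2.0], [3.0]]  # one input column vector

# Adapter kept separate: base path plus a scaled low-rank path (extra matmuls).
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), SCALE)

# Adapter merged: fold the update into W once, then run a single matmul.
W_merged = add(W, matmul(B, A), SCALE)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give `[[12.0], [-11.0]]`, and `W_merged` has the same shape as `W`, which is why the merged checkpoint loads and runs like the base model. PEFT exposes this folding for LoRA layers as `merge_and_unload()`; whether this package's merge step uses it is not stated here.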
Provider:
longertime



