longertime/sn120-dpo-training

Name: longertime/sn120-dpo-training
Creator: longertime
Published: 2026-03-26 20:39:32
License: No description provided

Hugging Face, updated 2026-03-26, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/longertime/sn120-dpo-training

Description:

# SN120 LoRA DPO Training Package

LoRA DPO training to beat the top miner on Bittensor Subnet 120.

## Data

- `data/dpo_train.jsonl`: 5,983 deduped DPO pairs from real validator scores
- Strategy: best-of-all-miners (up to 3 unique chosen x 3 unique rejected per task)
- Covers all 6 environments: GAME (1,014), LGC-v2 (983), PRINT (765), LIVEWEB (1,166), NAVWORLD (1,545), SWE-INFINITE (510)

## Quick Start on GPU VPS (8x H200)

```bash
# 1. Upload this folder to your GPU VPS
scp -r /root/sn120-dpo-training root@GPU_VPS_IP:/root/

# 2. SSH into GPU VPS
ssh root@GPU_VPS_IP

# 3. Get your base model (choose one):
# Option A: Pull top miner
# af pull 103 --model-path /root/BaseModel
# Option B: Download from HF
# huggingface-cli download EdmondMillion/affine-28-5CSriXZUwkoqdKBF4kqgRPBgrRiyPbLEo6TBaR3rW3u5qo4T \
#   --local-dir /root/BaseModel --token YOUR_HF_TOKEN
# Option C: Use your own model
# export BASE_MODEL_PATH=/path/to/your/model

# 4. Run training (~1 hour total: 30-60 min train + 20 min merge)
cd /root/sn120-dpo-training
bash train_dpo.sh

# 5. Upload trained model
huggingface-cli upload YOUR_USER/Affine-SN120-DPO ./model_output_dpo

# 6. Back on your VPS, deploy:
af chutes_push --repo YOUR_USER/Affine-SN120-DPO --revision SHA
af commit --repo YOUR_USER/Affine-SN120-DPO --revision SHA --chute-id ID
```

## Training Details

| Setting | Value |
|---------|-------|
| Method | LoRA DPO (PEFT) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Trainable params | ~80M (0.25% of 32B) |
| Learning rate | 5e-5 |
| DPO beta | 0.1 |
| Batch size | 1 x 8 grad_accum x 8 GPUs = 64 effective |
| Epochs | 2 |
| Max length | 4096 |
| Precision | bfloat16 |
| DeepSpeed | ZeRO-3 |
| Est. time | ~30-60 min train + 20 min merge |

## DPO Data Per Environment

| Env | Tasks w/ Signal | DPO Pairs | Chosen Threshold |
|-----|----------------|-----------|-----------------|
| GAME (3x) | 159 | 1,014 | >= 0.5 |
| LGC-v2 | 198 | 983 | >= 1.0 |
| PRINT | 122 | 765 | >= 1.0 |
| LIVEWEB | 193 | 1,166 | >= 0.5 |
| NAVWORLD | 187 | 1,545 | >= 0.5 |
| SWE-INFINITE | 126 | 510 | >= 1.0 |
| **Total** | **985** | **5,983** | |

## Why LoRA Instead of Full Fine-Tuning

- 5,983 pairs / 32B params = extreme overfitting risk with FFT
- 5,983 pairs / 80M LoRA params = healthy ratio, natural regularization
- Base model knowledge preserved (frozen weights)
- Faster (30-60 min vs 2-3 hrs), cheaper (~$8 vs $20)
- Merged model is identical architecture -- no inference overhead
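The "best-of-all-miners" strategy described under Data (up to 3 unique chosen x 3 unique rejected per task) could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual builder: the function name `build_pairs`, the `prompt`/`chosen`/`rejected` field names, and the rule that a rejected response is any response scoring below the chosen threshold are all assumptions, since the README does not specify them.

```python
from itertools import product


def build_pairs(prompt, responses, threshold, k=3):
    """Build up to k x k DPO pairs for one task.

    responses: list of (text, score) tuples collected from all miners.
    threshold: per-environment chosen threshold (e.g. 0.5 or 1.0).
    """
    # Deduplicate by response text, keeping the best score seen for each.
    best = {}
    for text, score in responses:
        if text not in best or score > best[text]:
            best[text] = score

    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)

    # Top-k unique responses that meet the threshold.
    chosen = [text for text, score in ranked if score >= threshold][:k]
    # Assumption: rejected = below threshold, lowest-scoring first.
    rejected = [text for text, score in reversed(ranked) if score < threshold][:k]

    # Cartesian product; empty if either side is empty (task has no signal).
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]
```

Under this sketch a task contributes at most 9 pairs and contributes none when every response lands on the same side of the threshold, which is consistent with the table counting only "Tasks w/ Signal" (5,983 pairs over 985 tasks is about 6 pairs per task).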
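For readers unfamiliar with what the "DPO beta" setting controls, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It shows the textbook formula only; `train_dpo.sh` is not shown in this listing, so how the package actually computes the loss is not confirmed here, and the function name `dpo_loss` is hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2 regardless of beta. A smaller beta, such as the 0.1 listed in Training Details, scales the margin down, so the policy can drift further from the reference before the loss saturates.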
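The batch-size and dataset figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes the final partial batch of each epoch is kept rather than dropped and that no pairs are filtered out by the max-length limit; neither detail is stated in the README, so the real optimizer step count may differ slightly.

```python
import math

# Pair counts per environment, copied from the "DPO Data Per Environment" table.
PAIRS_PER_ENV = {
    "GAME": 1014,
    "LGC-v2": 983,
    "PRINT": 765,
    "LIVEWEB": 1166,
    "NAVWORLD": 1545,
    "SWE-INFINITE": 510,
}


def training_steps(num_pairs, per_device_batch, grad_accum, num_gpus, epochs):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Assumption: the last partial batch still triggers an optimizer step.
    steps_per_epoch = math.ceil(num_pairs / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


num_pairs = sum(PAIRS_PER_ENV.values())
effective, per_epoch, total = training_steps(
    num_pairs, per_device_batch=1, grad_accum=8, num_gpus=8, epochs=2
)
print(num_pairs, effective, per_epoch, total)  # 5983 64 94 188
```

Fewer than 200 optimizer steps in total makes the run short, in line with the 30-60 minute training estimate.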

Provider: longertime
