
LocoreMind/msswift-locotrainer-trajectories-208

Hugging Face · Updated 2026-03-15 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/LocoreMind/msswift-locotrainer-trajectories-208
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
- zh
tags:
- code
- tool-use
- distillation
- ms-swift
- locotrainer
size_categories:
- n<1K
---

# MS-SWIFT LocoTrainer Trajectories Dataset

Distillation dataset containing 208 high-quality code analysis trajectories generated by [LocoTrainer-4B](https://huggingface.co/LocoreMind/LocoTrainer-4B) analyzing the [MS-SWIFT](https://github.com/modelscope/ms-swift) codebase.

## 📊 Dataset Summary

This dataset captures expert-level MS-SWIFT framework knowledge through multi-turn tool-calling conversations. Each trajectory demonstrates how LocoTrainer-4B explores codebases using Read, Grep, Glob, Write, and Bash tools to answer complex questions about MS-SWIFT training, deployment, and optimization.

**Perfect for:**
- 🎓 Training smaller models to understand MS-SWIFT
- 🛠️ Learning tool-use patterns for code analysis
- 📚 Long-context training (avg 45k tokens/sample)
- 🔬 Studying agent behavior on real codebases

## 📈 Statistics

- **Total Samples**: 208
- **Total Conversations**: 22,800 messages
- **Estimated Tokens**: 9.3M
- **Average Turns**: 54.1 per trajectory
- **Average Length**: ~45k tokens per sample
- **Format**: ShareGPT JSONL (MS-SWIFT native)

## 🗂️ Category Distribution

| Category | Count | % |
|----------|-------|---|
| Model Support | 39 | 18.8% |
| Training Methods | 35 | 16.8% |
| Optimization & Performance | 26 | 12.5% |
| Inference & Deployment | 25 | 12.0% |
| CLI & Configuration | 22 | 10.6% |
| Data Processing | 20 | 9.6% |
| Hardware & Distributed | 18 | 8.7% |
| Quantization & Export | 14 | 6.7% |
| Advanced Features | 9 | 4.3% |

## 📝 Data Format

Each line in `train.jsonl` contains one trajectory in ShareGPT format:

```json
{
  "conversations": [
    {"from": "system", "value": "System prompt with tool definitions..."},
    {"from": "human", "value": "How do I prepare a preference dataset for DPO training?"},
    {"from": "gpt", "value": "I'll help you...<tool_call>{\"name\":\"Read\",\"arguments\":{\"file_path\":\"/workspace/ms-swift/docs/source/...\"}}...</tool_call>"},
    {"from": "human", "value": "<tool_response>File contents...</tool_response>"},
    {"from": "gpt", "value": "Based on the documentation..."}
  ],
  "query_id": "msswift_0001",
  "category": "training_methods",
  "subcategory": "dpo_data_preparation",
  "tools": ["Read", "Grep", "Glob", "Write", "Bash"],
  "metadata": {
    "turns": 100,
    "model": "LocoreMind/LocoTrainer-4B",
    "elapsed_seconds": 66.13
  }
}
```

### Conversation Roles

- **`system`**: Agent prompt with tool definitions and instructions
- **`human`**: User query OR tool execution results (`<tool_response>`)
- **`gpt`**: Assistant response with reasoning and tool calls (`<tool_call>`)

## 🚀 Usage

### Load with Datasets Library

```python
from datasets import load_dataset

dataset = load_dataset("LocoreMind/msswift-locotrainer-trajectories-208")
print(dataset['train'][0])
```

### Train with MS-SWIFT

```bash
swift sft \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --dataset LocoreMind/msswift-locotrainer-trajectories-208 \
  --train_type full \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --learning_rate 1e-5 \
  --gradient_accumulation_steps 4 \
  --max_length 32768 \
  --output_dir output/locotrainer-distill
```

### LoRA Fine-Tuning (Memory Efficient)

```bash
swift sft \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --dataset LocoreMind/msswift-locotrainer-trajectories-208 \
  --train_type lora \
  --lora_rank 32 \
  --lora_alpha 64 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 2 \
  --learning_rate 5e-5 \
  --max_length 32768 \
  --output_dir output/locotrainer-lora
```

## 🔍 Data Quality

### Turn Distribution

- **1-10 turns** (17.8%): Simple, focused queries
- **11-30 turns** (31.7%): Medium complexity
- **31-50 turns** (12.0%): Complex analysis
- **100 turns** (38.5%): Maximum complexity (hit limit)

**Note**: 80 samples reached the max_turns=100 limit, indicating they required extensive code exploration.
These are the most comprehensive but may contain some repetitive patterns.

### Filtering Options

If you need higher quality / shorter samples:

```python
from datasets import load_dataset

dataset = load_dataset("LocoreMind/msswift-locotrainer-trajectories-208")

# Filter to samples with <= 50 turns
filtered = dataset['train'].filter(lambda x: x['metadata']['turns'] <= 50)
print(f"Filtered: {len(filtered)} samples")  # ~128 samples
```

## 🎯 Use Cases

1. **Knowledge Distillation**: Train smaller models to replicate LocoTrainer-4B's MS-SWIFT expertise
2. **Tool-Use Learning**: Learn structured tool-calling patterns for code analysis
3. **Long-Context Training**: Practice with realistic long-context scenarios (avg 45k tokens)
4. **Domain Adaptation**: Inject MS-SWIFT framework knowledge into base models

## 📊 Expected Training Results

Based on LocoTrainer-4B's original training (361k samples on 8x H100):

- **Training Time**: ~3-5 hours for 208 samples (8x H100, full-param)
- **Context Length**: Use 32k+ to capture full trajectories
- **Performance**: Should achieve strong MS-SWIFT Q&A capabilities

## 🛠️ Generation Details

- **Teacher Model**: [LocoTrainer-4B](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- **Codebase**: [MS-SWIFT v4.0](https://github.com/modelscope/ms-swift)
- **Hardware**: 8x H100 80GB GPUs
- **Collection Time**: ~3 hours for 208 trajectories
- **Average Generation**: 14.6 minutes per trajectory
- **Framework**: [LocoTrainer](https://github.com/LocoreMind/LocoTrainer)

## 📁 Files

- `train.jsonl` - Full dataset (208 samples, 39MB)

## 🎓 Citation

```bibtex
@dataset{msswift_locotrainer_trajectories_2026,
  title={MS-SWIFT LocoTrainer Trajectories: A Distillation Dataset for Code Analysis Agents},
  author={LocoreMind},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/LocoreMind/msswift-locotrainer-trajectories-208}},
}
```

## 📄 License

Apache 2.0 - Inherits from:
- [LocoTrainer](https://github.com/LocoreMind/LocoTrainer) - MIT License
- [MS-SWIFT](https://github.com/modelscope/ms-swift) - Apache 2.0

## 🙏 Acknowledgments

- **LocoTrainer-4B**: https://huggingface.co/LocoreMind/LocoTrainer-4B
- **MS-SWIFT Framework**: https://github.com/modelscope/ms-swift
- **Qwen3**: Base model for LocoTrainer-4B
- **vLLM**: Efficient inference engine

## 🔗 Related Resources

- 🤖 [LocoTrainer-4B Model](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- 📦 [LocoTrainer Framework](https://github.com/LocoreMind/LocoTrainer)
- 🛠️ [MS-SWIFT Repository](https://github.com/modelscope/ms-swift)
- 📊 [Data Collection Scripts](https://github.com/IIIIQIIII/LocoTrainer-DataCollection)
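
## 🧪 Example: Inspecting a Trajectory

The role conventions described under Data Format and Conversation Roles can be checked with a short script. The sketch below counts messages per role and tallies which tools a trajectory calls. It is illustrative only: the helper names `tool_names` and `summarize` are hypothetical, the sample record is hand-written rather than taken from the dataset, and it assumes each `<tool_call>` payload is a JSON object with a `name` field (falling back to a regex when the payload is not strict JSON).

```python
import json
import re
from collections import Counter

# Matches the payload between <tool_call> and </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
# Fallback: pull the "name" field out of a payload that is not strict JSON.
NAME_RE = re.compile(r'"name"\s*:\s*"([^"]+)"')


def tool_names(message_value):
    """Return the tool names called in one assistant message, in order."""
    names = []
    for payload in TOOL_CALL_RE.findall(message_value):
        try:
            # Assumption: the payload is a JSON object with a "name" key.
            names.append(json.loads(payload)["name"])
        except (json.JSONDecodeError, KeyError, TypeError):
            match = NAME_RE.search(payload)
            if match:
                names.append(match.group(1))
    return names


def summarize(record):
    """Count messages per role and tool calls per tool for one trajectory."""
    roles = Counter(msg["from"] for msg in record["conversations"])
    tools = Counter()
    for msg in record["conversations"]:
        if msg["from"] == "gpt":  # only assistant messages carry tool calls
            tools.update(tool_names(msg["value"]))
    return {"roles": dict(roles), "tool_calls": dict(tools)}


# A tiny hand-written record shaped like the Data Format example (not real data).
sample = {
    "conversations": [
        {"from": "system", "value": "System prompt with tool definitions..."},
        {"from": "human", "value": "How do I prepare a preference dataset for DPO training?"},
        {"from": "gpt", "value": 'I\'ll help you.<tool_call>{"name": "Grep", "arguments": {"pattern": "dpo"}}</tool_call>'},
        {"from": "human", "value": "<tool_response>File contents...</tool_response>"},
        {"from": "gpt", "value": "Based on the documentation..."},
    ]
}

print(summarize(sample))
```

To run this over the real data, one option is to pass each row of `dataset['train']` to `summarize` and add up the resulting counters, which gives a quick view of how tool usage differs between short trajectories and those that hit the turn limit.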
Provider:
LocoreMind