LocoreMind/msswift-locotrainer-trajectories-208
Source: Hugging Face. Updated 2026-03-15; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/LocoreMind/msswift-locotrainer-trajectories-208
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
- zh
tags:
- code
- tool-use
- distillation
- ms-swift
- locotrainer
size_categories:
- n<1K
---
# MS-SWIFT LocoTrainer Trajectories Dataset
A distillation dataset of 208 high-quality code analysis trajectories, generated by [LocoTrainer-4B](https://huggingface.co/LocoreMind/LocoTrainer-4B) while analyzing the [MS-SWIFT](https://github.com/modelscope/ms-swift) codebase.
## 📊 Dataset Summary
This dataset captures expert-level MS-SWIFT framework knowledge through multi-turn tool-calling conversations. Each trajectory demonstrates how LocoTrainer-4B explores codebases using Read, Grep, Glob, Write, and Bash tools to answer complex questions about MS-SWIFT training, deployment, and optimization.
**Perfect for:**
- 🎓 Training smaller models to understand MS-SWIFT
- 🛠️ Learning tool-use patterns for code analysis
- 📚 Long-context training (avg 45k tokens/sample)
- 🔬 Studying agent behavior on real codebases
## 📈 Statistics
- **Total Samples**: 208
- **Total Messages**: 22,800 across all conversations
- **Estimated Tokens**: 9.3M
- **Average Turns**: 54.1 per trajectory
- **Average Length**: ~45k tokens per sample
- **Format**: ShareGPT JSONL (MS-SWIFT native)
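The sample, message, and turn figures above can be recomputed from `train.jsonl`. The following is a minimal sketch, assuming each line is a JSON object with a `conversations` list and a `metadata.turns` field as shown in the Data Format section; `summarize` is a hypothetical helper name, not part of any library.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Compute sample, message, and average-turn statistics from JSONL lines."""
    samples = 0
    messages = 0
    turns = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        samples += 1
        messages += len(record["conversations"])
        turns += record["metadata"]["turns"]
    return {
        "samples": samples,
        "messages": messages,
        "avg_turns": turns / samples if samples else 0.0,
    }


# Tiny in-memory demo. For the real file, pass an open file handle instead:
#   with open("train.jsonl", encoding="utf-8") as f: stats = summarize(f)
demo = [
    json.dumps({
        "conversations": [{"from": "human", "value": "q"}, {"from": "gpt", "value": "a"}],
        "metadata": {"turns": 1},
    }),
    json.dumps({
        "conversations": [{"from": "human", "value": "q"}] * 4,
        "metadata": {"turns": 3},
    }),
]
stats = summarize(demo)
print(stats)
```

Token counts are not covered here because they depend on the tokenizer used.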
## 🗂️ Category Distribution
| Category | Count | % |
|----------|-------|---|
| Model Support | 39 | 18.8% |
| Training Methods | 35 | 16.8% |
| Optimization & Performance | 26 | 12.5% |
| Inference & Deployment | 25 | 12.0% |
| CLI & Configuration | 22 | 10.6% |
| Data Processing | 20 | 9.6% |
| Hardware & Distributed | 18 | 8.7% |
| Quantization & Export | 14 | 6.7% |
| Advanced Features | 9 | 4.3% |
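A distribution like the table above can be derived from the `category` field of each record. This is a sketch under the assumption that `category` holds snake_case identifiers such as `training_methods` (as in the example record in the Data Format section); how those identifiers map to the display names in the table is not specified, and `category_shares` is a hypothetical helper.

```python
from collections import Counter


def category_shares(records):
    """Return (category, count, percent) tuples, most frequent first."""
    counts = Counter(record["category"] for record in records)
    total = sum(counts.values())
    return [
        (name, n, round(100 * n / total, 1))
        for name, n in counts.most_common()
    ]


# Small illustrative input; real records come from train.jsonl.
demo_records = [
    {"category": "training_methods"},
    {"category": "training_methods"},
    {"category": "model_support"},
    {"category": "training_methods"},
]
shares = category_shares(demo_records)
for name, n, pct in shares:
    print(f"{name}: {n} ({pct}%)")
```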
## 📝 Data Format
Each line in `train.jsonl` contains one trajectory in ShareGPT format:
```json
{
  "conversations": [
    {"from": "system", "value": "System prompt with tool definitions..."},
    {"from": "human", "value": "How do I prepare a preference dataset for DPO training?"},
    {"from": "gpt", "value": "I'll help you...<tool_call>{\"name\":\"Read\",\"arguments\":{\"file_path\":\"/workspace/ms-swift/docs/source/...\"}}...</tool_call>"},
    {"from": "human", "value": "<tool_response>File contents...</tool_response>"},
    {"from": "gpt", "value": "Based on the documentation..."}
  ],
  "query_id": "msswift_0001",
  "category": "training_methods",
  "subcategory": "dpo_data_preparation",
  "tools": ["Read", "Grep", "Glob", "Write", "Bash"],
  "metadata": {
    "turns": 100,
    "model": "LocoreMind/LocoTrainer-4B",
    "elapsed_seconds": 66.13
  }
}
```
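Tool calls are embedded in the assistant text between `<tool_call>` tags, so they can be pulled out with a small parser. The sketch below assumes each tag pair wraps a single JSON object with `name` and `arguments` keys, as in the example record; `extract_tool_calls` is a hypothetical helper, and payloads that are not valid JSON are skipped rather than repaired.

```python
import json
import re

# Non-greedy match so several calls in one message are captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the JSON payloads of all <tool_call> blocks found in text."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(body.strip()))
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
    return calls


reply = (
    "Let me search the repository. "
    '<tool_call>{"name": "Grep", "arguments": {"pattern": "dpo"}}</tool_call>'
)
calls = extract_tool_calls(reply)
print([call["name"] for call in calls])
```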
### Conversation Roles
- **`system`**: Agent prompt with tool definitions and instructions
- **`human`**: User query OR tool execution results (`<tool_response>`)
- **`gpt`**: Assistant response with reasoning and tool calls (`<tool_call>`)
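For tooling that expects `role`/`content` messages instead of ShareGPT keys, the roles above can be remapped. This is a sketch, not the dataset authors' conversion: the `ROLE_MAP` table and the choice to label `human` messages that start with `<tool_response>` as `tool` are assumptions made for illustration.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(record):
    """Convert one ShareGPT record into a list of role/content dicts."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP[turn["from"]]
        value = turn["value"]
        # Tool results arrive under the "human" speaker; separate them here.
        if turn["from"] == "human" and value.lstrip().startswith("<tool_response>"):
            role = "tool"
        messages.append({"role": role, "content": value})
    return messages


record = {
    "conversations": [
        {"from": "system", "value": "You can call tools."},
        {"from": "human", "value": "Where is DPO configured?"},
        {"from": "gpt", "value": "<tool_call>{\"name\": \"Grep\", \"arguments\": {}}</tool_call>"},
        {"from": "human", "value": "<tool_response>3 matches</tool_response>"},
    ]
}
messages = to_messages(record)
print([m["role"] for m in messages])
```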
## 🚀 Usage
### Load with Datasets Library
```python
from datasets import load_dataset

# Fetch the dataset from the Hugging Face Hub and inspect the first trajectory.
dataset = load_dataset("LocoreMind/msswift-locotrainer-trajectories-208")
print(dataset['train'][0])
```
### Train with MS-SWIFT
```bash
swift sft \
    --model Qwen/Qwen3-4B-Instruct-2507 \
    --dataset LocoreMind/msswift-locotrainer-trajectories-208 \
    --train_type full \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --max_length 32768 \
    --output_dir output/locotrainer-distill
```
### LoRA Fine-Tuning (Memory Efficient)
```bash
swift sft \
    --model Qwen/Qwen3-4B-Instruct-2507 \
    --dataset LocoreMind/msswift-locotrainer-trajectories-208 \
    --train_type lora \
    --lora_rank 32 \
    --lora_alpha 64 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-5 \
    --max_length 32768 \
    --output_dir output/locotrainer-lora
```
## 🔍 Data Quality
### Turn Distribution
- **1-10 turns** (17.8%): Simple, focused queries
- **11-30 turns** (31.7%): Medium complexity
- **31-50 turns** (12.0%): Complex analysis
- **100 turns** (38.5%): Maximum complexity (hit limit)
**Note**: 80 samples (38.5%) reached the `max_turns=100` limit, so they were stopped at the cap after extensive code exploration. These are the most comprehensive trajectories, but they may contain repetitive patterns and may end without a final answer.
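The buckets above can be reproduced from `metadata.turns`. The sketch below assumes the same bucket edges and the `max_turns=100` cap stated in the note; `turn_bucket` is a hypothetical helper, and the `51-99` bucket is included only so that no value falls through, even though the listed percentages already sum to 100%.

```python
from collections import Counter


def turn_bucket(turns, max_turns=100):
    """Map a turn count to one of the buckets used in the distribution."""
    if turns >= max_turns:
        return "hit limit"
    if turns <= 10:
        return "1-10"
    if turns <= 30:
        return "11-30"
    if turns <= 50:
        return "31-50"
    return "51-99"


# Illustrative turn counts; real values come from record["metadata"]["turns"].
demo_turns = [3, 10, 11, 30, 45, 72, 100, 100]
histogram = Counter(turn_bucket(t) for t in demo_turns)
print(dict(histogram))
```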
### Filtering Options
To keep only the shorter trajectories that finished before the turn limit:
```python
from datasets import load_dataset
dataset = load_dataset("LocoreMind/msswift-locotrainer-trajectories-208")
# Filter to samples with <= 50 turns
filtered = dataset['train'].filter(lambda x: x['metadata']['turns'] <= 50)
print(f"Filtered: {len(filtered)} samples") # ~128 samples
```
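The same filter can be applied directly to the JSONL file without the `datasets` library, which is convenient when a local training file is needed. This is a sketch assuming the record layout from the Data Format section; `filter_jsonl` and the file names are hypothetical, and the demo writes a throwaway file instead of touching the real `train.jsonl`.

```python
import json
import os
import tempfile


def filter_jsonl(src_path, dst_path, max_turns=50):
    """Copy records with metadata.turns <= max_turns; return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if record["metadata"]["turns"] <= max_turns:
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Demo on a temporary three-record file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.jsonl")
    dst = os.path.join(tmp, "train_filtered.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        for turns in (8, 42, 100):
            f.write(json.dumps({"conversations": [], "metadata": {"turns": turns}}) + "\n")
    kept = filter_jsonl(src, dst)
    with open(dst, encoding="utf-8") as f:
        kept_turns = [json.loads(line)["metadata"]["turns"] for line in f]
print(kept, kept_turns)
```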
## 🎯 Use Cases
1. **Knowledge Distillation**: Train smaller models to replicate LocoTrainer-4B's MS-SWIFT expertise
2. **Tool-Use Learning**: Learn structured tool-calling patterns for code analysis
3. **Long-Context Training**: Practice with realistic long-context scenarios (avg 45k tokens)
4. **Domain Adaptation**: Inject MS-SWIFT framework knowledge into base models
## 📊 Expected Training Results
Estimates based on LocoTrainer-4B's original training run (361k samples on 8x H100):
- **Training Time**: ~3-5 hours for 208 samples (8x H100, full-parameter)
- **Context Length**: Use a maximum length of at least 32k. The average sample is ~45k tokens, so a 32k limit does not cover every trajectory in full
- **Performance**: Expected to yield strong MS-SWIFT Q&A capabilities
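Given the ~45k-token average, it can help to check how many trajectories fit a chosen maximum length before training. The sketch below is an illustration only: `split_by_length` is a hypothetical helper, `count_tokens` is any callable you supply (in practice, the tokenizer of the model being trained), and the whitespace counter in the demo is a crude stand-in that ignores chat-template tokens.

```python
def split_by_length(records, count_tokens, max_length=32768):
    """Partition records into those within max_length tokens and those over it."""
    fits, too_long = [], []
    for record in records:
        total = sum(count_tokens(turn["value"]) for turn in record["conversations"])
        if total <= max_length:
            fits.append(record)
        else:
            too_long.append(record)
    return fits, too_long


def whitespace_tokens(text):
    """Stand-in token counter for the demo; not a real tokenizer."""
    return len(text.split())


records = [
    {"conversations": [{"from": "human", "value": "a b c"},
                       {"from": "gpt", "value": "d e"}]},
    {"conversations": [{"from": "human", "value": "a b c d e f g"}]},
]
# A tiny limit is used here so the demo shows both outcomes.
fits, too_long = split_by_length(records, whitespace_tokens, max_length=5)
print(len(fits), len(too_long))
```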
## 🛠️ Generation Details
- **Teacher Model**: [LocoTrainer-4B](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- **Codebase**: [MS-SWIFT v4.0](https://github.com/modelscope/ms-swift)
- **Hardware**: 8x H100 80GB GPUs
- **Collection Time**: ~3 hours for 208 trajectories
- **Average Generation**: 14.6 minutes per trajectory
- **Framework**: [LocoTrainer](https://github.com/LocoreMind/LocoTrainer)
## 📁 Files
- `train.jsonl` - Full dataset (208 samples, 39 MB)
## 🎓 Citation
```bibtex
@dataset{msswift_locotrainer_trajectories_2026,
  title={MS-SWIFT LocoTrainer Trajectories: A Distillation Dataset for Code Analysis Agents},
  author={LocoreMind},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/LocoreMind/msswift-locotrainer-trajectories-208}},
}
```
## 📄 License
This dataset is released under Apache 2.0. The upstream projects it derives from are licensed as follows:
- [LocoTrainer](https://github.com/LocoreMind/LocoTrainer) - MIT License
- [MS-SWIFT](https://github.com/modelscope/ms-swift) - Apache 2.0
## 🙏 Acknowledgments
- **LocoTrainer-4B**: https://huggingface.co/LocoreMind/LocoTrainer-4B
- **MS-SWIFT Framework**: https://github.com/modelscope/ms-swift
- **Qwen3**: Base model for LocoTrainer-4B
- **vLLM**: Efficient inference engine
## 🔗 Related Resources
- 🤖 [LocoTrainer-4B Model](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- 📦 [LocoTrainer Framework](https://github.com/LocoreMind/LocoTrainer)
- 🛠️ [MS-SWIFT Repository](https://github.com/modelscope/ms-swift)
- 📊 [Data Collection Scripts](https://github.com/IIIIQIIII/LocoTrainer-DataCollection)
Provider: LocoreMind



