GDC2025-DeepSeek-Qwen模型蒸馏挑战赛
收藏魔搭社区2025-02-21 更新2025-02-22 收录
下载链接:
https://modelscope.cn/datasets/Tsumugii24/GDC2025-Competition
下载链接
链接失效反馈官方服务:
资源简介:
### 下载方法
:modelscope-code[]{type="sdk"}
:modelscope-code[]{type="git"}
# Qwen2.5-3B-Code-full Technical Report
## 1. 摘要
本次**GDC2025-DeepSeek-Qwen模型蒸馏挑战赛**我们选择了**代码生成**作为微调的目标领域。
首先我们收集了当前质量相对较高的开源代码数据集,此外又从deepseek-r1中蒸馏得到5k条代码数据
然后经过自研的数据清洗工具进一步对数据去重,打分,最终筛选得到22k条高质量数据作为微调数据集
微调后的模型具备强大的**逻辑推理,指令遵循和代码生成**能力。能够理解复杂的编程领域专业术语、生成能够正常通过样例测试的规范代码。
训练环境为 **8 \* NVIDIA A800-SXM4-80GB**,两天时间内迭代了数组实验并最终**开源了两个版本的模型权重**
值得注意的是,在Human-Eval的代码能力评测Benchmark上,我们基于Qwen3B-Base预训练模型全量微调得到的两个版本均在Coding Benchmark上全面超越了Qwen2.5-3B-Instruct和Qwen2.5-Coder-3B,体现了我们对于模型蒸馏和数据清洗方法的优越性和数据质量的全面领先。
## 2. 方法
### 2.1 数据集
- **the-stack**
- | 来源 | https://huggingface.co/datasets/bigcode/the-stack |
| -------- | ------------------------------------------------- |
| 数据量 | 3.1TB |
| 详细描述 | The Stack数据集,有3.1TB的合法开源代码语料 |
- **github-code**
- | 来源 | https://huggingface.co/datasets/codeparrot/github-code |
| -------- | ------------------------------------------------------------ |
| 数据量 | 1TB |
| 详细描述 | gitHub-code数据集包含来自 GitHub 的 1.15 亿个代码文件,涵盖 32 种编程语言和 60 种扩展,总数据量达到 1TB。 |
- **livecodebench**
- | 来源 | https://huggingface.co/datasets/livecodebench/code_generation_lite |
| -------- | ------------------------------------------------------------ |
| 数据量 | 3.1k条 |
| 详细描述 | LiveCodeBench 问题收集自竞赛编程网站,特别关注问题质量、测试用例质量和问题难度多样性。该场景目前拥有来自 LeetCode、AtCoder 和 Codeforces 的 500 多个问题。每个问题实例包括问题描述、输入/输出示例和隐藏测试用例。此外,每个问题都标记了其难度级别和发布日期,这允许在不同时间窗口内衡量模型性能。目标是为每个问题实例生成正确且高效的解决方案。 |
- **code_contests**
- | 来源 | https://huggingface.co/datasets/deepmind/code_contests |
| -------- | ------------------------------------------------------------ |
| 数据量 | 3.7k条 |
| 详细描述 | CodeContests 是一个用于机器学习的竞技编程数据集。此数据集在训练 AlphaCode 时被使用。问题包括成对输入和输出的测试用例,以及各种语言中的正确和错误的人类解决方案。 |
- **deepseek-r1-distilled-code-5k**
- | 来源 | deepseek-r1-distilled |
| -------- | ------------------------------------------------------- |
| 数据量 | 5k条 |
| 详细描述 | 本次比赛中使用deepseek-r1蒸馏得到的5k条超高质量代码数据 |
### 2.2 **数据清洗**
我们的数据集Source中一共包含了4个开源的代码数据集和1个从deepseek-r1中蒸馏得到的数据集
我们使用自研的工具对数据进行预处理和清洗操作
包括对**数据的去重,打分和打标筛选,从而得到高质量的数据用于后续微调**
#### Deduplication
##### 精准去重
参考[Nemo-Curator]([NVIDIA/NeMo-Curator: Scalable data pre processing and curation toolkit for LLMs](https://github.com/NVIDIA/NeMo-Curator))部分代码
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/exact_deduplication.py \
--input_file $
input_file \
--save_as $save_as \
--device "gpu"
```
##### 语义去重
基于ChromaDB + Embedding Model实现
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/n_grams.py \
--input_file $
input_file \
--save_as $save_as \
```
#### Tagging
使用大语言模型对数据打Tag
多轮数据处理时,因为我们只对Instruction进行tagging,所以mask掉所有的response
##### Task classification
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/unitag.py \
--device $
device \
--model_name
$model_name \
--input_file $
input_file \
--tag_mission "classification" \
--batch_size $batch_size \
--api True\
--api_url $api
```
**Prompt**
```Python
PROMPT_CLASSIFICATION = """
Instruction
Given a conversation between a human user and an AI assistant where the assistant's responses are masked, please label the task tags for the user's queries.
Conversation
%s
Tagging the user input
Please label the task tags for the user queries. You will need to analyze the user queries and select the most relevant task tag from the list below.
all_task_tags = [
"Information seeking", # Users ask for specific information or facts about various topics.
"Reasoning", # Queries require logical thinking, problem-solving, or processing of complex ideas.
"Planning", # Users need assistance in creating plans or strategies for activities and projects.
"Editing", # Involves editing, rephrasing, proofreading, or other tasks related to the composition of general written content.
"Coding & Debugging", # Users seek help with writing, reviewing, or fixing code in programming.
"Math", # Queries related to mathematical concepts, problems, and calculations.
"Role playing", # Users engage in scenarios requiring LLM to adopt a character or persona.
"Data analysis", # Requests involve interpreting data, statistics, or performing analytical tasks.
"Creative writing", # Users seek assistance with crafting stories, poems, or other creative texts.
"Advice seeking", # Users ask for recommendations or guidance on various personal or professional issues.
"Brainstorming", # Involves generating ideas, creative thinking, or exploring possibilities.
"NSFW": # Queries involves unsafe instructions or generating harmful contents.
"Others" # Any queries that do not fit into the above categories or are of a miscellaneous nature.
]
Output Format:
Note that you can only select a single primary tag. Other applicable tags can be added to the list of other tags.
If there are multiple user queries in the conversation, the primary tag should be the one with the most votes.
Now, please output your tags below in a json format by filling in the placeholders in :
{{
"primary_tag": "",
"other_tags": ["", "", ... ]
}}
"""
```
##### Difficulty classification
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/unitag.py \
--device $
device \
--model_name
$model_name \
--input_file $
input_file \
--tag_mission "difficulty" \
--batch_size $batch_size \
--api True\
--api_url $api
PROMPT_DIFFICULTY = '''
Instruction
Given a conversation between a human user and an AI assistant (the assistant's responses are masked), you first need to identify the user's intents and then label the overall difficulty level of the user queries.
Conversation
%s
Output Format
In your output, you first need to identify the user's intent and the knowledge needed to solve the instructed tasks.
Then, rate the overall difficulty level of the user queries as
very easy, easy, medium, hard, or very hard.
Now, please output the user intent and difficulty level below in a json format by filling in the placeholders in []:
{{
"intent": "The user wants to [....]",
"knowledge": "To solve this problem, the models need to know [....]",
"difficulty": "[very easy/easy/medium/hard/very hard]"
}}
'''
```
##### Input quality classification
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/unitag.py \
--device $
device \
--model_name
$model_name \
--input_file $
input_file \
--tag_mission "quality" \
--batch_size $batch_size \
--api True\
--api_url $api
```
**Prompt**
```Python
PROMPT_QUALITY = '''
Instruction
Given a conversation between a human user and an AI assistant where the assistant's responses are omitted, you need to rate the overall quality of the user's queries based on its clarity, specificity, and coherence.
No need to consider the safety of the user's queries as NSFW contents are permitted here.
The rating scale is as follows:
very poor
poor
average
good
excellent
Conversation
%s
Output Format
Given the conversation, you first need to give an assesement, highlighting the strengths and/or weaknesses of the user's queries.
Then, you need to output a rating from very poor to excellent by filling in the placeholders in [...] and make sure your response is in English:
{{
"explanation": "[...]",
"input_quality": "[very poor/poor/average/good/excellent]"
}}
'''
```
##### Language detection
识别数据部分指令的语言
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/unitag.py \
--device $
device \
--input_file $input_file \
--tag_mission "language" \
```
#### Scoring
##### RM score(打分区间范围为0到10)
基于Reward Model规范对数据质量进行打分和评价,这里使用了适用于筛选的**Criteria**
**Helpfulness Truthfulness Harmlessness Correctness 四个维度**
```Python
CUDA_VISIBLE_DEVICES=
$device python ../exp/unitag.py \
--device $
device \
--input_file
$input_file \
--tag_mission "reward" \
--batch_size 1 \
--api True\
--api_url $
api
--api_reward_url $reward_api
CRITERIA= """
Helpfulness: The content generated should aim to be of assistance to the user. It should provide accurate information, answer questions to the best of the model's knowledge, and offer solutions that are practical and relevant to the user's queries. The model should not generate content that is intentionally misleading or unhelpful.
Truthfulness: The model should strive to provide information that is truthful and based on the most accurate data available up to its knowledge cutoff. It should not fabricate information or present opinions as facts. When uncertain, the model should acknowledge the limitations of its knowledge.
Harmlessness: The content should not cause harm to individuals or groups. It should avoid generating content that is offensive, discriminatory, or promotes harmful behaviors. The model should be programmed to refuse to generate content that could be used for malicious purposes, including but not limited to generating false information, engaging in deception, or promoting illegal activities.
Correctness: The content generated should be factually accurate and logically sound. It must adhere to established facts, principles, and reasoning without introducing errors or contradictions. Correctness ensures that the information provided is not only truthful but also precise, avoiding misinterpretations or inaccuracies that could arise from incomplete or flawed understanding. If the model is uncertain or lacks sufficient knowledge, it should clearly communicate this rather than risk providing incorrect information.
"""
```
#### 清洗策略说明
数据集:见2.1数据集描述
打标模型:[Qwen2.5-72B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)
对于code数据集的清洗策略如下:
1. 语义去重,阈值设定为0.86 key:"deduped"中,若值为-1 时,直接删除。
2. 检查数据中是否存在自我认知数据 如数据样本中会出现“我是xxxxx”之类的自我认知数据,直接删除。
3. 根据打标Tag清洗(RM的分数权重占比最高)
4. 选取RM打分7分以上的code样本
5. 质量Tag为poor的可删(poor但rm大于8分则保留)
6. 质量Tag为average的可删(average但rm大于9分则保留)
#### Example Result
根据以上的数据清洗策略,最终在数量巨大的代码数据集中筛选得到22k条code微调数据
示例如下:

#### 最终code微调数据集
```Plain
清洗前:见2.1数据集部分
清洗后:22199条
数据集名称:code.json
```
### 2.3 模型选择及微调策略
我们采用了以下微调策略:
- **模型架构**:Qwen2.5-3B-Base
- ```Markdown
This repo contains the base 3B Qwen2.5 model, which has the following features:
Type: Causal Language Models
Training Stage: Pretraining
Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
Number of Parameters: 3.09B
Number of Paramaters (Non-Embedding): 2.77B
Number of Layers: 36
Number of Attention Heads (GQA): 16 for Q and 2 for KV
Context Length: Full 32,768 tokens
```
- **微调技术**:全参数微调
- **微调框架:LLaMA-Factory**
- **训练设置**:训练超参数设置如下
- ```Markdown
Training hyperparameters
The following hyperparameters were used during training:
learning_rate: 1e-05
train_batch_size: 4
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 128
total_eval_batch_size: 4
optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0
Training results
Framework versions
Transformers 4.48.2
Pytorch 2.5.1+cu124
Datasets 3.2.0
Tokenizers 0.21.0
```
- **微调后的模型权重**:
- V1: Qwen2.5-3B-Code-full-0218
- https://modelscope.cn/models/Tsumugii24/Qwen2.5-3B-Code-full-0218
- V2: Qwen2.5-3B-Code-full-0219
- https://modelscope.cn/models/Tsumugii24/Qwen2.5-3B-Code-full-0219
- 备注,一共选取了两版基于Qwen2.5-3B训练的模型,其中V1对应Qwen2.5-3B-Code-full-0218,V2对应Qwen2.5-3B-Code-full-0219,V2中除了第一版模型中的代码数据之外,还额外加入部分高质量的Math Category数据
- **Training Loss**:
- V1: Qwen2.5-3B-Code-full-0218
- 
- V2: Qwen2.5-3B-Code-full-0219
- 
- **SwanLab** **Log**:
- V1: Qwen2.5-3B-Code-full-0218
- https://swanlab.cn/@Tsumugii24/llamafactory/runs/d4rjg2mr2q24sa7fq6znu/chart
- 
- V2: Qwen2.5-3B-Code-full-0219
- https://swanlab.cn/@Tsumugii24/llamafactory/runs/uihw5chtdj36roevac8hp/chart
- 
## 3. 实验结果
### 3.1 评估指标
我们在**HumanEval评测集**上对模型的代码能力进行评测。HumanEval是OpenAI团队开源的一个手写代码评测集,专门用于评估训练在编程任务上的大规模语言模型的能力。该评测集由164个手动编写的编程题目构成,题目覆盖了各种常见的算法问题,旨在考察模型是否能够生成正确且高效的代码。
我们使用了 **pass@k** 指标评估模型的表现。它衡量模型在生成 `k` 个代码样本时,至少有一个样本通过测试的概率。
pass@k 的计算公式如下:
- $$\text{pass@k} = \mathbb{E}_{\text{Problems}} \left[ 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right]$$
其中:
- n 为每个问题生成的代码样本总数。(**我们在评测时取n=5**)
- c 为通过测试的代码样本个数。
- k 为我们关心的前 k 个样本。
最终,pass@k 取所有问题的平均值,衡量模型整体性能。
我们的采样参数设置如下:
| tempreture | top_p | max_len |
| ---------- | ----- | ------- |
| 0 | 1 | 1024 |
### 3.2 Benchmark表现
- **基线模型表现**:
- Qwen2.5-Coder-3B-Instruct 在代码生成任务上的表现。
- | pass@1 | pass@3 | pass@5 |
| ------ | ------ | ------ |
| 11.46% | 19.51% | 23.78% |
- Qwen2.5-Coder-3B 在代码生成任务上的表现。
- | pass@1 | pass@3 | pass@5 |
| ------ | ------ | ------ |
| 15.85% | 32.13% | 41.46% |
- Qwen2.5-3B-Instruct 在代码生成任务上的表现。
- | pass@1 | pass@3 | pass@5 |
| ------ | ------ | ------ |
| 15.73% | 31.28% | 40.85% |
- **微调后模型表现**:基于Qwen2.5-3B 微调后的模型在代码生成任务上的表现提升。
- V1
- | pass@1 | pass@3 | pass@5 |
| --------------- | --------------- | --------------- |
| 17.68%(+1.95%) | 37.07%(+5.79%) | 47.56%(+6.71%) |
- V2
- | pass@1 | pass@3 | pass@5 |
| ---------------- | ---------------- | ---------------- |
| 19.88%(+4.15%) | 39.70%(+8.42%) | 50.61%(+9.76%) |
在代码能力评测结果中,Qwen2.5-3B-Instruct 模型的 pass@k 指标略低于 Qwen2.5-Coder-3B 模型。而同尺寸下我们的 V1 版本全面领先,在 V2 版本优化数据配比后又进一步有了显著的代码能力提升,模型的 pass@k 指标提升了4~10%,猜测在pass@100的指标上提升更大。
评测实验结果由下图可直观对比:


## 4. 总结
本次比赛中,我们通过对DeepSeek-R1进行数据蒸馏以及基于 Qwen2.5-3B 预训练模型的微调,成功实现了模型代码生成能力的显著提升。我们的方法在技术实现难度、模型表现效果以及实际使用价值等方面展现了独特的优势。
#### 4.1 技术实现难度
在数据处理方面,我们面临了海量代码数据的清洗和筛选挑战。通过对多个开源代码数据集的整合,结合自研的数据清洗工具,我们实现了精准去重、语义去重、打分和标签筛选等一系列复杂操作,最终从庞大的数据集中筛选出高质量的微调数据。这一过程不仅考验了团队的技术能力,也体现了我们在数据处理策略上的创新性和有效性。
在模型微调方面,我们采用了全参数微调技术,并结合 LLaMA-Factory 框架进行训练。通过对训练超参数的精细调整,我们成功优化了模型的性能,使其在代码生成任务上表现出色。此外,我们还通过 SwanLab 工具实现了训练过程的可视化和日志记录,进一步提升了训练和迭代效率。
#### 4.2 模型表现效果
在 HumanEval 代码能力评测基准上,我们的微调模型表现卓越。与基线模型相比,我们的 V1 版本在 pass@k 指标上全面领先,V2 版本在进一步优化数据配比后,代码生成能力显著提升,pass@k 指标提升了 4% 至 10%。这一结果不仅证明了我们数据清洗和模型微调方法的有效性,也展示了我们在代码生成领域的技术优势。
#### 4.3 实际使用价值
我们的微调模型具备强大的逻辑推理、指令遵循和代码生成能力,能够理解复杂的编程领域专业术语,并生成规范且高效的代码。这使得模型在实际应用中具有广泛的适用性,例如在软件开发、算法竞赛以及编程教育等领域。此外,我们开源的模型权重和代码仓库也为其他研究者和开发者提供了宝贵的资源,促进了技术的共享和进一步发展。
#### 4.4 未来展望
未来,我们将继续探索将现有的结论和技术迁移到更大的模型尺寸并持续优化训练策略,探索更多应用场景。一方面,我们计划进一步提升模型在复杂编程任务上的表现,例如支持更多编程语言和更高级的算法问题。另一方面,我们也将探索模型在代码补全、代码优化以及代码调试等任务上的应用潜力,为开发者提供更全面的编程辅助工具。
总之,本次比赛的成功不仅体现了我们在技术实现上的优势,也为未来的研究和应用奠定了坚实的基础。我们期待在代码生成领域取得更多突破,为人工智能技术的发展贡献力量。
## 5. 参考链接
- **代码仓库**:https://www.modelscope.cn/datasets/Tsumugii24/GDC2025-Competition
- **SwanLab日志**:
- https://swanlab.cn/@Tsumugii24/llamafactory/runs/d4rjg2mr2q24sa7fq6znu/chart
- https://swanlab.cn/@Tsumugii24/llamafactory/runs/uihw5chtdj36roevac8hp/chart
- **开源模型权重**:
- https://modelscope.cn/models/Tsumugii24/Qwen2.5-3B-Code-full-0218
- https://modelscope.cn/models/Tsumugii24/Qwen2.5-3B-Code-full-0219
- **Demo体验**:
- 备注:0218,即V1版本在创空间中使用CPU部署,推理速度非常慢;建议体验0219,效果更好,速度更快,感谢**xGPU乐园**提供的GPU资源支持
- https://modelscope.cn/studios/Tsumugii24/Qwen2.5-3B-Code-full-0218-demo
- https://modelscope.cn/studios/Tsumugii24/Qwen2.5-3B-Code-full-0219-demo (推荐体验)
提供机构:
maas
创建时间:
2025-02-20
搜集汇总
数据集介绍

以上内容由遇见数据集搜集并总结生成



