modularStarEncoder/SynthCode2Code2NL-neardedup
Source: Hugging Face. Updated: 2025-03-06. Indexed: 2025-04-26.
Download link:
https://hf-mirror.com/datasets/modularStarEncoder/SynthCode2Code2NL-neardedup
Description:
The SynthCode2Code2NL-neardedup corpus is a dataset of (comment, code, code) triplets built from CodeSearchNet for code search. The code snippets in the secondary languages were generated with Qwen 2.5 Coder-7B-Instruct, and the dataset has been near-deduplicated. It covers multiple programming languages plus English natural language, and was used to fine-tune the ModularStarEncoder-finetuned model.
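The listing does not say how the near deduplication was carried out. As a rough illustration of the general idea only, the sketch below drops triplets whose code is almost identical to one already kept, using Jaccard similarity over token shingles. The `Triplet` field names, the shingle size, and the 0.8 threshold are all assumptions made for this example, not details of the actual dataset or its pipeline.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    # Hypothetical field names; the real dataset columns may differ.
    comment: str
    code: str
    translated_code: str


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-token windows of whitespace-split text."""
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(triplets: list[Triplet], threshold: float = 0.8) -> list[Triplet]:
    """Keep a triplet only if its code is below the similarity
    threshold against every triplet kept so far (quadratic, for clarity)."""
    kept: list[Triplet] = []
    kept_shingles: list[set] = []
    for t in triplets:
        s = shingles(t.code)
        if all(jaccard(s, seen) < threshold for seen in kept_shingles):
            kept.append(t)
            kept_shingles.append(s)
    return kept
```

This pairwise comparison is only practical for small samples; large corpora typically rely on an approximate scheme such as MinHash with locality-sensitive hashing, which may or may not be what was used here.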
Provider:
modularStarEncoder



