modularStarEncoder/SynthCode2Code2NL-neardedup

Name: modularStarEncoder/SynthCode2Code2NL-neardedup
Creator: modularStarEncoder
Published: 2025-03-06 02:50:03
License: No description available

Source: Hugging Face. Updated 2025-03-06; indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/modularStarEncoder/SynthCode2Code2NL-neardedup

Description:

The SynthCode2Code2NL-neardedup corpus is a dataset of (comment, code, code) triplets generated from CodeSearchNet for human code search. The code in the secondary languages was generated with Qwen 2.5 Coder-7B-Instruct, and the dataset has been near-deduplicated. It covers multiple programming languages plus English natural language, and it was used to fine-tune the ModularStarEncoder-finetuned model.
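The description does not say how the near-deduplication was carried out. As a rough illustration of the general idea only, the sketch below drops snippets whose token-shingle Jaccard similarity to an already kept snippet reaches a threshold. The function names, the 3-token shingle size, and the 0.8 threshold are all assumptions made for this example, not details taken from the dataset.

```python
import re


def shingles(code: str, k: int = 3) -> set:
    """Return the set of k-token shingles of a code snippet (hypothetical helper)."""
    # Split into identifiers/numbers and single punctuation marks; whitespace is ignored.
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(snippets, threshold: float = 0.8, k: int = 3):
    """Greedily keep a snippet only if it is below the threshold against every kept one."""
    kept, kept_shingles = [], []
    for snippet in snippets:
        current = shingles(snippet, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(snippet)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    samples = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return a + b  ",  # differs only in trailing whitespace
        "def mul(x, y):\n    return x * y",
    ]
    for snippet in near_dedup(samples):
        print(repr(snippet))
```

This pairwise comparison is quadratic in the number of snippets, so it only suits small samples; corpora of this size are usually deduplicated with an approximate scheme such as MinHash with locality-sensitive hashing.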

Provider:

modularStarEncoder
