modularStarEncoder/SynthCoNL-neardedup
Source: Hugging Face | Updated: 2025-05-20 | Indexed: 2025-11-01
Download link:
https://hf-mirror.com/datasets/modularStarEncoder/SynthCoNL-neardedup
Description:
SynthCoNL-neardedup is a corpus of (comment, code, code) triplets generated from the CodeSearchNet dataset and used to fine-tune the ModularStarEncoder-finetuned model. It was produced with the near-deduplication process described in the paper "MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings". The corpus covers several programming languages, including Go, Java, JavaScript, PHP, Python, Ruby, C++, and C, paired with English natural language. Each sample carries information about a function from a GitHub repository, such as the repository name, the function's path, its source code, and its English documentation.
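A minimal sketch of how the (comment, code, code) triplets might be read with the Hugging Face `datasets` library. The dataset ID comes from the download link; the `"train"` split name and every column name are assumptions, since this listing does not give them. The helper therefore takes the column names as arguments rather than hard-coding them.

```python
from typing import Any, Dict, Iterator, Tuple

# Dataset ID taken from the download link on this page.
DATASET_ID = "modularStarEncoder/SynthCoNL-neardedup"


def stream_records(split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily yield raw records from the Hub.

    Requires `pip install datasets`. The default split name "train" is an
    assumption; check the dataset card for the splits that actually exist.
    """
    # Imported here so that to_triplet() below stays usable without the library.
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it all first.
    yield from load_dataset(DATASET_ID, split=split, streaming=True)


def to_triplet(
    record: Dict[str, Any],
    comment_key: str,
    code_key: str,
    paired_code_key: str,
) -> Tuple[str, str, str]:
    """Build one (comment, code, code) triplet from a record.

    The three keys are supplied by the caller because the real column
    names are not stated in this listing (hypothetical until verified).
    """
    return (record[comment_key], record[code_key], record[paired_code_key])
```

Before calling `to_triplet`, inspect one record (for example `next(iter(stream_records())).keys()`) to find the actual column names, then pass them in.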
Provider:
modularStarEncoder



