
nomic-ai/cornstack-java-v1

Source: Hugging Face. Updated: 2025-03-27. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/nomic-ai/cornstack-java-v1
Description:
The CoRNStack dataset is a large-scale, high-quality training dataset for code retrieval across multiple programming languages. It consists of `<query, positive, negative>` triplets used to train the nomic-embed-code, CodeRankEmbed, and CodeRankLLM models. Construction starts from the deduplicated Stackv2: high-quality text-code pairs are built from function docstrings and their corresponding code, and dual-consistency filtering is then applied to remove noisy examples. A novel curriculum-based hard negative mining strategy is employed during training.
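
To make the triplet format and the filtering idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the dataset's actual pipeline: the `Triplet` class, the `consistency_filter` function, the two checks (rank and score), and the `top_k` and `threshold` values are all hypothetical, and the similarity numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """One training example: a text query, matching code, and non-matching code."""
    query: str
    positive: str
    negative: str


def consistency_filter(
    sim: List[List[float]], top_k: int = 2, threshold: float = 0.7
) -> List[int]:
    """Return indices i of (text_i, code_i) pairs that pass both checks.

    sim[i][j] is a similarity score between text i and code j.
    Check 1 (rank): code_i is among the top_k most similar codes for text_i.
    Check 2 (score): sim[i][i] is at least `threshold`.
    Both checks and their default values are assumptions for illustration.
    """
    kept = []
    for i, row in enumerate(sim):
        # Count how many codes score strictly higher than the paired code.
        rank = sum(1 for s in row if s > row[i])
        if rank < top_k and row[i] >= threshold:
            kept.append(i)
    return kept


# Toy similarity matrix for three text-code pairs (made-up numbers).
toy_sim = [
    [0.92, 0.10, 0.30],  # pair 0: ranked first with a high score, kept
    [0.80, 0.75, 0.85],  # pair 1: paired code ranked third, dropped
    [0.20, 0.10, 0.40],  # pair 2: ranked first but score too low, dropped
]
kept_indices = consistency_filter(toy_sim)

# A surviving pair would later be combined with a mined negative.
example = Triplet(
    query="Returns the sum of two integers.",
    positive="int add(int a, int b) { return a + b; }",
    negative="int sub(int a, int b) { return a - b; }",
)
```

In this sketch only pair 0 survives the filter. In a real pipeline the similarity scores would come from an embedding model, and the negative in each triplet would be mined from other code snippets rather than written by hand.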
Provider:
nomic-ai