cornstack-java-v1
收藏魔搭社区2025-12-03 更新2025-03-29 收录
下载链接:
https://modelscope.cn/datasets/nomic-ai/cornstack-java-v1
下载链接
链接失效反馈官方服务:
资源简介:
# CoRNStack Python Dataset
The CoRNStack Dataset, accepted to [ICLR 2025](https://arxiv.org/abs/2412.01007), is a large-scale high quality training dataset specifically for code retrieval across multiple
programming languages. This dataset comprises of `<query, positive, negative>` triplets used to train [nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code),
[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), and [CodeRankLLM](https://huggingface.co/nomic-ai/CodeRankLLM).
## CoRNStack Dataset Curation
Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and respective code. We filtered out low-quality pairs where the docstring wasn't English, too short, or that contained URLs, HTML tags, or invalid characters. We additionally kept docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies.

After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove pairs from the dataset if the corresponding code example is not found in the top-2 most similar examples for a given docstring.
During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives with increasing difficulty over time.
## Join the Nomic Community
- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
# Citation
If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
# CoRNStack Python 数据集
已被ICLR 2025收录的CoRNStack数据集,是一款专为多编程语言代码检索任务打造的大规模高质量训练数据集。该数据集包含用于训练nomic-embed-code、CodeRankEmbed以及CodeRankLLM的`<查询、正样本、负样本>`三元组。
## CoRNStack 数据集构建流程
我们以去重后的Stackv2数据集为起点,从函数文档字符串(docstring)及其对应的代码中提取文本-代码配对样本。我们过滤掉以下低质量配对:文档字符串非英语、长度过短,或包含URL、HTML标签、非法字符的样本。此外,我们仅保留文本长度不低于256个Token的文档字符串,以助力模型学习长距离依赖关系。

在初步过滤完成后,我们采用双一致性过滤策略来去除可能存在噪声的样本。我们对每一组文档字符串-代码配对进行嵌入表示,并计算每一条文档字符串与所有代码样本之间的相似度。若某一文档字符串对应的代码样本未进入其最相似的前2个代码样本之列,则将该配对从数据集中移除。
在训练阶段,我们采用一种基于课程学习的新型难例挖掘策略,以确保模型从具有挑战性的样本中学习。我们采用基于Softmax的采样策略,随训练进程逐步采样难度逐渐提升的难例负样本。
## 加入Nomic社区
- Nomic Embed 生态平台:[https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- 官网:[https://nomic.ai](https://nomic.ai)
- Twitter(推特):[https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord:[https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
# 引用
如果您认为本模型、数据集或训练代码对您的研究有所帮助,请引用我们的工作:
bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
提供机构:
maas
创建时间:
2025-03-04



