cornstack-go-v1
Source: ModelScope community. Updated 2025-10-09; indexed 2025-03-29.
Download link:
https://modelscope.cn/datasets/nomic-ai/cornstack-go-v1
Description:
# CoRNStack Go Dataset
The CoRNStack Dataset, accepted to [ICLR 2025](https://arxiv.org/abs/2412.01007), is a large-scale, high-quality training dataset built specifically for code retrieval across multiple
programming languages. This dataset comprises `<query, positive, negative>` triplets used to train [nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code),
[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), and [CodeRankLLM](https://huggingface.co/nomic-ai/CodeRankLLM).
## CoRNStack Dataset Curation
Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and their respective code. We filter out low-quality pairs where the docstring is not in English, is too short, or contains URLs, HTML tags, or invalid characters. We additionally keep docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies.
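
The heuristic checks above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used to build the dataset: the function names, the regular expressions, the `MIN_WORDS` threshold, and the ASCII-ratio stand-in for language identification are all assumptions made for the example. The 256-token length rule is not modeled here.

```python
import re

# Hypothetical patterns and thresholds; the dataset card does not specify them.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
MIN_WORDS = 3


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude stand-in for language ID: share of ASCII letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_ratio


def has_invalid_chars(text: str) -> bool:
    """Flag replacement characters and control characters other than whitespace."""
    return any(ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r") for ch in text)


def keep_docstring(doc: str) -> bool:
    """Return True if the docstring passes every heuristic check."""
    if len(doc.split()) < MIN_WORDS:
        return False
    if not looks_english(doc):
        return False
    if URL_RE.search(doc) or HTML_TAG_RE.search(doc):
        return False
    if has_invalid_chars(doc):
        return False
    return True
```

A real pipeline would likely swap `looks_english` for a proper language-identification model and count length in tokenizer tokens rather than whitespace-separated words.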

After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove pairs from the dataset if the corresponding code example is not found in the top-2 most similar examples for a given docstring.
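
As a rough sketch of the top-2 check described above, the following pure-Python example assumes the docstring and code embeddings are already computed and uses cosine similarity. The helper names and the choice of cosine similarity are assumptions for illustration, and a real implementation at this scale would use batched matrix operations or approximate nearest-neighbor search instead of nested loops.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency_filter(doc_vecs, code_vecs, top_k=2):
    """Return indices i whose paired code ranks within the top_k codes for docstring i.

    doc_vecs[i] and code_vecs[i] are assumed to be the embeddings of pair i.
    """
    kept = []
    for i, doc in enumerate(doc_vecs):
        sims = [cosine(doc, code) for code in code_vecs]
        # Rank of the paired code = number of codes that score strictly higher.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < top_k:
            kept.append(i)
    return kept
```

Calling `consistency_filter(doc_vecs, code_vecs)` returns the indices of the pairs to keep; every other pair is treated as noisy and dropped.
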
During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives with increasing difficulty over time.
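
One way to picture softmax-based sampling with increasing difficulty is a temperature that shrinks as training progresses, so that probability mass concentrates on the candidates most similar to the query. The sketch below is an assumption-laden illustration, not the training code: the linear schedule, the `t_start` and `t_end` values, and all function names are hypothetical.

```python
import math
import random


def temperature(step, total_steps, t_start=1.0, t_end=0.05):
    """Linearly anneal the temperature; lower values favor harder negatives."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def softmax(scores, temp):
    """Numerically stable softmax over scores divided by temp."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temp) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_negatives(sims, k, step, total_steps, rng=random):
    """Sample up to k distinct candidate indices, weighted by softmax(sims / T).

    sims[j] is the query-to-candidate similarity of negative candidate j.
    """
    probs = softmax(sims, temperature(step, total_steps))
    pool = list(range(len(sims)))
    chosen = []
    for _ in range(min(k, len(pool))):
        pick = rng.choices(range(len(pool)), weights=[probs[j] for j in pool])[0]
        chosen.append(pool.pop(pick))
    return chosen
```

Early in training the distribution is close to flat over the candidates, so easy negatives are sampled often. Late in training the low temperature makes the highest-similarity (hardest) candidates dominate.
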
## Join the Nomic Community
- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
# Citation
If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
```
Provider:
maas
Created:
2025-03-04



