cornstack-javascript-v1

Name: cornstack-javascript-v1
Creator: maas
Published: 2025-12-03 17:23:25
License: No description available

ModelScope community · Updated 2025-12-03 · Indexed 2025-03-29

Download link:

https://modelscope.cn/datasets/nomic-ai/cornstack-javascript-v1

Resource description:

# CoRNStack Javascript Dataset

The CoRNStack Dataset, accepted to [ICLR 2025](https://arxiv.org/abs/2412.01007), is a large-scale, high-quality training dataset built specifically for code retrieval across multiple programming languages. This dataset comprises `<query, positive, negative>` triplets used to train [nomic-embed-code](https://huggingface.co/nomic-ai/nomic-embed-code), [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), and [CodeRankLLM](https://huggingface.co/nomic-ai/CodeRankLLM).

## CoRNStack Dataset Curation

Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and their respective code. We filtered out low-quality pairs where the docstring wasn't English, was too short, or contained URLs, HTML tags, or invalid characters. We additionally kept docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/8aLYzi1AxGxTKRb5-9m0L.png)

After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove a pair from the dataset if its code example is not among the top-2 most similar code examples for its docstring.

During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives of increasing difficulty over the course of training.

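The softmax-based curriculum sampling described above can be illustrated with a minimal sketch. This is not the authors' training code: the linear temperature schedule, the function names (`softmax`, `temperature_at`, `sample_hard_negatives`), the default temperatures, and sampling with replacement are all assumptions made for illustration. The idea shown is that a high temperature early in training spreads probability across many negatives, while a low temperature later concentrates it on the negatives most similar to the query.

```python
import math
import random


def softmax(scores, temperature):
    """Numerically stable softmax over similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_at(step, total_steps, start=1.0, end=0.05):
    """Assumed schedule: anneal the temperature linearly from start to end."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def sample_hard_negatives(similarities, positive_index, k, step, total_steps, rng):
    """Sample k negative indices (with replacement) for one query.

    similarities[i] is the query-to-candidate similarity for candidate i.
    The positive is excluded; the remaining candidates are weighted by a
    softmax whose temperature shrinks as training progresses.
    """
    candidates = [i for i in range(len(similarities)) if i != positive_index]
    temperature = temperature_at(step, total_steps)
    probs = softmax([similarities[i] for i in candidates], temperature)
    return rng.choices(candidates, weights=probs, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    sims = [0.9, 0.8, 0.1, 0.2]  # toy scores; candidate 0 is the positive
    print("early:", sample_hard_negatives(sims, 0, 5, step=0, total_steps=100, rng=rng))
    print("late: ", sample_hard_negatives(sims, 0, 5, step=100, total_steps=100, rng=rng))
```

With the toy scores above, late-training samples are drawn almost entirely from candidate 1, the hardest negative, whereas early samples are spread more evenly.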
## Join the Nomic Community

- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)

# Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
      title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
      author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
      year={2025},
      eprint={2412.01007},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.01007},
}
```

Provider:

maas

Created:

2025-03-04

Background overview

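The CoRNStack Javascript Dataset is a large-scale, high-quality training dataset for code retrieval. It contains `<query, positive, negative>` triplets used to train models such as nomic-embed-code. It is built from Stackv2, relies on heuristic filtering followed by dual-consistency filtering to ensure data quality, and uses a curriculum-based hard negative mining strategy to improve model performance.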

The summary above was collected and generated by 遇见数据集.
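The dual-consistency filtering step mentioned in the summary can be sketched as follows. This is a toy illustration rather than the authors' pipeline: only the top-2 rule comes from the dataset description, while the function names (`cosine`, `consistency_filter`), the use of cosine similarity, and the brute-force comparison of every docstring against every code embedding are assumptions. A real run would use embeddings from a trained model and a vectorized or approximate nearest-neighbor search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consistency_filter(doc_embeddings, code_embeddings, top_k=2):
    """Return the indices of the pairs that pass the top-k check.

    Pair i is kept only if code i is among the top_k code examples
    most similar to docstring i.
    """
    kept = []
    for i, doc in enumerate(doc_embeddings):
        sims = [cosine(doc, code) for code in code_embeddings]
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:top_k]:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D embeddings; pair 2 is deliberately mismatched.
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
    codes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(consistency_filter(docs, codes))
```

In the toy data, the third docstring points in nearly the opposite direction from its paired code embedding, so that pair is dropped while the first two are kept.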