nomic-ai/cornstack-javascript-v1
Source: Hugging Face. Updated 2025-03-27; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/nomic-ai/cornstack-javascript-v1
Resource description:
CoRNStack is a large-scale, high-quality training dataset for code retrieval across multiple programming languages. It consists of `<query, positive, negative>` triplets used to train models such as nomic-embed-code, CodeRankEmbed, and CodeRankLLM. Construction starts from the deduplicated Stack v2: text-code pairs are created from function docstrings and their corresponding code, then passed through a series of quality filters, including dual-consistency filtering. During training, a curriculum-based hard negative mining strategy ensures the model learns from challenging examples.
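The triplet format and the filtering step described above can be illustrated with a minimal Python sketch. It is not the CoRNStack pipeline. The similarity matrix `sim`, the `top_k` and `min_sim` defaults, and the helper names `dual_consistency_keep` and `build_triplets` are all assumptions made for illustration. The filter rule (keep a pair only if its code ranks near the top for its query and also clears a similarity threshold) is one plausible reading of "dual-consistency filtering", and picking the single highest-scoring non-paired snippet is a simplified stand-in for curriculum-based hard negative mining.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    query: str     # docstring text
    positive: str  # code paired with that docstring
    negative: str  # a different snippet that scores high for the query


def dual_consistency_keep(sim, top_k=2, min_sim=0.7):
    """Return indices i whose pair (query i, code i) passes both checks.

    sim[i][j] is an assumed similarity score between query i and code j,
    e.g. from an embedding model. Thresholds here are illustrative only.
    """
    kept = []
    for i, row in enumerate(sim):
        own = row[i]
        # Rank of the paired code among all candidates (0 = best match).
        rank = sum(1 for s in row if s > own)
        if rank < top_k and own >= min_sim:
            kept.append(i)
    return kept


def build_triplets(queries, codes, sim, top_k=2, min_sim=0.7):
    """Assemble <query, positive, negative> triplets from surviving pairs.

    The negative is simply the highest-scoring non-paired snippet, which is
    a simplification of curriculum-based hard negative mining.
    """
    triplets = []
    for i in dual_consistency_keep(sim, top_k, min_sim):
        candidates = [j for j in range(len(codes)) if j != i]
        if not candidates:
            continue  # a negative needs at least one other snippet
        hardest = max(candidates, key=lambda j: sim[i][j])
        triplets.append(Triplet(queries[i], codes[i], codes[hardest]))
    return triplets
```

Calling `build_triplets(queries, codes, sim)` with equal-length lists and a square score matrix returns one `Triplet` per pair that survives the filter.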
Provider:
nomic-ai



