nomic-ai/cornstack-php-v1
Source: Hugging Face | Updated: 2025-03-27 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/nomic-ai/cornstack-php-v1
Description:
CoRNStack PHP is the PHP portion of CoRNStack, a large-scale, high-quality training dataset for code retrieval across multiple programming languages. It consists of `<query, positive, negative>` triplets and was used to train the nomic-embed-code, CodeRankEmbed, and CodeRankLLM models. Text-code pairs are built from the deduplicated Stack v2 and rigorously filtered for quality, including dual-consistency filtering; training on the data uses a curriculum-based hard negative mining strategy.
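To make the triplet format and the dual-consistency filtering mentioned in the description more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the dataset authors' pipeline: the `Triplet` field names, the `dual_consistency_filter` helper, and the `top_k=2` and `threshold=0.7` defaults are all hypothetical choices made for this example, and the real dataset columns may be named differently. The idea shown is that a text-code pair is kept only if its own code ranks among the best matches for its query and the pair's similarity is high enough.

```python
import math
from dataclasses import dataclass


@dataclass
class Triplet:
    """One training record; field names are assumptions for illustration."""
    query: str     # natural-language text, e.g. a docstring
    positive: str  # code that matches the query
    negative: str  # code that does not match the query


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def dual_consistency_filter(query_vecs, code_vecs, top_k=2, threshold=0.7):
    """Return indices i of pairs (query i, code i) that pass both checks.

    Check 1: fewer than top_k other codes score strictly higher for query i.
    Check 2: the similarity of the pair itself is at least threshold.
    Both default values are illustrative assumptions.
    """
    kept = []
    for i, q in enumerate(query_vecs):
        sims = [cosine(q, c) for c in code_vecs]
        # Number of candidate codes that beat the paired code for this query.
        better = sum(1 for s in sims if s > sims[i])
        if better < top_k and sims[i] >= threshold:
            kept.append(i)
    return kept


# Toy embeddings: pair 1 is a poor match and should be filtered out.
query_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
code_vecs = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]
print(dual_consistency_filter(query_vecs, code_vecs))  # prints [0, 2]
```

In this toy run, pair 1 is dropped because its code is the worst match for its own query, while pairs 0 and 2 survive both checks. A real pipeline would compute the embeddings with a trained model over millions of pairs rather than hand-written vectors.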
Provider:
nomic-ai



