lumees/codesearchnet-hard-negatives
Source: Hugging Face | Updated: 2025-11-28 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/lumees/codesearchnet-hard-negatives
Resource description:
---
language:
- en
tags:
- code
- retrieval
- semantic-search
- hard-negatives
- contrastive-learning
- lumees-ai
- python
- java
- go
- php
- ruby
- javascript
license: mit
task_categories:
- sentence-similarity
- feature-extraction
size_categories:
- 1M<n<10M
pretty_name: CodeSearchNet Hard Negatives (Filtered)
---
# CodeSearchNet Hard Negatives (Filtered) by Lumees AI
## Dataset Summary
This dataset is a processed version of the [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net) dataset, enhanced with **hard negative mining** to support training code retrieval models.
It was created by **Lumees AI** to improve the ability of embedding models to distinguish between syntactically similar but functionally different code snippets.
* **Developer:** [Lumees AI](https://lumees.io)
* **Authors:** Hasan Kurşun, Kerem Berkay Yanık
* **Contact:** hello@lumees.io
* **Source Data:** CodeSearchNet (Train split)
* **Total Samples:** ~1.88M triplets/tuples
## Dataset Structure
The dataset is provided in `.jsonl` format. Each line represents a training sample containing a natural language query, the positive ground truth code, and a list of mined hard negatives.
### Data Fields
* `query` (string): The natural language docstring/description of the function.
* `pos` (string): The positive (ground truth) code snippet.
* `neg` (list of strings): A list of hard negative code snippets (semantically similar to the query but incorrect).
* `scores` (list of floats): The cosine similarity scores of the negative candidates against the query (computed by the mining model).
### Example Instance
```json
{
  "query": "CommentsView sub-view (will be used recursively)",
  "pos": "function ThreadBranchView(vm) { ... }",
  "neg": [
    "function CommentReplyView(vm, comment) { ... }",
    "public function viewAction() { ... }"
  ],
  "scores": [0.7502, 0.7481]
}
```
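
A minimal sketch of reading records in this format, using only the Python standard library. The function name `iter_samples` is hypothetical, and the length check assumes that `neg` and `scores` are parallel lists (one score per negative), which matches the example instance but is not stated as a guarantee by the dataset authors.

```python
import json
from typing import Iterable, Iterator


def iter_samples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: scores[i] is the similarity score of neg[i].
        if len(record["neg"]) != len(record["scores"]):
            raise ValueError(f"line {line_no}: neg/scores length mismatch")
        yield record


# Inline stand-in for a .jsonl file; an open file handle can be passed instead.
sample_lines = [
    json.dumps({
        "query": "CommentsView sub-view (will be used recursively)",
        "pos": "function ThreadBranchView(vm) { ... }",
        "neg": [
            "function CommentReplyView(vm, comment) { ... }",
            "public function viewAction() { ... }",
        ],
        "scores": [0.7502, 0.7481],
    }),
]

samples = list(iter_samples(sample_lines))
print(samples[0]["query"], len(samples[0]["neg"]))
```

Because the generator reads one line at a time, it can stream a file of this size without loading everything into memory.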
## Methodology & Creation
### Source Model
The mining process utilized **[Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)**, a high-performance embedding model, to generate vector representations for both queries and code.
### Mining Process
The dataset was constructed using a dense retrieval approach on the entire CodeSearchNet training corpus across 6 languages (Python, Java, Go, PHP, Ruby, JavaScript).
1. **Embedding:** All code snippets in the corpus were encoded into dense vectors.
2. **Retrieval:** For every query, we retrieved the 50 most similar code candidates from the corpus by cosine similarity, computed with GPU-accelerated matrix multiplication.
3. **Filtration:**
    * **Self-Exclusion:** The positive ground truth was removed from results.
    * **Duplicate Removal:** Exact string duplicates of the positive code were removed.
    * **Score Thresholding:**
        * **Max Similarity (0.95):** Candidates with scores above 0.95 were discarded to avoid false negatives (valid code that is too similar to the ground truth).
        * **Min Similarity (0.35):** Candidates with scores below 0.35 were discarded to ensure the negatives are "hard" enough to be useful for training (avoiding easy negatives).
4. **Selection:** Up to the top **12** valid hard negatives were selected for each query.
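
The four steps above can be sketched for a single query as follows. This is an illustration under stated assumptions, not the authors' pipeline: plain Python loops stand in for the GPU matrix multiplication, the toy 2-D vectors stand in for real embeddings, and the names `cosine` and `mine_hard_negatives` are hypothetical. The thresholds, the top-50 cutoff, and the cap of 12 come from the description above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_hard_negatives(query_vec, code_vecs, codes, pos_index,
                        top_k=50, min_sim=0.35, max_sim=0.95, max_neg=12):
    """Return (negatives, scores) for one query."""
    sims = [cosine(query_vec, v) for v in code_vecs]
    # Retrieval: indices of the top_k most similar snippets, best first.
    ranked = sorted(range(len(codes)), key=lambda i: sims[i], reverse=True)[:top_k]

    negatives, scores = [], []
    for i in ranked:
        if i == pos_index:                       # self-exclusion
            continue
        if codes[i] == codes[pos_index]:         # exact duplicate of the positive
            continue
        if not (min_sim <= sims[i] <= max_sim):  # score thresholding
            continue
        negatives.append(codes[i])
        scores.append(sims[i])
        if len(negatives) == max_neg:            # selection cap
            break
    return negatives, scores


# Toy corpus: index 0 is the positive, index 1 is an exact duplicate of it.
codes = ["pos", "pos", "near", "hard_a", "hard_b", "easy"]
code_vecs = [[1, 0.1], [1, 0.1], [1, 0.2], [1, 1], [1, 2], [0, 1]]
negatives, scores = mine_hard_negatives([1, 0], code_vecs, codes, pos_index=0)
print(negatives)  # ['hard_a', 'hard_b']
```

In the toy run, `near` is dropped for scoring above 0.95 and `easy` for scoring below 0.35, so only the two mid-range candidates survive.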
## Intended Use
This dataset is optimized for:
* **Contrastive Learning:** Fine-tuning embedding models using losses like `MultipleNegativesRankingLoss` or `TripletLoss`.
* **Code Retrieval:** Improving search relevance in IDEs or code search engines.
* **Cross-Lingual Alignment:** The dataset includes cross-lingual negatives (e.g., a Python query retrieving similar PHP code), helping models learn language-agnostic semantic features.
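
As a rough illustration of the contrastive-learning use case, the sketch below expands one record into (anchor, positive, negative) triplets, the input shape that triplet-style losses typically expect. The helper `to_triplets` and its `max_negatives` parameter are hypothetical; how the triplets are then fed to a particular training library is left out because it depends on that library's API.

```python
def to_triplets(record, max_negatives=None):
    """Expand one record into (query, positive, negative) tuples."""
    # Slicing with None keeps every negative; an integer keeps the first N.
    negatives = record["neg"][:max_negatives]
    return [(record["query"], record["pos"], neg) for neg in negatives]


record = {
    "query": "CommentsView sub-view (will be used recursively)",
    "pos": "function ThreadBranchView(vm) { ... }",
    "neg": [
        "function CommentReplyView(vm, comment) { ... }",
        "public function viewAction() { ... }",
    ],
    "scores": [0.7502, 0.7481],
}

triplets = to_triplets(record)
for anchor, positive, negative in triplets:
    print(anchor, "|", positive, "|", negative)
```

Capping `max_negatives` is one way to trade training-set size against the number of negatives seen per query.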
## Licensing
This dataset adheres to the licensing terms of the original **CodeSearchNet** dataset (MIT/Permissive). Users should verify specific licensing requirements for individual code snippets if used for commercial code generation.
## Citation
If you use this dataset, please cite Lumees AI and the original CodeSearchNet paper:
```bibtex
@misc{lumees2025hardnegatives,
author = {Kurşun, Hasan and Yanık, Kerem Berkay},
title = {CodeSearchNet Hard Negatives (Filtered)},
year = {2025},
publisher = {Lumees AI},
howpublished = {\url{https://lumees.io}},
email = {hello@lumees.io}
}
@article{husain2019codesearchnet,
title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
journal={arXiv preprint arXiv:1909.09436},
year={2019}
}
```
Provider:
lumees



