lumees/codesearchnet-hard-negatives

Name: lumees/codesearchnet-hard-negatives
Creator: lumees
Published: 2025-11-28 02:09:33
License: No description provided

Source: Hugging Face. Updated 2025-11-28; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/lumees/codesearchnet-hard-negatives


Resource description:

---
language:
- en
tags:
- code
- retrieval
- semantic-search
- hard-negatives
- contrastive-learning
- lumees-ai
- python
- java
- go
- php
- ruby
- javascript
license: mit
task_categories:
- sentence-similarity
- feature-extraction
size_categories:
- 1M<n<10M
pretty_name: CodeSearchNet Hard Negatives (Filtered)
---

# CodeSearchNet Hard Negatives (Filtered) by Lumees AI

## Dataset Summary

This dataset is a processed version of the [CodeSearchNet](https://huggingface.co/datasets/code-search-net/code_search_net) dataset, enhanced with **Hard Negative Mining** to support the training of code retrieval models. It was created by **Lumees AI** to improve the ability of embedding models to distinguish between syntactically similar but functionally different code snippets.

* **Developer:** [Lumees AI](https://lumees.io)
* **Authors:** Hasan Kurşun, Kerem Berkay Yanık
* **Contact:** hello@lumees.io
* **Source Data:** CodeSearchNet (Train split)
* **Total Samples:** ~1.88M triplets/tuples

## Dataset Structure

The dataset is provided in `.jsonl` format. Each line represents one training sample containing a natural language query, the positive ground truth code, and a list of mined hard negatives.

### Data Fields

* `query` (string): The natural language docstring/description of the function.
* `pos` (string): The positive (ground truth) code snippet.
* `neg` (list of strings): A list of hard negative code snippets (semantically similar to the query but incorrect).
* `scores` (list of floats): The cosine similarity scores of the negative candidates against the query (computed by the mining model).

### Example Instance

```json
{
  "query": "CommentsView sub-view (will be used recursively)",
  "pos": "function ThreadBranchView(vm) { ... }",
  "neg": [
    "function CommentReplyView(vm, comment) { ... }",
    "public function viewAction() { ... }"
  ],
  "scores": [0.7502, 0.7481]
}
```

## Methodology & Creation

### Source Model

The mining process used **[Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)**, an embedding model, to generate vector representations for both queries and code.

### Mining Process

The dataset was constructed using a dense retrieval approach on the entire CodeSearchNet training corpus across 6 languages (Python, Java, Go, PHP, Ruby, JavaScript).

1. **Embedding:** All code snippets in the corpus were encoded into dense vectors.
2. **Retrieval:** For every query, we retrieved the top 50 semantic candidates from the corpus using GPU-accelerated matrix multiplication.
3. **Filtration:**
    * **Self-Exclusion:** The positive ground truth was removed from the results.
    * **Duplicate Removal:** Exact string duplicates of the positive code were removed.
    * **Score Thresholding:**
        * **Max Similarity (0.95):** Candidates with scores above 0.95 were discarded to avoid false negatives (valid code that is too similar to the ground truth).
        * **Min Similarity (0.35):** Candidates with scores below 0.35 were discarded so that the remaining negatives are "hard" enough to be useful for training (avoiding easy negatives).
4. **Selection:** Up to the top **12** valid hard negatives were selected for each query.

## Intended Use

This dataset is intended for:

* **Contrastive Learning:** Fine-tuning embedding models using losses like `MultipleNegativesRankingLoss` or `TripletLoss`.
* **Code Retrieval:** Improving search relevance in IDEs or code search engines.
* **Cross-Lingual Alignment:** The dataset includes cross-lingual negatives (e.g., a Python query retrieving similar PHP code), helping models learn language-agnostic semantic features.

## Licensing

This dataset adheres to the licensing terms of the original **CodeSearchNet** dataset (MIT/Permissive). Users should verify the specific licensing requirements of individual code snippets if they are used for commercial code generation.

## Citation

If you use this dataset, please cite Lumees AI and the original CodeSearchNet paper:

```bibtex
@misc{lumees2025hardnegatives,
  author = {Hasan KURŞUN and Kerem Berkay YANIK},
  title = {CodeSearchNet Hard Negatives (Filtered)},
  year = {2025},
  publisher = {Lumees AI},
  howpublished = {\url{https://lumees.io}},
  email = {hello@lumees.io}
}

@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```
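The filtration and selection steps described under Mining Process can be illustrated with a minimal pure-Python sketch. This is not the authors' mining code: the function name `select_hard_negatives`, its parameters, and the `(code, score)` candidate format are hypothetical, and self-exclusion is approximated here by exact string comparison with the positive. Only the thresholds (0.95, 0.35) and the cap of 12 negatives come from the card.

```python
from typing import List, Tuple

# Values stated in the card; the constant names are illustrative.
MAX_SIM = 0.95        # scores above this are treated as likely false negatives
MIN_SIM = 0.35        # scores below this are treated as too easy
MAX_NEGATIVES = 12    # at most this many negatives are kept per query


def select_hard_negatives(
    pos: str,
    candidates: List[Tuple[str, float]],
    min_sim: float = MIN_SIM,
    max_sim: float = MAX_SIM,
    max_negatives: int = MAX_NEGATIVES,
) -> Tuple[List[str], List[float]]:
    """Filter retrieved (code, score) pairs into hard negatives (hypothetical helper)."""
    neg: List[str] = []
    scores: List[float] = []
    # Visit candidates from most to least similar to the query.
    for code, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if code == pos:
            continue  # assumption: self-exclusion and duplicate removal by exact match
        if score > max_sim or score < min_sim:
            continue  # outside the [min_sim, max_sim] band
        neg.append(code)
        scores.append(score)
        if len(neg) == max_negatives:
            break
    return neg, scores


if __name__ == "__main__":
    retrieved = [
        ("def a(): ...", 0.97),    # above 0.95: dropped
        ("def pos(): ...", 0.90),  # identical to the positive: dropped
        ("def b(): ...", 0.75),    # kept
        ("def c(): ...", 0.20),    # below 0.35: dropped
    ]
    print(select_hard_negatives("def pos(): ...", retrieved))
    # (['def b(): ...'], [0.75])
```

The two returned lists line up index by index, mirroring the `neg` and `scores` fields of each record.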

Provider:

lumees
