benjamintli/codesearchnet_synthetic_queries_100k

Name: benjamintli/codesearchnet_synthetic_queries_100k
Creator: benjamintli
Published: 2026-03-20 18:39:02
License: apache-2.0

Source: Hugging Face. Updated: 2026-03-20. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/benjamintli/codesearchnet_synthetic_queries_100k

Description:

---
language:
- code
tags:
- code-search
- code-retrieval
- semantic-search
- synthetic-data
- contrastive-learning
- embedding
- information-retrieval
- code-understanding
task_categories:
- text-retrieval
- feature-extraction
task_ids:
- document-retrieval
size_categories:
- 10K<n<100K
license: apache-2.0
dataset_info:
  features:
  - name: code
    dtype: large_string
  - name: docstring
    dtype: large_string
  - name: language
    dtype: large_string
  - name: scenario
    dtype: string
  - name: query
    dtype: string
  splits:
  - name: train
    num_bytes: 133859463
    num_examples: 100000
  - name: test
    num_bytes: 15778694
    num_examples: 10000
  - name: valid
    num_bytes: 14473584
    num_examples: 10000
  download_size: 83871385
  dataset_size: 164111741
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: valid
    path: data/valid-*
---

# Dataset Card: Synthetic Code Search Queries

## Overview

A synthetic dataset of (query, code) pairs for training code search and retrieval models. Queries are designed to reflect realistic developer search behavior (how someone would search for code they haven't seen yet) rather than paraphrased docstrings.

## Motivation

CodeSearchNet is a dataset of docstring and code pairs. The problem is that not all of its docstrings accurately reflect the intent developers have when they look for code in a codebase. CoSQA does a better job of this (since its queries come from Bing, I think), but it is quite small compared to CodeSearchNet. This dataset addresses that gap through a two-call generation pipeline that structurally prevents the model from paraphrasing source code.

## Source Data

- **Code corpus**: 100k functions sampled from CodeSearchNet
- **Languages**: Multiple (Python, Java, JavaScript, Go, PHP, Ruby)
- **Preprocessing**: CodeSearchNet was proportionally sampled down to 100k

## Generation Pipeline

### Two-Call Architecture

The key design decision is splitting query generation into two sequential LLM calls. If I give an LLM a prompt with the docstring and code and ask it to generate a query, it is likely to produce a lightly paraphrased docstring. This isn't helpful because it doesn't really enrich the dataset (e.g. adding "how do I..." in front of the docstring isn't adding much).

#### Call 1: Code → Scenario

- **Input**: Code snippet with docstrings/comments stripped
- **Output**: 2-3 sentence scenario describing a realistic situation where a developer would need this functionality
- **Purpose**: Translates implementation details into problem-space language
- **Settings**: temperature 0.7, max_tokens 256

**Prompt:**

```
You are analyzing a code snippet to understand what real-world problem it solves.

Describe a specific, realistic scenario where a software developer would need this functionality. Focus on:
- What they are building or fixing
- Why they need this specific behavior
- What went wrong or what feature they're implementing

Do NOT describe what the code does. Describe the SITUATION that leads someone to need it.
Be specific and concrete — name the kind of project, the task, the context.
2-3 sentences max.
```

Few-shot examples were included to calibrate specificity (e.g., ETL retry logic, nested JSON flattening).

#### Call 2: Scenario → Query

- **Input**: Only the scenario from Call 1 (no code, no function names, no docstrings)
- **Output**: 3-15 word search query
- **Purpose**: Since the model cannot see the code, it cannot paraphrase it
- **Settings**: temperature 0.3, max_tokens 64, stop on newline

**Prompt:**

```
You are a software developer searching for code to help with a task.
Based on the situation described below, write what you would type into a code search tool to find a solution.

Rules:
- Write ONLY the search query, nothing else
- 3-15 words
- Fragments are fine ("retry with backoff", "flatten nested dict")
- Skip language tags unless the query would be ambiguous without one
- Vary style naturally: sometimes a question, sometimes keywords, sometimes a phrase
```

Few-shot examples were included to calibrate query style.

### Why Two Calls?

Single-call approaches (code → query) consistently produced queries that paraphrase the docstring or the code, regardless of prompting strategy. We tested:

1. **Direct generation with docstrings**: Queries were near-reformulations of docstrings
2. **Direct generation without docstrings**: Queries described the code's behavior rather than the developer's need
3. **Constrained generation** (no reuse of function/parameter names): Model ignored constraints or drew from code logic and error messages instead

The two-call approach works because Call 2 **physically cannot access** implementation vocabulary: the scenario acts as a lossy compression that preserves intent but strips code-specific language.

## Generation Model

- **Model**: Qwen3.5-9B (dense)
- **Inference**: vLLM with FP8 quantization, prefix caching enabled
- **Hardware**: NVIDIA RTX Pro 6000

The 9B dense model was chosen over the larger Qwen3.5-35B-A3B (MoE) after testing showed negligible quality difference for this task. The dense architecture provides significantly better throughput on vLLM for batch generation.

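The two-call flow from the Generation Pipeline section can be sketched in a few lines of Python. This is a minimal illustration, not the author's pipeline code: the `Complete` interface, `generate_pair`, `fake_complete`, and the shortened prompt constants are hypothetical names introduced here. Only the sampling settings (temperature 0.7 and max_tokens 256 for Call 1; temperature 0.3, max_tokens 64, and stop on newline for Call 2) are taken from the card. In a real run, the callable would presumably wrap a request to the vLLM server.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical completion interface, not a real library API:
# (system_prompt, user_text, temperature, max_tokens, stop) -> generated text.
Complete = Callable[[str, str, float, int, Optional[List[str]]], str]

# Shortened placeholders standing in for the full prompts quoted in the card.
SCENARIO_PROMPT = "Describe the SITUATION that leads someone to need this code."
QUERY_PROMPT = "Write ONLY the search query you would type, 3-15 words."


def generate_pair(stripped_code: str, complete: Complete) -> Dict[str, str]:
    """Run Call 1 (code -> scenario), then Call 2 (scenario -> query)."""
    # Call 1 sees the code, with docstrings and comments already stripped.
    scenario = complete(SCENARIO_PROMPT, stripped_code, 0.7, 256, None).strip()
    # Call 2 receives only the scenario text, never the code itself.
    query = complete(QUERY_PROMPT, scenario, 0.3, 64, ["\n"]).strip()
    return {"scenario": scenario, "query": query}


def fake_complete(system_prompt, user_text, temperature, max_tokens, stop):
    """Canned stand-in for a model so the sketch runs offline."""
    if system_prompt == SCENARIO_PROMPT:
        return "A developer building an ETL job needs failed API calls retried."
    # Record what Call 2 was shown, to check that no code leaked through.
    fake_complete.last_call2_input = user_text
    return "retry failed api call with backoff\n"


# Illustrative input only; not a row from the dataset.
code = (
    "def _rt(fn, n):\n"
    "    for i in range(n):\n"
    "        try:\n"
    "            return fn()\n"
    "        except IOError:\n"
    "            pass"
)
result = generate_pair(code, fake_complete)
print(result["query"])
```

Keeping the model behind a plain callable makes the isolation property easy to check: the only text handed to the second call is the scenario string returned by the first.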
## Dataset Fields

| Field | Description |
|---|---|
| `code` | Original code snippet |
| `query` | Synthetic search query generated via the two-call pipeline |
| `scenario` | Intermediate scenario from Call 1 (useful for debugging and regeneration) |
| `language` | Programming language |
| `docstring` | Original docstring (not used in generation, included for comparison) |

## Known Limitations

- **Hallucinated scenarios**: For ambiguous or domain-specific code, the model sometimes invents plausible but inaccurate contexts. The query may still be usable for training even when the scenario is wrong, but relevance is not guaranteed in these cases.
- **Query length distribution**: Queries tend toward the longer/more specific end (8-12 words). Real developer search behavior has more variance, including very short keyword queries.
- **Scenario length**: Some scenarios exceed the requested 2-3 sentences. This does not affect query quality but represents wasted generation tokens.
- **Domain-specific code**: Generic utility functions produce better scenarios and queries than niche library internals where the code alone doesn't convey sufficient context.

## Intended Use

- Fine-tuning code embedding models for semantic code search
- Contrastive training with hard negative mining
- Benchmarking code retrieval systems
- Augmenting existing code search datasets with higher-quality queries

## Citation

If you use this dataset, please cite it appropriately and link back to this repository.
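
As a rough sketch of the contrastive-training use case, the `query` and `code` fields can be turned into positive pairs, with the other code snippets in a batch serving as negatives. The helper names `to_pairs` and `in_batch_negatives`, the word-count filter, and the sample rows below are illustrative assumptions, not part of the dataset or its tooling.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def to_pairs(rows: Iterable[Row], max_query_words: int = 15) -> List[Tuple[str, str]]:
    """Turn dataset rows into (query, code) positives for contrastive training."""
    pairs = []
    for row in rows:
        query = row["query"].strip()
        # Drop empty queries and ones over the 15-word cap from the Call 2 prompt.
        if not query or len(query.split()) > max_query_words:
            continue
        pairs.append((query, row["code"]))
    return pairs


def in_batch_negatives(pairs: List[Tuple[str, str]], index: int) -> List[str]:
    """Code from the other pairs in the batch acts as negatives for pairs[index]."""
    _, positive = pairs[index]
    # Skip identical code so a duplicate positive is not treated as a negative.
    return [code for i, (_, code) in enumerate(pairs) if i != index and code != positive]


# Made-up rows in the dataset's field layout, for illustration only.
rows = [
    {"query": "retry with backoff", "code": "def retry(fn): ...", "language": "python"},
    {"query": "flatten nested dict", "code": "def flatten(d): ...", "language": "python"},
    {"query": "", "code": "def noop(): ...", "language": "python"},
]
pairs = to_pairs(rows)
negatives = in_batch_negatives(pairs, 0)
```

With the Hugging Face `datasets` library installed, rows in this shape would presumably come from `load_dataset("benjamintli/codesearchnet_synthetic_queries_100k", split="train")`. The `scenario` and `docstring` fields are ignored here, though they could help when inspecting questionable pairs.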

Provided by:

benjamintli
