benjamintli/codesearchnet_synthetic_queries_100k
Hugging Face · updated 2026-03-20 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/benjamintli/codesearchnet_synthetic_queries_100k
Resource summary:
---
language:
- code
tags:
- code-search
- code-retrieval
- semantic-search
- synthetic-data
- contrastive-learning
- embedding
- information-retrieval
- code-understanding
task_categories:
- text-retrieval
- feature-extraction
task_ids:
- document-retrieval
size_categories:
- 10K<n<100K
license: apache-2.0
dataset_info:
  features:
  - name: code
    dtype: large_string
  - name: docstring
    dtype: large_string
  - name: language
    dtype: large_string
  - name: scenario
    dtype: string
  - name: query
    dtype: string
  splits:
  - name: train
    num_bytes: 133859463
    num_examples: 100000
  - name: test
    num_bytes: 15778694
    num_examples: 10000
  - name: valid
    num_bytes: 14473584
    num_examples: 10000
  download_size: 83871385
  dataset_size: 164111741
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: valid
    path: data/valid-*
---
# Dataset Card: Synthetic Code Search Queries
## Overview
A synthetic dataset of (query, code) pairs for training code search and retrieval models. Queries are designed to reflect realistic developer search behavior — how someone would search for code they haven't seen yet — rather than paraphrased docstrings.
## Motivation
CodeSearchNet is a dataset of docstring and code pairs. The problem is that not all of its docstrings reflect how developers actually phrase their intent when they look for code in a codebase. CoSQA does a better job of this (since its queries are Bing queries, I think), but it is quite small compared to CodeSearchNet.
This dataset addresses that gap through a two-call generation pipeline that structurally prevents the model from paraphrasing source code.
## Source Data
- **Code corpus**: 100k functions sampled from CodeSearchNet
- **Languages**: Multiple (Python, Java, JavaScript, Go, PHP, Ruby)
- **Preprocessing**: CodeSearchNet was proportionally sampled to 100k
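The card does not say what the sampling was proportional to. Assuming it means each language keeps its share of the corpus, a minimal sketch looks like the following. The function name, the `language` key, and the fixed seed are illustrative choices, not the author's script.

```python
import random
from collections import defaultdict


def proportional_sample(rows, target_size, key="language", seed=0):
    """Sample about target_size rows, keeping each group's share of the corpus.

    Assumption: "proportional" means proportional to per-language counts.
    Rounding can leave the result a few rows off target_size.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)
    total = len(rows)
    sample = []
    for members in groups.values():
        quota = round(target_size * len(members) / total)
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

With a corpus that is 60% Python and 40% Go, a target of 10 rows yields 6 Python rows and 4 Go rows.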
## Generation Pipeline
### Two-Call Architecture
The key design decision is splitting query generation into two sequential LLM calls.
If I give an LLM a prompt with the docstring and code and ask it to generate a query, it's likely to return a lightly paraphrased docstring. This isn't helpful because it doesn't enrich the dataset (e.g. adding "how do I..." in front of the docstring isn't adding much).
#### Call 1: Code → Scenario
- **Input**: Code snippet with docstrings/comments stripped
- **Output**: 2-3 sentence scenario describing a realistic situation where a developer would need this functionality
- **Purpose**: Translates implementation details into problem-space language
- **Settings**: temperature 0.7, max_tokens 256
**Prompt:**
```
You are analyzing a code snippet to understand what real-world problem it solves.
Describe a specific, realistic scenario where a software developer would need
this functionality. Focus on:
- What they are building or fixing
- Why they need this specific behavior
- What went wrong or what feature they're implementing
Do NOT describe what the code does. Describe the SITUATION that leads someone
to need it.
Be specific and concrete — name the kind of project, the task, the context.
2-3 sentences max.
```
Few-shot examples were included to calibrate specificity (e.g., ETL retry logic, nested JSON flattening).
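The card does not describe how docstrings and comments were stripped before Call 1. For the Python portion only, one possible approach (an assumption, not necessarily what was used here) is to round-trip the source through the standard `ast` module: comments never enter the syntax tree, and docstring nodes can be dropped explicitly. This needs Python 3.9+ for `ast.unparse`, and it also normalizes formatting as a side effect.

```python
import ast


def strip_docstrings_and_comments(source: str) -> str:
    """Return Python source with docstrings and comments removed (sketch)."""
    tree = ast.parse(source)
    owners = (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if not isinstance(node, owners):
            continue
        body = node.body
        # A docstring is a bare string expression in first position.
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Keep the body non-empty so the result is still valid Python.
            node.body = body[1:] or [ast.Pass()]
    # Comments are not part of the AST, so unparsing drops them.
    return ast.unparse(tree)
```

The other five languages would need their own parsers or a language-agnostic tool; this sketch does not cover them.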
#### Call 2: Scenario → Query
- **Input**: Only the scenario from Call 1 (no code, no function names, no docstrings)
- **Output**: 3-15 word search query
- **Purpose**: Since the model cannot see the code, it cannot paraphrase it
- **Settings**: temperature 0.3, max_tokens 64, stop on newline
**Prompt:**
```
You are a software developer searching for code to help with a task. Based on
the situation described below, write what you would type into a code search
tool to find a solution.
Rules:
- Write ONLY the search query, nothing else
- 3-15 words
- Fragments are fine ("retry with backoff", "flatten nested dict")
- Skip language tags unless the query would be ambiguous without one
- Vary style naturally: sometimes a question, sometimes keywords, sometimes
a phrase
```
Few-shot examples were included to calibrate query style.
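The two calls can be sketched as one small function. This is an illustration under stated assumptions, not the author's generation script: `complete` is a hypothetical callable standing in for whatever LLM client is used, and the two instruction strings are abbreviated stand-ins for the full prompts and few-shot examples above. Only the sampling settings come from this card.

```python
from typing import Callable, Optional, Sequence

# Hypothetical LLM interface: (prompt, temperature, max_tokens, stop) -> text.
CompleteFn = Callable[[str, float, int, Optional[Sequence[str]]], str]

# Abbreviated stand-ins for the full prompts shown in this card.
SCENARIO_INSTRUCTIONS = "Describe the SITUATION that leads someone to need this code."
QUERY_INSTRUCTIONS = "Write ONLY the search query, 3-15 words."


def generate_query(stripped_code: str, complete: CompleteFn) -> dict:
    """Run Call 1 (code -> scenario) and Call 2 (scenario -> query)."""
    # Call 1 sees the code, with docstrings and comments already removed.
    scenario = complete(
        f"{SCENARIO_INSTRUCTIONS}\n\n{stripped_code}", 0.7, 256, None
    ).strip()
    # Call 2 sees only the scenario, never the code.
    query = complete(
        f"{QUERY_INSTRUCTIONS}\n\n{scenario}", 0.3, 64, ["\n"]
    ).strip()
    return {"scenario": scenario, "query": query}
```

The code is passed to the first prompt only, so nothing from the implementation can reach the second call except through the scenario text.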
### Why Two Calls?
Single-call approaches (code → query) consistently produced queries that paraphrase the docstring or the code itself, regardless of prompting strategy. We tested:
1. **Direct generation with docstrings**: Queries were near-reformulations of docstrings
2. **Direct generation without docstrings**: Queries described the code's behavior rather than the developer's need
3. **Constrained generation** (no reuse of function/parameter names): Model ignored constraints or drew from code logic and error messages instead
The two-call approach works because Call 2 **physically cannot access** implementation vocabulary — the scenario acts as a lossy compression that preserves intent but strips code-specific language.
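One rough way to check this property on individual rows is to measure how many query words also appear in the code's vocabulary. The metric below is an illustrative assumption, not something reported in this card. It splits identifiers on camelCase and snake_case boundaries and counts exact word matches, so language keywords and string literals also count as code vocabulary.

```python
import re


def code_vocabulary(code: str) -> set:
    """Lowercased word pieces from code, splitting camelCase and snake_case."""
    pieces = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", code)
    return {piece.lower() for piece in pieces}


def query_overlap(query: str, code: str) -> float:
    """Fraction of query words that also occur in the code's vocabulary."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return 0.0
    vocab = code_vocabulary(code)
    return sum(word in vocab for word in words) / len(words)
```

A query that merely restates the function name scores close to 1.0, while a query phrased in problem-space language should score much lower.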
## Generation Model
- **Model**: Qwen3.5-9B (dense)
- **Inference**: vLLM with FP8 quantization, prefix caching enabled
- **Hardware**: NVIDIA RTX Pro 6000
The 9B dense model was chosen over the larger Qwen3.5-35B-A3B (MoE) after testing showed negligible quality difference for this task. The dense architecture provides significantly better throughput on vLLM for batch generation.
## Dataset Fields
| Field | Description |
|---|---|
| `code` | Original code snippet |
| `query` | Synthetic search query generated via the two-call pipeline |
| `scenario` | Intermediate scenario from Call 1 (useful for debugging and regeneration) |
| `language` | Programming language |
| `docstring` | Original docstring (not used in generation, included for comparison) |
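A minimal loading sketch, assuming the Hugging Face `datasets` library is installed. The repository ID and the split names (`train`, `test`, `valid`) come from this card's metadata; the helper names are illustrative, and the exact strings stored in the `language` column (for example `"python"`) are an assumption to verify against the data.

```python
def load_split(split="train"):
    """Load one split from the Hub (requires `pip install datasets`)."""
    from datasets import load_dataset

    return load_dataset(
        "benjamintli/codesearchnet_synthetic_queries_100k", split=split
    )


def to_pairs(rows, language=None):
    """Turn rows into (query, code) pairs, optionally for one language."""
    return [
        (row["query"], row["code"])
        for row in rows
        if language is None or row["language"] == language
    ]
```

For example, `to_pairs(load_split("valid"), language="python")` would give the validation pairs for one language, provided the column uses that spelling.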
## Known Limitations
- **Hallucinated scenarios**: For ambiguous or domain-specific code, the model sometimes invents plausible but inaccurate contexts. The query may still be usable for training even when the scenario is wrong, but relevance is not guaranteed in these cases.
- **Query length distribution**: Queries tend toward the longer/more specific end (8-12 words). Real developer search behavior has more variance, including very short keyword queries.
- **Scenario length**: Some scenarios exceed the requested 2-3 sentences. This does not affect query quality but represents wasted generation tokens.
- **Domain-specific code**: Generic utility functions produce better scenarios and queries than niche library internals where the code alone doesn't convey sufficient context.
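The query length skew can be inspected with a simple whitespace word count. This is a sketch for checking the distribution on whatever subset you load; the helper names are illustrative, and the 8-12 range passed in the usage note is the one quoted above.

```python
from collections import Counter


def query_length_histogram(queries):
    """Map word count -> number of queries with that many words."""
    return Counter(len(query.split()) for query in queries)


def share_in_range(queries, low, high):
    """Fraction of queries whose word count lies in [low, high]."""
    if not queries:
        return 0.0
    hits = sum(low <= len(query.split()) <= high for query in queries)
    return hits / len(queries)
```

Calling `share_in_range(queries, 8, 12)` on the `query` column gives the fraction in the longer band, which can guide whether to mix in shorter keyword queries from another source.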
## Intended Use
- Fine-tuning code embedding models for semantic code search
- Contrastive training with hard negative mining
- Benchmarking code retrieval systems
- Augmenting existing code search datasets with higher-quality queries
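For the contrastive use case, a common starting point is an InfoNCE-style loss with in-batch negatives: each query's paired code is the positive and every other code in the batch is a negative. The pure-Python sketch below illustrates that objective on precomputed embedding vectors. It does not implement hard negative mining, and the temperature of 0.05 is an arbitrary illustrative default rather than a value from this card.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def in_batch_contrastive_loss(query_vecs, code_vecs, temperature=0.05):
    """Mean InfoNCE loss where row i of each input forms a positive pair."""
    queries = [_normalize(v) for v in query_vecs]
    codes = [_normalize(v) for v in code_vecs]
    total = 0.0
    for i, query in enumerate(queries):
        # Cosine similarity to every code in the batch, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(query, code)) / temperature
            for code in codes
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)
```

In practice the same computation is done on tensors in a deep learning framework so that gradients flow back into the embedding model; the loop form here is only meant to make the objective explicit.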
## Citation
If you use this dataset, please cite it appropriately and link back to this repository.
Provider:
benjamintli



