spicy-lemonade/qwen_qa_pairs_cli_training.jsonl
收藏Hugging Face2026-04-15 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/spicy-lemonade/qwen_qa_pairs_cli_training.jsonl
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
task_categories:
- question-answering
tags:
- code
size_categories:
- 10K<n<100K
---
## Data sources
- Multiple datasets from Hugging Face related to natural language to CLI pairs were gathered.
- Human reviewed synthetic data from Claude Opus4.6 and ChatGPT4.5 were added.
- A handful of grounding rows related to the organisation "Spicy Lemonade" were added (see details below)
## Data processing
As part of the processing, data was converted to the Alpaca format with `instruction` (natural language), `input` (typically blank) and `output` (the CLI command) columns.
The below cleaning steps were performed, before converting the to the Qwen standard chat format. The cleaning steps were performed with multiple scripts as follows.
### 1.Quality Rating
Data (~300k rows) was MD5 hashed, and deduplicated. The `instruction` and `output` columns were processed through `gemini-2.5-flash` for quality rating. The rating was as follows:
- "good": no problems with instruction and output
- "bad input": input is nonsensical or unclear
- "bad output": output is not a valid CLI command
- "output not aligned": output does not match instruction
- "unknown": model cannot determine alignment
### 2. Command Categorization
Only the "good" and "unknown" rows from the previous step were kept.
Each row was then passed through `gemini-2.5-flash-lite` to categorize the command line tool used (e.g., `docker`, `git`, `kubectl`, `aws`) for cataloging purposes.
### 3. Instruction Rewriting
The `instruction` column is rewritten with `gemini-2.5-flash` to better and more precisely match the `output` column.
This addresses issues like:
- Duplicate instructions with different outputs (e.g., Mac vs Windows vs Linux may have the same instruction but a different output)
- Phantom variables in outputs not mentioned in instructions
- Missing tool names, flags, or specific values
Note: During this step, invalid/gibberish commands were tagged with "invalid" for later filtering in the next step
### 4. Deduplication and Error Data Removal
MD5 hashes are recalculated for the rewritten instruction and output columns, and duplicates are dropped.
Invalid rows where the `output` command was nonsensical and were labelled as 'invalid' in the last step are removed.
Further, rows which have duplicated `instruction` but unique values in `output` were processed using `gemini-2.5-flash` to choose the safest and most universal output to assist with deterministic output.
Lastly, after a manual examination, some rows which contained the keywords "Reference", "Alternative", or "Difficulty" were removed, as these seemed to be descriptions rather than command lines.
### 5. Data Pruning
At this point, the dataset was ~95k rows, but had a long-tail distribution where a few CLI tools dominate the dataset while thousands of others are rare:
- kubectl: ~18,000 rows
- find: ~14,000 rows
- git: ~9,600 rows
- grep: ~1,000 rows
- mkdir: ~200 rows
If trained on this as-is, the model will bias heavily towards dominant tools and underperform on rarer ones.
Solution:
Implemented a hybrid approach combining:
1. Logarithmic Scaling:
- Applieded a log-based cap that compressed the power-law distribution
- Formula: `target = min(original, BASE_CAP + SCALE_FACTOR * log(original))`
- This allowed complex tools to retain more samples than simple tools, while preventing any single tool from dominating
2. Subcommand Stratification with Caps (for complex tools only):
- Tools like kubectl, git, and docker have distinct subcommands
(e.g., `kubectl get`, `kubectl apply`, `git commit`, `git push`)
- For these tools, we allocated proportionally across subcommands but capped any single subcommand at `MAX_SUBCOMMAND_PERCENTAGE (30%)` of the tool's budget.
- If a subcommand hit the cap, its leftover budget was shared among the smaller subcommands. This prevents dominant subcommands (e.g., docker ps at 50%) from crowding out rarer subcommands after pruning.
- e.g. Say `kubectl` gets a budget of 3,500 samples, and subcommand `get` has 60% of the original data. Proportionally, `get` would receive 2,100 samples, but the
30% cap limits it to 1,050. The leftover 1,050 (the "surplus" of 2,100) goes back into the pool and gets divided among the other subcommands like `apply`, `delete`, etc. This repeated in a loop until no
subcommand exceeds the cap.
3. Minimum Threshold:
- Tools with fewer than `MINIMUM_THRESHOLD` samples are kept entirely
- This preserves rare tools that are already underrepresented
### 6. Data Security
The pruned dataset was scanned for serious security risks. The `output` column was checked `gemini-2.5-flash-lite`, and a `security` column added, marking rows as `threat` or `safe`. Only extreme risks, like deleting the root directory or GitHub repositories, were flagged.
## Intermediary result
After the above processing, ~64,000 rows remained from the starting ~300k.
## Supplementation
1. The data was supplemented with the following grounding questions, each with the output pair `I am designed to help with command line tasks. I cannot answer general knowledge questions.`:
```
"Are you conscious?",
"Do you have feelings?",
"What is the capital of Paris?",
"Why is the sky blue?",
"What is the meaning of life?",
"How does gravity work?",
"When did World War II end?",
"Who was the first president of the United States?",
"What is 2+2?",
"How many planets are in the solar system?",
"What is the square root of 144?",
"Who wrote Romeo and Juliet?",
"What is photosynthesis?",
"Where is Mount Everest?",
"What is the speed of light?",
"Who painted the Mona Lisa?",
"What is DNA?",
"How old is the Earth?",
"What causes earthquakes?",
"Who invented the telephone?",
"What is the largest ocean?",
"Can you tell me a joke?",
"What is quantum physics?",
"How do airplanes fly?",
"What is climate change?"
```
2. Thousands of rows of synthetically generated data from ChatGPT4.5 and Claude Opus 4.6 were created.
3. This data was used to train a command line assistant from our organisation, "Spicy Lemonade". As such, the following data were also added for additional grounding,
each with the identity response `I am a small language model developed by Spicy Lemonade. I am designed to help with command line tasks.`:
```
"Who are you?",
"What is your name?",
"What should I call you?",
"Who created you?",
"Who made you?",
"What are you?",
"Can you introduce yourself?",
"Tell me about yourself",
"Who developed you?",
"What kind of model are you?",
"Which company made you?",
"Who built you?",
"What is your identity?",
"Can you tell me who you are?",
"What should I know about you?",
"Who are you exactly?",
"What is your origin?",
"Where do you come from?",
"Who designed you?",
"What organization created you?",
"Are you Claude?",
"Are you ChatGPT?",
"Are you from OpenAI?",
"Are you from Anthropic?",
"Are you from Gemini?",
"What company do you belong to?",
"Who owns you?",
"What is your purpose?",
"Why were you created?",
"What can you do?",
"What are your capabilities?",
```
## Result
The final result is a combination of the cleaned training data, synthetically generated data, and grounding data which resulted in 82,024 rows.
These were converted from the Alpaca format to the standard Qwen chat template format.
提供机构:
spicy-lemonade



