MnemicAI/Ling-Coder-SFT-English-Clean
Source: Hugging Face (updated 2026-04-15, indexed 2026-04-26)
Download link: https://hf-mirror.com/datasets/MnemicAI/Ling-Coder-SFT-English-Clean
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- code
- coding
- sft
- instruction-tuning
- english-only
- cleaned
- curated
pretty_name: "Ling-Coder SFT English Clean"
size_categories:
- 1M<n<10M
source_datasets:
- inclusionAI/Ling-Coder-SFT
dataset_info:
  features:
  - name: messages
    dtype: string
  - name: languages
    dtype: string
  - name: license
    dtype: string
  - name: difficulty
    dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: "**/*.parquet"
---
# Ling-Coder-SFT-English-Clean
A cleaned, English-only version of [inclusionAI/Ling-Coder-SFT](https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT) — one of the largest open-source coding instruction datasets (~5.1M samples). Split by programming language for easy access.
**Curated by [MnemicAI](https://huggingface.co/MnemicAI)**
## Origin Story
While building our [Mnemic COCM-COT](https://huggingface.co/MnemicAI) training pipeline (a multi-language coding instruction dataset with stratified topic sampling), we discovered that **11.44% of the rows in Ling-Coder-SFT contain Chinese/CJK characters**, even though many assume it is an English-only coding dataset.
This wasn't a bug in the original dataset. InclusionAI intentionally built it as bilingual (English + Chinese) for their Ling-Coder-Lite model. But for anyone fine-tuning an **English-only** code model, these 585,600 samples silently contaminate your training data and can degrade model quality.
Since we were already scanning 5.1M rows for our own pipeline, we figured: why not clean the whole thing and share it? It took about 15 minutes on a free Colab instance.
**If this saves you time, give it a ⭐ and build something great.**
## What Changed
We scanned all 5,119,470 rows and removed every row containing Chinese/CJK characters, defined here as any codepoint in the CJK Unified Ideographs block (U+4E00–U+9FFF).
| Metric | Original | This Version |
|--------|----------|--------------|
| Total rows | 5,119,470 | **4,533,870** |
| Rows containing Chinese/CJK | 585,600 (11.44%) | **0** |
| Programming languages | 293 | 293 (unchanged) |
| Schema | Unchanged | Unchanged |
| License | Apache 2.0 | Apache 2.0 |
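The figures in the table can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers reported in this card (the variable names are ours), not a recount of the actual files:

```python
# Row counts as reported in this card.
ORIGINAL_ROWS = 5_119_470
REMOVED_ROWS = 585_600
PYTHON_ROWS = 3_156_935

# Rows left after filtering, and the share of rows that were removed.
remaining = ORIGINAL_ROWS - REMOVED_ROWS
removed_pct = 100 * REMOVED_ROWS / ORIGINAL_ROWS

# Per-language percentages are relative to the cleaned total.
python_pct = 100 * PYTHON_ROWS / remaining

print(remaining)              # 4533870
print(f"{removed_pct:.2f}%")  # 11.44%
print(f"{python_pct:.1f}%")   # 69.6%
```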
### Per-Language Breakdown (Top 25)
| Language | Rows | % of Dataset |
|----------|-----:|:-------------|
| Python | 3,156,935 | 69.6% |
| JavaScript | 116,446 | 2.6% |
| C# | 109,671 | 2.4% |
| C++ | 87,574 | 1.9% |
| Java | 82,112 | 1.8% |
| Go | 73,501 | 1.6% |
| Swift | 68,301 | 1.5% |
| Rust | 64,011 | 1.4% |
| TypeScript | 62,461 | 1.4% |
| PHP | 56,389 | 1.2% |
| D | 54,281 | 1.2% |
| R | 52,865 | 1.2% |
| Clojure | 51,098 | 1.1% |
| Bash | 51,018 | 1.1% |
| Lua | 45,281 | 1.0% |
| Haskell | 44,311 | 1.0% |
| Elixir | 43,639 | 1.0% |
| Scala | 41,572 | 0.9% |
| Julia | 41,408 | 0.9% |
| SQL | 40,939 | 0.9% |
| Ruby | 36,081 | 0.8% |
| Racket | 31,952 | 0.7% |
| C | 27,128 | 0.6% |
| Kotlin | 26,776 | 0.6% |
| HTML | 21,648 | 0.5% |
*...and 268 more languages including Dart, Solidity, Perl, Dockerfile, YAML, and others.*
## Dataset Structure
The dataset is organized by **programming language**, making it easy to download only what you need:
```
├── python/
│   ├── train-00000-of-00003.parquet
│   ├── train-00001-of-00003.parquet
│   └── train-00002-of-00003.parquet
├── java/
│   └── train-00000-of-00001.parquet
├── typescript/
│   └── train-00000-of-00001.parquet
├── rust/
│   └── train-00000-of-00001.parquet
├── go/
│   └── train-00000-of-00001.parquet
├── cpp/
│   └── train-00000-of-00001.parquet
├── csharp/
│   └── train-00000-of-00001.parquet
├── swift/
│   └── train-00000-of-00001.parquet
├── kotlin/
│   └── train-00000-of-00001.parquet
└── ... (20+ languages)
```
## Usage
### Load the full dataset
```python
from datasets import load_dataset
ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean")
```
### Load a specific language only
```python
from datasets import load_dataset
# Load only Python samples
python_ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean", data_dir="python")
# Load only Rust samples
rust_ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean", data_dir="rust")
```
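Directory names do not always match the language names in the breakdown table (for example `cpp/` for C++ and `csharp/` for C#). The sketch below combines several language folders into one training set. The mapping covers only the directories shown in the tree above; `DIR_FOR_LANGUAGE` and `load_languages` are hypothetical helper names, and the sketch assumes each folder loads with `split="train"` as declared in the card's config.

```python
# Display name -> directory name, limited to the folders shown in this card.
DIR_FOR_LANGUAGE = {
    "Python": "python",
    "Java": "java",
    "TypeScript": "typescript",
    "Rust": "rust",
    "Go": "go",
    "C++": "cpp",
    "C#": "csharp",
    "Swift": "swift",
    "Kotlin": "kotlin",
}


def load_languages(names, repo_id="MnemicAI/Ling-Coder-SFT-English-Clean"):
    """Load the given language folders and concatenate them into one Dataset.

    Raises KeyError for a language whose directory name is not listed above,
    rather than guessing it.
    """
    # Imported lazily so the mapping is usable without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    parts = [
        load_dataset(repo_id, data_dir=DIR_FOR_LANGUAGE[name], split="train")
        for name in names
    ]
    return concatenate_datasets(parts)
```

For example, `load_languages(["Rust", "Go"])` would download only the `rust/` and `go/` folders.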
### Example Row
```json
{
  "messages": [
    {"role": "user", "content": "Write a Python function to find the longest common subsequence..."},
    {"role": "assistant", "content": "Here's a dynamic programming solution..."}
  ],
  "languages": ["python"],
  "license": "MIT",
  "difficulty": "medium"
}
```
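The card metadata declares `messages` with dtype `string`, while the example row shows a list of role/content objects. We have not confirmed which form the parquet files use, so the sketch below accepts both, assuming that a string value is JSON-encoded. `get_messages` and `to_prompt_completion` are hypothetical helper names.

```python
import json


def get_messages(row):
    """Return the conversation as a list of {"role", "content"} dicts."""
    messages = row["messages"]
    # Assumption: when stored as a string, the field is JSON-encoded.
    if isinstance(messages, str):
        messages = json.loads(messages)
    return messages


def to_prompt_completion(row):
    """Flatten a row into one prompt string and one completion string."""
    messages = get_messages(row)
    prompt = "\n".join(m["content"] for m in messages if m["role"] == "user")
    completion = "\n".join(
        m["content"] for m in messages if m["role"] == "assistant"
    )
    return {"prompt": prompt, "completion": completion}
```

With the `datasets` library, `to_prompt_completion` could be passed to `Dataset.map` to produce prompt/completion columns for SFT.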
## Filtering Methodology
Our filtering pipeline uses a **zero-regex, C-level** approach for maximum speed:
```python
# Pre-built frozenset of 20,992 CJK codepoints (Unicode range U+4E00–U+9FFF)
_CJK_CHARS = frozenset(chr(c) for c in range(0x4E00, 0x9FFF + 1))


def has_chinese(text):
    """Return True if text contains a CJK codepoint (C-level set check, no regex)."""
    return not _CJK_CHARS.isdisjoint(str(text))
- **Every message** in each row is scanned (user + assistant turns)
- If **any** message contains CJK characters, the entire row is removed
- Processing speed: ~6,000 rows/second on a free Colab instance
- Total processing time: ~15 minutes for 5.1M rows
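The row-level rule described in the bullets above can be sketched as follows. This illustrates the rule and is not the exact script we ran: `row_is_clean` is a hypothetical name, and the sketch assumes `messages` is a list of dicts with a `content` key, as in the example row.

```python
_CJK_CHARS = frozenset(chr(c) for c in range(0x4E00, 0x9FFF + 1))


def has_chinese(text):
    """Return True if text contains any codepoint in U+4E00–U+9FFF."""
    return not _CJK_CHARS.isdisjoint(str(text))


def row_is_clean(row):
    """True when no message in the row (user or assistant) contains CJK."""
    return not any(has_chinese(m["content"]) for m in row["messages"])


rows = [
    {"messages": [
        {"role": "user", "content": "Reverse a string in Go."},
        {"role": "assistant", "content": "Use a rune slice..."},
    ]},
    {"messages": [
        {"role": "user", "content": "Reverse a string in Go."},
        {"role": "assistant", "content": "可以使用 rune 切片..."},
    ]},
]

# A single CJK message is enough to drop the whole row.
clean = [row for row in rows if row_is_clean(row)]
print(len(clean))  # 1
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`.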
### What Gets Filtered
- ❌ Instructions written entirely in Chinese
- ❌ Responses containing Chinese explanations mixed with code
- ❌ Mixed English-Chinese bilingual samples
- ✅ Code comments in English (kept)
- ✅ All programming language syntax (kept)
- ✅ Unicode strings in code examples that aren't CJK (kept)
## Intended Use
This dataset is designed for:
- 🎯 **Fine-tuning English-only code models** (SFT stage)
- 🎯 **Building coding assistants** that don't need Chinese support
- 🎯 **Research** on code generation and instruction following
- 🎯 **Language-specific training** (grab just the language folder you need)
## Attribution & License
This dataset is a filtered derivative of [inclusionAI/Ling-Coder-SFT](https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT), licensed under **Apache License 2.0**.
**Changes made:**
- Removed 585,600 rows containing Chinese/CJK characters (11.44% of the original dataset)
- Split into per-programming-language subdirectories
- No other modifications to the data
**Original dataset citation:**
```bibtex
@misc{lingcoder2025,
title={Ling-Coder: An Instruction-Tuned Code Large Language Model},
author={InclusionAI},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT}
}
```
**Curated by:** [MnemicAI](https://huggingface.co/MnemicAI)
---



