
MnemicAI/Ling-Coder-SFT-English-Clean

Source: Hugging Face · Updated: 2026-04-15 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/MnemicAI/Ling-Coder-SFT-English-Clean
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- code
- coding
- sft
- instruction-tuning
- english-only
- cleaned
- curated
pretty_name: "Ling-Coder SFT English Clean"
size_categories:
- 1M<n<5M
source_datasets:
- inclusionAI/Ling-Coder-SFT
dataset_info:
  features:
  - name: messages
    dtype: string
  - name: languages
    dtype: string
  - name: license
    dtype: string
  - name: difficulty
    dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: "**/*.parquet"
---

# Ling-Coder-SFT-English-Clean

A cleaned, English-only version of [inclusionAI/Ling-Coder-SFT](https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT), one of the largest open-source coding instruction datasets (~5.1M samples). Split by programming language for easy access.

**Curated by [MnemicAI](https://huggingface.co/MnemicAI)**

## Origin Story

While building our [Mnemic COCM-COT](https://huggingface.co/MnemicAI) training pipeline (a multi-language coding instruction dataset with stratified topic sampling), we discovered that **11.44% of Ling-Coder-SFT rows contain Chinese/CJK characters** mixed into what many assume is an English-only coding dataset.

This wasn't a bug in the original dataset. InclusionAI intentionally built it as bilingual (English + Chinese) for their Ling-Coder-Lite model. But for anyone fine-tuning an **English-only** code model, these 585,600 samples silently contaminate the training data and can degrade model quality.

Since we were already scanning 5.1M rows for our own pipeline, we figured: why not clean the whole thing and share it? It took 25 minutes on a free Colab instance.

**If this saves you time, give it a ⭐ and build something great.**

## What Changed

We scanned all 5,119,470 rows and removed every row containing Chinese/CJK characters.

| Metric | Original | This Version |
|--------|----------|--------------|
| Total rows | 5,119,470 | **4,533,870** |
| Rows containing Chinese/CJK characters | 585,600 (11.44%) | **0** |
| Programming languages | 293 | 293 (unchanged) |
| Schema | Unchanged | Unchanged |
| License | Apache 2.0 | Apache 2.0 |

### Per-Language Breakdown (Top 25)

| Language | Rows | % of Dataset |
|----------|-----:|:-------------|
| Python | 3,156,935 | 69.6% |
| JavaScript | 116,446 | 2.6% |
| C# | 109,671 | 2.4% |
| C++ | 87,574 | 1.9% |
| Java | 82,112 | 1.8% |
| Go | 73,501 | 1.6% |
| Swift | 68,301 | 1.5% |
| Rust | 64,011 | 1.4% |
| TypeScript | 62,461 | 1.4% |
| PHP | 56,389 | 1.2% |
| D | 54,281 | 1.2% |
| R | 52,865 | 1.2% |
| Clojure | 51,098 | 1.1% |
| Bash | 51,018 | 1.1% |
| Lua | 45,281 | 1.0% |
| Haskell | 44,311 | 1.0% |
| Elixir | 43,639 | 1.0% |
| Scala | 41,572 | 0.9% |
| Julia | 41,408 | 0.9% |
| SQL | 40,939 | 0.9% |
| Ruby | 36,081 | 0.8% |
| Racket | 31,952 | 0.7% |
| C | 27,128 | 0.6% |
| Kotlin | 26,776 | 0.6% |
| HTML | 21,648 | 0.5% |

*...and 268 more languages including Dart, Solidity, Perl, Dockerfile, YAML, and others.*

## Dataset Structure

The dataset is organized by **programming language**, making it easy to download only what you need:

```
├── python/
│   ├── train-00000-of-00003.parquet
│   ├── train-00001-of-00003.parquet
│   └── train-00002-of-00003.parquet
├── java/
│   └── train-00000-of-00001.parquet
├── typescript/
│   └── train-00000-of-00001.parquet
├── rust/
│   └── train-00000-of-00001.parquet
├── go/
│   └── train-00000-of-00001.parquet
├── cpp/
│   └── train-00000-of-00001.parquet
├── csharp/
│   └── train-00000-of-00001.parquet
├── swift/
│   └── train-00000-of-00001.parquet
├── kotlin/
│   └── train-00000-of-00001.parquet
└── ... (20+ languages)
```

## Usage

### Load the full dataset

```python
from datasets import load_dataset

ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean")
```

### Load a specific language only

```python
from datasets import load_dataset

# Load only Python samples
python_ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean", data_dir="python")

# Load only Rust samples
rust_ds = load_dataset("MnemicAI/Ling-Coder-SFT-English-Clean", data_dir="rust")
```

### Example Row

```json
{
  "messages": [
    {"role": "user", "content": "Write a Python function to find the longest common subsequence..."},
    {"role": "assistant", "content": "Here's a dynamic programming solution..."}
  ],
  "languages": ["python"],
  "license": "MIT",
  "difficulty": "medium"
}
```

## Filtering Methodology

Our filtering pipeline uses a **zero-regex, C-level** approach for maximum speed:

```python
# Pre-built frozenset of 20,992 CJK codepoints (Unicode range U+4E00–U+9FFF)
_CJK_CHARS = frozenset(chr(c) for c in range(0x4e00, 0x9fff + 1))

def has_chinese(text):
    """C-level set disjointness check: no Python loops, no regex."""
    return not _CJK_CHARS.isdisjoint(str(text))
```

- **Every message** in each row is scanned (user + assistant turns)
- If **any** message contains CJK characters, the entire row is removed
- Processing speed: ~6,000 rows/second on a free Colab instance
- Total processing time: ~15 minutes for 5.1M rows

### What Gets Filtered

- ❌ Instructions written entirely in Chinese
- ❌ Responses containing Chinese explanations mixed with code
- ❌ Mixed English-Chinese bilingual samples
- ✅ Code comments in English (kept)
- ✅ All programming language syntax (kept)
- ✅ Unicode strings in code examples that aren't CJK (kept)

## Intended Use

This dataset is designed for:

- 🎯 **Fine-tuning English-only code models** (SFT stage)
- 🎯 **Building coding assistants** that don't need Chinese support
- 🎯 **Research** on code generation and instruction following
- 🎯 **Language-specific training** (grab just the language folder you need)

## Attribution & License

This dataset is a filtered derivative of [inclusionAI/Ling-Coder-SFT](https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT), licensed under **Apache License 2.0**.

**Changes made:**

- Removed 585,600 rows containing Chinese/CJK characters (11.44% of the original dataset)
- Split into per-programming-language subdirectories
- No other modifications to the data

**Original dataset citation:**

```bibtex
@misc{lingcoder2025,
  title={Ling-Coder: An Instruction-Tuned Code Large Language Model},
  author={InclusionAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/inclusionAI/Ling-Coder-SFT}
}
```

**Curated by:** [MnemicAI](https://huggingface.co/MnemicAI)

---
Provider:
MnemicAI