electrocampbell/nebula-8lang-203k

Name: electrocampbell/nebula-8lang-203k
Creator: electrocampbell
Published: 2026-04-12 02:30:24
License: not specified

Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/electrocampbell/nebula-8lang-203k


Description:

---
license: apache-2.0
language:
- en
- code
task_categories:
- translation
- text-generation
tags:
- code
- code-translation
- nebula
size_categories:
- 100K<n<1M
---

# nebula-8lang-203k

Training pairs for fine-tuning code translation models on [Nebula](https://github.com/colinc86/nebula), a universal code intermediate language. Each example is a (Nebula → target language) pair across 8 languages: Python, JavaScript, TypeScript, Go, Swift, Kotlin, Rust, C.

## Pipeline

Source code is harvested from [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) (and [The Stack](https://huggingface.co/datasets/bigcode/the-stack) for Swift), parsed into individual functions, then converted to Nebula via the Nebula compiler's `from-{lang}` ingesters. Pairs that fail validation (trivial, error markers, length filters) are dropped.

## Format

Each line in `train.jsonl` / `val.jsonl` is a chat-formatted SFT example:

```json
{"messages": [
  {"role": "system", "content": "You are a code translator. Given code in Nebula (a universal intermediate language), produce the equivalent idiomatic <Language> code. Output only the <Language> code, no explanations."},
  {"role": "user", "content": "<nebula source>"},
  {"role": "assistant", "content": "<target language source>"}
]}
```

## Sizes

| Split | Examples |
|---|---|
| Train | 203,336 |
| Val | ~22,600 |
| Per language | ~30,000 (8 langs) |

~30% of examples are multi-function programs (vs the single-function pairs in `nebula-8lang-68k`). Split: 90% train / 10% val.

## Models trained on this dataset

- [`electrocampbell/nebula-8lang-14b`](https://huggingface.co/electrocampbell/nebula-8lang-14b)

## License

Apache 2.0. Source data is from StarCoderData / The Stack, used under their respective licenses (permissively-licensed code only).
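The chat format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the `Pair` type, the `parse_pairs` helper, and the regex that pulls the target language out of the system prompt are all hypothetical, and the regex assumes every system prompt follows the template shown in the Format section.

```python
import json
import re
from typing import Iterable, Iterator, NamedTuple


class Pair(NamedTuple):
    """One training example, split into its three parts (hypothetical type)."""
    language: str  # target language named in the system prompt
    nebula: str    # user turn: Nebula source
    target: str    # assistant turn: target-language source


# Assumption: the system prompt matches the template from the Format section,
# so the language name sits between "idiomatic " and " code".
_LANG_RE = re.compile(r"idiomatic (.+?) code")


def parse_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield one Pair per non-empty JSONL line, checking the role order."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        messages = json.loads(line)["messages"]
        roles = [m["role"] for m in messages]
        if roles != ["system", "user", "assistant"]:
            raise ValueError(f"unexpected roles: {roles}")
        match = _LANG_RE.search(messages[0]["content"])
        language = match.group(1) if match else "unknown"
        yield Pair(language, messages[1]["content"], messages[2]["content"])
```

To use it, pass an open file handle, for example `with open("train.jsonl", encoding="utf-8") as f: pairs = list(parse_pairs(f))`. Because the function takes any iterable of strings, it can also be checked against a handful of in-memory lines before running over the full split.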

Provided by:

electrocampbell
