electrocampbell/nebula-8lang-203k
Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/electrocampbell/nebula-8lang-203k
Resource description:
---
license: apache-2.0
language:
- en
- code
task_categories:
- translation
- text-generation
tags:
- code
- code-translation
- nebula
size_categories:
- 100K<n<1M
---
# nebula-8lang-203k
Training pairs for fine-tuning code translation models on [Nebula](https://github.com/colinc86/nebula), a universal code intermediate language. Each example is a (Nebula → target language) pair across 8 languages: Python, JavaScript, TypeScript, Go, Swift, Kotlin, Rust, C.
## Pipeline
Source code is harvested from [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) (and [The Stack](https://huggingface.co/datasets/bigcode/the-stack) for Swift), parsed into individual functions, then converted to Nebula via the Nebula compiler's `from-{lang}` ingesters. Pairs that fail validation (trivial, error markers, length filters) are dropped.
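
The validation step can be pictured with a minimal sketch like the one below. The thresholds, the marker strings, and the `keep_pair` helper are all hypothetical and chosen only for illustration; the filters actually used to build this dataset are not specified in this card.

```python
# Hypothetical values: the real length limits and error markers are not documented here.
MIN_CHARS = 20
MAX_CHARS = 8000
ERROR_MARKERS = ("ERROR", "UNSUPPORTED")


def keep_pair(nebula: str, target: str) -> bool:
    """Return True if a (Nebula, target) pair passes the illustrative checks."""
    for text in (nebula, target):
        stripped = text.strip()
        # Drop trivial or oversized sides of the pair (length filter).
        if len(stripped) < MIN_CHARS or len(stripped) > MAX_CHARS:
            return False
    # Drop pairs whose Nebula side carries an (assumed) error marker.
    if any(marker in nebula for marker in ERROR_MARKERS):
        return False
    return True


pairs = [
    ("x", "y"),                    # too short on both sides
    ("a" * 50, "b" * 50),          # passes every check
    ("ERROR " + "a" * 50, "b" * 50),  # contains an error marker
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))
```

Keeping the checks in one predicate makes it easy to swap in different limits per target language if that turns out to be needed.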
## Format
Each line in `train.jsonl` / `val.jsonl` is a chat-formatted SFT example:
```json
{"messages": [
{"role": "system", "content": "You are a code translator. Given code in Nebula (a universal intermediate language), produce the equivalent idiomatic <Language> code. Output only the <Language> code, no explanations."},
{"role": "user", "content": "<nebula source>"},
{"role": "assistant", "content": "<target language source>"}
]}
```
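
As a rough illustration of how one line in this format might be consumed, the sketch below parses a single record and pulls out the Nebula source, the target source, and the target language. The sample record is made up, `parse_example` is a hypothetical helper, and the regular expression assumes that the `<Language>` placeholder in the system prompt is filled with the plain language name.

```python
import json
import re

# One illustrative record in the documented shape; the contents are placeholders.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": (
            "You are a code translator. Given code in Nebula (a universal "
            "intermediate language), produce the equivalent idiomatic Python code. "
            "Output only the Python code, no explanations."
        )},
        {"role": "user", "content": "<nebula source>"},
        {"role": "assistant", "content": "<target language source>"},
    ]
})


def parse_example(line: str) -> dict:
    """Split one JSONL line into language, Nebula source, and target source."""
    record = json.loads(line)
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    # Assumption: the language name sits between "idiomatic" and "code".
    match = re.search(r"equivalent idiomatic (.+?) code", by_role["system"])
    return {
        "language": match.group(1) if match else None,
        "nebula": by_role["user"],
        "target": by_role["assistant"],
    }


example = parse_example(sample_line)
print(example["language"])
```

To process a whole split, the same function can be applied to each line read from `train.jsonl` or `val.jsonl`.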
## Sizes
| Split | Examples |
|---|---|
| Train | 203,336 |
| Val | ~22,600 |
| Per language | ~30,000 (8 langs) |
About 30% of examples are multi-function programs, in contrast to the single-function pairs in `nebula-8lang-68k`.
Split: 90% train / 10% val.
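
As a quick consistency check on the numbers above, the short sketch below recomputes the train share from the table. The validation count is the approximate figure from the table, so the result is approximate too.

```python
train = 203_336      # exact count from the table
val_approx = 22_600  # approximate count from the table

total = train + val_approx
train_share = train / total

# Share of examples in the train split, to three decimals.
print(f"{train_share:.3f}")
```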
## Models trained on this dataset
- [`electrocampbell/nebula-8lang-14b`](https://huggingface.co/electrocampbell/nebula-8lang-14b)
## License
Apache 2.0. Source data is from StarCoderData / The Stack, used under their respective licenses (permissively-licensed code only).
Provider:
electrocampbell



