
electrocampbell/nebula-8lang-203k

Hugging Face | Updated 2026-04-12 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/electrocampbell/nebula-8lang-203k
Resource overview:
---
license: apache-2.0
language:
- en
- code
task_categories:
- translation
- text-generation
tags:
- code
- code-translation
- nebula
size_categories:
- 100K<n<1M
---

# nebula-8lang-203k

Training pairs for fine-tuning code translation models on [Nebula](https://github.com/colinc86/nebula), a universal code intermediate language. Each example is a (Nebula → target language) pair across 8 languages: Python, JavaScript, TypeScript, Go, Swift, Kotlin, Rust, C.

## Pipeline

Source code is harvested from [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) (and [The Stack](https://huggingface.co/datasets/bigcode/the-stack) for Swift), parsed into individual functions, then converted to Nebula via the Nebula compiler's `from-{lang}` ingesters. Pairs that fail validation (trivial, error markers, length filters) are dropped.

## Format

Each line in `train.jsonl` / `val.jsonl` is a chat-formatted SFT example:

```json
{"messages": [
  {"role": "system", "content": "You are a code translator. Given code in Nebula (a universal intermediate language), produce the equivalent idiomatic <Language> code. Output only the <Language> code, no explanations."},
  {"role": "user", "content": "<nebula source>"},
  {"role": "assistant", "content": "<target language source>"}
]}
```

## Sizes

| Split | Examples |
|---|---|
| Train | 203,336 |
| Val | ~22,600 |
| Per language | ~30,000 (8 langs) |

~30% of examples are multi-function programs (vs the single-function pairs in `nebula-8lang-68k`). Split: 90% train / 10% val.

## Models trained on this dataset

- [`electrocampbell/nebula-8lang-14b`](https://huggingface.co/electrocampbell/nebula-8lang-14b)

## License

Apache 2.0. Source data is from StarCoderData / The Stack, used under their respective licenses (permissively-licensed code only).
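
As a rough illustration of the chat format described under Format, the sketch below splits one JSONL line into its system prompt, Nebula source, and target-language source. The helper names `parse_example` and `read_pairs` are hypothetical and not part of the dataset or the Nebula tooling. The sample line reuses the card's own placeholders rather than real Nebula code. It assumes each record carries exactly one message per role, as in the example above.

```python
import json
from typing import Dict, Iterator


def parse_example(line: str) -> Dict[str, str]:
    """Split one chat-formatted JSONL line into its three parts.

    Assumes one message per role (system, user, assistant), as shown
    in the dataset card's example record.
    """
    record = json.loads(line)
    by_role = {msg["role"]: msg["content"] for msg in record["messages"]}
    return {
        "system": by_role["system"],
        "nebula": by_role["user"],        # Nebula source (model input)
        "target": by_role["assistant"],   # target-language source (model output)
    }


def read_pairs(path: str) -> Iterator[Dict[str, str]]:
    """Yield parsed examples from a local JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_example(line)


# Hypothetical sample line built from the placeholders in the card,
# not an actual record from the dataset.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a code translator."},
        {"role": "user", "content": "<nebula source>"},
        {"role": "assistant", "content": "<target language source>"},
    ]
})

pair = parse_example(sample_line)
print(pair["nebula"])
print(pair["target"])
```

With a local copy of the data, `for pair in read_pairs("train.jsonl"): ...` would iterate over the training split one example at a time without loading the whole file into memory.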
Provider:
electrocampbell