mesolitica/mixtral-magicoder
Source: Hugging Face. Updated 2024-09-30; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mesolitica/mixtral-magicoder
Resource description:
---
license: mit
task_categories:
- conversational
language:
- en
- ms
---
# Mixtral Magicoder: Source Code Is All You Need on various programming languages
We sampled source files for several programming languages from https://huggingface.co/datasets/bigcode/the-stack-dedup and pushed the samples to https://huggingface.co/datasets/malaysia-ai/starcoderdata-sample
After that, we generated instruction data with the [Magicoder: Source Code Is All You Need](https://github.com/ise-uiuc/magicoder) template, targeting at least 10k rows for each programming language.
1. C++, 10747 rows
2. C#, 10193 rows
3. CUDA, 13843 rows
4. Dockerfile, 13286 rows
5. Go, 10143 rows
6. Java, 11221 rows
7. JavaScript, 11758 rows
8. Kotlin, 12790 rows
9. PHP, 10176 rows
10. Python, excluding code that uses `pandas`, `sklearn`, `matplotlib`, or `plotly`, 10925 rows
11. Python, restricted to code that uses `pandas`, `sklearn`, `matplotlib`, or `plotly`, focused on data analytics, 53959 rows
12. Ruby, 10201 rows
13. Rust, 10271 rows
14. Scala, 10017 rows
15. Shell, 10848 rows
16. SQL, 27668 rows
17. Swift, 10187 rows
18. TypeScript, 14248 rows
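As a quick check on the list above, the following sketch sums the row counts and confirms that every subset meets the 10k target. The dictionary keys are informal labels chosen for this example, not identifiers from the dataset.

```python
# Row counts copied from the list above; the keys are informal labels.
ROWS = {
    "C++": 10747,
    "C#": 10193,
    "CUDA": 13843,
    "Dockerfile": 13286,
    "Go": 10143,
    "Java": 11221,
    "JavaScript": 11758,
    "Kotlin": 12790,
    "PHP": 10176,
    "Python (general)": 10925,
    "Python (data analytics)": 53959,
    "Ruby": 10201,
    "Rust": 10271,
    "Scala": 10017,
    "Shell": 10848,
    "SQL": 27668,
    "Swift": 10187,
    "TypeScript": 14248,
}


def total_rows(rows):
    """Return the total number of rows across all subsets."""
    return sum(rows.values())


def below_target(rows, target=10_000):
    """Return the labels of subsets with fewer rows than the target."""
    return [name for name, count in rows.items() if count < target]


if __name__ == "__main__":
    print("total rows:", total_rows(ROWS))
    print("subsets below 10k:", below_target(ROWS))
```

The two Python subsets together account for roughly a quarter of all rows, so the data leans towards Python and SQL.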
Source code at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/mixtral-magicoder
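The template step described above can be illustrated with a minimal sketch: take a few consecutive lines from a sampled source file as a seed, then wrap the seed in a prompt that asks a model for a self-contained problem and solution. The prompt wording, the 1 to 15 line window, and the function names `sample_seed` and `build_prompt` are assumptions made for illustration; the actual template lives in the repository linked above.

```python
import random

# Hypothetical prompt wording, not the exact template used for this dataset.
PROMPT_TEMPLATE = (
    "Please gain inspiration from the following random {language} code "
    "to create a high-quality programming problem.\n\n"
    "Code for inspiration:\n{seed}\n\n"
    "Present your output in two sections: [Problem Description] and [Solution]."
)


def sample_seed(source, min_lines=1, max_lines=15, rng=None):
    """Pick a random run of consecutive lines from a source file."""
    rng = rng or random.Random()
    lines = source.splitlines()
    if not lines:
        raise ValueError("source is empty")
    # Clamp the window so short files still yield a seed.
    low = min(min_lines, len(lines))
    high = min(max_lines, len(lines))
    n = rng.randint(low, high)
    start = rng.randint(0, len(lines) - n)
    return "\n".join(lines[start:start + n])


def build_prompt(seed, language):
    """Fill the (assumed) template with the seed and its language."""
    return PROMPT_TEMPLATE.format(language=language, seed=seed)


if __name__ == "__main__":
    example_source = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
    seed = sample_seed(example_source, rng=random.Random(0))
    print(build_prompt(seed, "Python"))
```

The resulting prompt would then be sent to the generation model, and the problem and solution sections of its reply become one row.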
## Precautions
1. The generated output is not validated.
2. Always filter out short answers.
## Filtered version
1. Dropped rows with short answers.
2. Dropped rows that contain the phrase `code snippet`.
Uploaded at [postfilter.jsonl](postfilter.jsonl).
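The two filtering rules can be sketched as follows. This is only an illustration under stated assumptions: the field name `answer`, the 100-character threshold, and the choice to search every string field for the phrase are hypothetical, since the card gives neither the schema nor the exact cutoff.

```python
import json

MIN_ANSWER_CHARS = 100          # hypothetical cutoff for a "short" answer
BANNED_PHRASE = "code snippet"  # phrase named in the filter description


def keep_row(row, answer_key="answer"):
    """Return True if the row passes both filters."""
    answer = (row.get(answer_key) or "").strip()
    if len(answer) < MIN_ANSWER_CHARS:
        return False
    # Assumption: the phrase is searched in every string field of the row.
    text = " ".join(v for v in row.values() if isinstance(v, str)).lower()
    return BANNED_PHRASE not in text


def filter_rows(lines, answer_key="answer"):
    """Parse JSONL lines and keep only the rows that pass the filters."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if keep_row(row, answer_key):
            kept.append(row)
    return kept
```

To apply it to a local JSONL file, pass the open file object as `lines`, for example `filter_rows(open(path, encoding="utf-8"))`.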
## Infrastructure specification
1. 5 nodes, each with 4x A100 (NC96ads A100 v4, spot instances). The total run took ~48 hours: 48 hours * 1.954 USD/hour (US East, https://instances.vantage.sh/azure/vm/nc96ads-v4) * 5 nodes ~= 469 USD.
2. Hugging Face Text Generation Inference (TGI).
Provider:
mesolitica
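Background overview
This is a multi-language programming dataset generated with the Magicoder template from code sampled across several languages (such as C++, Python, and SQL), with a target of at least 10k rows per language. It is mainly intended for code generation and instruction-following tasks and contains about 262k rows. A filtered version drops rows with short answers and rows that contain the phrase `code snippet`. The data is suited to pre-training or fine-tuning large language models.
The summary above was collected and generated by 遇见数据集.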



