OpenDCAI/dataflow-demo-code
收藏Hugging Face2025-12-24 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/OpenDCAI/dataflow-demo-code
下载链接
链接失效反馈官方服务:
资源简介:
该数据集是DataFlow项目中代码数据处理管道的演示,提供了经过处理和验证的代码SFT监督对。数据集的目的是通过多阶段处理管道将原始指令数据转换为高质量的监督微调(SFT)代码数据。处理步骤包括指令增强、代码生成、质量过滤和沙盒过滤。最终输出是一个经过筛选的指令-代码对数据集,适用于训练代码生成模型。数据集包含三种不同规模的子集:DataFlow-Code-1K、DataFlow-Code-5K和DataFlow-Code-10K。每个样本包含两个字段:generated_instruction(增强后的指令)和generated_code(通过质量检查和沙盒执行的代码解决方案)。
This dataset is a demo of the DataFlow Code data processing pipeline from the DataFlow project. It provides a lightweight, inspectable view of what the pipeline produces: curated, execution-checked code SFT supervision pairs. The purpose of the Code pipeline is to transform raw instruction data into high-quality supervised fine-tuning (SFT) code data through a multi-stage processing pipeline including instruction enhancement, code generation, quality filtering, and sandbox filtering. The final output is a curated dataset of instruction-code pairs suitable for training code generation models. The dataset includes three subsets at different scales: DataFlow-Code-1K, DataFlow-Code-5K, and DataFlow-Code-10K. Each sample contains two fields: generated_instruction (the enhanced instruction) and generated_code (the corresponding code solution that has passed quality checks and sandbox execution).
提供机构:
OpenDCAI



