rombodawg/code_bagel
Source: Hugging Face. Updated: 2024-10-08. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/rombodawg/code_bagel
Description:
This dataset contains over 800 million tokens of unique coding data covering more than 100 programming languages. It comprises 3.2 million lines of high-quality coding data that has been filtered, deduplicated, and uncensored. It was created by merging the largest and highest-quality instruction-based coding datasets on HuggingFace, which makes it suitable for continued pretraining of new coding models. The creation process consisted of downloading the individual datasets, using Meta.ai to extract the data and convert it to alpaca format, merging the datasets, and using Claude.ai for deduplication and uncensoring. The README also provides a detailed guide to training AI models with this dataset and lists the supported programming languages along with how often each appears in the dataset.
Provided by:
rombodawg
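
The description above outlines a pipeline of converting source datasets to alpaca format, merging them, and deduplicating the result. The following is a minimal sketch of that idea, not the author's actual tooling: the listing says Meta.ai and Claude.ai were used for these steps, whereas this example substitutes plain exact-match hashing. The function names (`to_alpaca`, `merge_and_dedupe`), the source field names, and the toy records are all hypothetical.

```python
import hashlib
import json


def to_alpaca(record, instruction_key, output_key, input_key=None):
    """Map one source record onto the three Alpaca-style fields."""
    return {
        "instruction": str(record[instruction_key]).strip(),
        "input": str(record.get(input_key, "")).strip() if input_key else "",
        "output": str(record[output_key]).strip(),
    }


def merge_and_dedupe(sources):
    """Merge several sources and drop exact duplicates.

    `sources` is a list of (records, key_mapping) pairs. Duplicates are
    detected by hashing the normalized Alpaca row, so only exact matches
    are removed (an assumption; the original pipeline is not specified
    at this level of detail).
    """
    seen = set()
    merged = []
    for records, keys in sources:
        for record in records:
            row = to_alpaca(record, **keys)
            payload = json.dumps(row, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if digest in seen:
                continue  # already have an identical row
            seen.add(digest)
            merged.append(row)
    return merged


# Hypothetical toy sources with different field names.
source_a = [
    {"prompt": "Reverse a string in Python.", "response": "s[::-1]"},
    {"prompt": "Add two numbers in C.",
     "response": "int add(int a, int b) { return a + b; }"},
]
source_b = [
    {"question": "Reverse a string in Python.", "context": "",
     "answer": "s[::-1]"},
    {"question": "Print hello in Go.", "context": "",
     "answer": 'fmt.Println("hello")'},
]

merged = merge_and_dedupe([
    (source_a, {"instruction_key": "prompt", "output_key": "response"}),
    (source_b, {"instruction_key": "question", "output_key": "answer",
                "input_key": "context"}),
])

for row in merged:
    print(json.dumps(row))
```

Exact-match hashing only catches byte-identical rows after whitespace trimming; near-duplicates (for example, the same answer with different formatting) would need a fuzzier comparison.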



