rombodawg/code_bagel

Source: Hugging Face. Updated 2024-10-08; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/rombodawg/code_bagel
Description:
This dataset contains over 800 million tokens of unique coding data covering more than 100 programming languages, in 3.2 million lines of high-quality, filtered, deduplicated, and uncensored data. It was created by merging the largest and highest-quality instruction-based coding datasets on Hugging Face, which makes it suitable for continued pretraining of new coding models. The creation process involved downloading the individual datasets, using Meta.ai to extract the data and convert it to Alpaca format, merging the datasets, and using Claude.ai for deduplication and uncensoring. The README also provides detailed instructions for training AI models on this dataset and lists the supported programming languages along with their frequency in the dataset.
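
The convert, merge, and deduplicate steps named in the description can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names (`to_alpaca`, `record_fingerprint`, `merge_and_deduplicate`) and the source field names are hypothetical, and exact-match hashing is an assumed deduplication strategy. The only thing taken from the description is the target Alpaca layout of `instruction`, `input`, and `output` fields.

```python
import hashlib
import json

# The three fields of an Alpaca-format record.
ALPACA_KEYS = ("instruction", "input", "output")


def to_alpaca(record, instruction_key, output_key, input_key=None):
    """Map one source record onto the Alpaca fields (hypothetical helper).

    The *_key arguments name the source columns; they differ per dataset,
    so they are passed in rather than guessed.
    """
    return {
        "instruction": str(record.get(instruction_key, "")).strip(),
        "input": str(record.get(input_key, "")).strip() if input_key else "",
        "output": str(record.get(output_key, "")).strip(),
    }


def record_fingerprint(row):
    """Hash the three Alpaca fields so identical rows get identical keys."""
    canonical = json.dumps([row[k] for k in ALPACA_KEYS], ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_and_deduplicate(*datasets):
    """Concatenate datasets in order, keeping the first copy of each row.

    Assumption: only exact duplicates (after whitespace stripping in
    to_alpaca) are removed; near-duplicates are left in place.
    """
    seen = set()
    merged = []
    for dataset in datasets:
        for row in dataset:
            fingerprint = record_fingerprint(row)
            if fingerprint not in seen:
                seen.add(fingerprint)
                merged.append(row)
    return merged
```

A typical call converts each source with its own column names, for example `to_alpaca(row, "prompt", "completion")`, and then passes the converted lists to `merge_and_deduplicate`. Hashing a canonical JSON encoding of the three fields keeps memory use proportional to the number of unique rows rather than to their total text size.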
Provider:
rombodawg