
Yxanul/python-finest-pretrain

Source: Hugging Face. Updated 2025-08-25; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Yxanul/python-finest-pretrain
Description:

This is a high-quality Python pretraining dataset stored in Parquet format, containing carefully curated and validated Python code. According to its description, it delivers 50% better performance than Stack v2 on code generation benchmarks. Every code sample is syntax-validated, every function carries type hints and a docstring, and all code has been rewritten by Llama-3.3-70B for clarity. The examples are self-contained and runnable without external dependencies, and they cover real-world algorithms, design patterns, and best practices. The dataset contains approximately 2.6 million Python code examples, stored as snappy-compressed Parquet files of 100,000 samples each for efficient loading. The average sample length is about 5,000 characters, and the compression ratio is approximately 4.5x.
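The description does not say how the syntax, type-hint, and docstring properties were verified. As a rough illustration only, a spot check of a single sample could look like the sketch below, which uses Python's standard `ast` module. The function name `check_sample` and the three example strings are hypothetical, and this is not the dataset author's validation pipeline.

```python
import ast


def check_sample(source: str) -> dict[str, bool]:
    """Report whether a sample parses and whether its functions are documented and typed.

    Hypothetical helper for illustration; not part of the dataset's tooling.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {"parses": False, "docstrings": False, "type_hints": False}

    # Collect every function definition, including methods and nested functions.
    functions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

    has_docstrings = all(ast.get_docstring(fn) is not None for fn in functions)

    # Require a return annotation and an annotation on each named parameter.
    # Assumption: `self`/`cls` are exempt, and *args/**kwargs are not checked.
    has_type_hints = all(
        fn.returns is not None
        and all(
            arg.annotation is not None
            for arg in fn.args.posonlyargs + fn.args.args + fn.args.kwonlyargs
            if arg.arg not in ("self", "cls")
        )
        for fn in functions
    )

    return {"parses": True, "docstrings": has_docstrings, "type_hints": has_type_hints}


good = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
    return a + b
'''
bad = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b:\n"

for name, sample in [("good", good), ("bad", bad), ("broken", broken)]:
    print(name, check_sample(sample))
```

Running the same check over a random subset of rows would give a quick estimate of how closely the files match the stated guarantees. Note that a sample containing no functions trivially passes the docstring and type-hint checks in this sketch.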
Provider:
Yxanul