PowerCoding
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2025-08-02.
Download link:
https://modelscope.cn/datasets/PowerInfer/PowerCoding
Description:
This repository contains approximately 259 billion tokens of pretraining data generated using Qwen2.5-14B-Instruct, Qwen2.5-32B-Instruct, and Qwen2.5-Coder-32B.
The dataset uses an [MGA-style](https://arxiv.org/abs/2502.04235) methodology and a [persona-driven](https://arxiv.org/abs/2406.20094) data synthesis methodology
to create diverse and comprehensive training data, drawing mainly on the [Yulan](https://arxiv.org/abs/2406.19853), [Stack-V2](https://huggingface.co/datasets/common-pile/stackv2), and [Pile](https://huggingface.co/datasets/EleutherAI/pile) datasets.
The dataset is available under the Apache 2.0 license.
# Bias, Risks, and Limitations
- This dataset is mainly in English.
- The dataset inherits the biases, errors, and omissions known to exist in data used for seed sources and models used for data generation.
- The dataset is synthetically generated and may therefore contain content that does not accurately reflect real-world phenomena.
- The synthetic nature of this dataset may limit its ability to generalize to real-world cases.
Provider:
maas
Created:
2025-07-25



