
RefineCode-code-corpus-meta

ModelScope Community. Updated 2025-10-11. Indexed 2024-11-23.
Download link:
https://modelscope.cn/datasets/infly/RefineCode-code-corpus-meta
Description:
This dataset consists of metadata (the repository name and file path) for the raw code data in **RefineCode**. You can collect the corresponding files by referring to this metadata and reproduce **RefineCode**!

***Note:** Currently, we have uploaded only the metadata covered by The Stack V2 (about 50% of the file volume). Due to complex legal considerations, we are currently unable to provide the complete source code. We are working hard to make the remaining part available.*

---

**RefineCode** is a **high-quality**, **reproducible** code pretraining corpus comprising **960 billion** tokens across **607** programming languages and **75 billion** code-related tokens recalled from web corpora, incorporating over **130** language-specific rules with customized weight assignments. Our dataset shows better training efficacy and efficiency compared with the training subset of The Stack V2.

<img src="https://raw.githubusercontent.com/OpenCoder-llm/opencoder-llm.github.io/refs/heads/main/static/images/opencoder_banner.png" alt="OpenCoder banner" style="zoom:30%;" />

We also use PCA to visualize the embeddings extracted from CodeBERT for The Stack V2 and **RefineCode**, showing a clear advantage for our pretraining dataset.

<img src="https://raw.githubusercontent.com/OpenCoder-llm/opencoder-llm.github.io/refs/heads/main/static/images/compare_refinecode_stack_v2.jpg" alt="Distribution Comparison" style="zoom:50%;" />
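As a rough illustration of how the metadata might be used to collect files, the sketch below groups requested file paths by repository so that each repository only needs to be fetched once. The field names `repo_name` and `path` are assumptions made for this example; the description above only says that each record carries a repository name and a file path, so check the actual column names in the dataset before relying on them. Fetching the files themselves is left out.

```python
from collections import defaultdict


def group_paths_by_repo(records):
    """Map each repository name to the sorted, de-duplicated file paths wanted from it.

    `records` is an iterable of dicts. The keys "repo_name" and "path" are
    hypothetical names chosen for this sketch, not confirmed dataset columns.
    """
    grouped = defaultdict(set)
    for record in records:
        grouped[record["repo_name"]].add(record["path"])
    # Sort the paths so the resulting manifest is deterministic.
    return {repo: sorted(paths) for repo, paths in grouped.items()}


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"repo_name": "example-org/project", "path": "src/main.py"},
        {"repo_name": "example-org/project", "path": "README.md"},
        {"repo_name": "another-user/tool", "path": "cmd/tool.go"},
    ]
    for repo, paths in group_paths_by_repo(sample).items():
        print(repo, paths)
```

Grouping first keeps the number of repository fetches proportional to the number of repositories rather than the number of files, which matters at the scale this corpus describes.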

Provider:
maas
Created:
2024-11-22