ReactiveAI/Beta-Code
Source: Hugging Face · Updated 2025-12-17 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/ReactiveAI/Beta-Code
Description:
Reactive AI / Beta Code is a code pre-training corpus for RxT-Beta models, built from public and open datasets. It includes code in languages such as C, C#, C++, HTML, Java, PHP, JavaScript (js), Markdown (md), Python (py), and TypeScript (ts), with subsets divided by code length into short (< ~1024 tokens) and long (> ~1024 tokens) categories. The data is derived from the codeparrot datasets: the Python subsets come from codeparrot/codeparrot-clean and the other subsets from codeparrot/github-code-clean. The dataset is licensed under Apache-2.0, is categorized under text-generation tasks, and lists English as its language.
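
To illustrate the short/long split described above, here is a minimal sketch of bucketing code samples around a 1024-token boundary. It is not the authors' pipeline: the function names are hypothetical, the whitespace token count is only a stand-in for whatever tokenizer was actually used, and placing samples of exactly 1024 tokens in the long bucket is an assumption, since the description gives only approximate bounds.

```python
from typing import Callable, Dict, List

# Approximate boundary taken from the description ("~1024 tokens").
TOKEN_THRESHOLD = 1024


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): count whitespace-separated pieces."""
    return len(text.split())


def bucket_by_length(
    samples: List[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    threshold: int = TOKEN_THRESHOLD,
) -> Dict[str, List[str]]:
    """Split samples into 'short' (< threshold) and 'long' (>= threshold).

    Treating the exact threshold as 'long' is an assumption; the dataset
    description does not say how boundary cases are handled.
    """
    buckets: Dict[str, List[str]] = {"short": [], "long": []}
    for sample in samples:
        key = "short" if count_tokens(sample) < threshold else "long"
        buckets[key].append(sample)
    return buckets


if __name__ == "__main__":
    short_sample = "x = 1"
    long_sample = " ".join(["tok"] * 2000)
    result = bucket_by_length([short_sample, long_sample])
    print({name: len(items) for name, items in result.items()})
```

A real tokenizer can be swapped in through the `count_tokens` parameter, which is why the counter is passed as a callable rather than hard-coded.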
Provider:
ReactiveAI



