ReactiveAI/Beta-Pre-Train-Corpus
Source: Hugging Face. Updated: 2026-02-28. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/ReactiveAI/Beta-Pre-Train-Corpus
Description:
The Reactive AI Beta Pre-Train Corpus is a pre-training corpus for RxT-Beta models, built from public and open datasets. It includes high-quality English and Polish web-crawl data, mathematical and scientific subsets, and code in various programming languages. The dataset is divided into multiple subsets, such as FineWeb-Edu, FineWiki, FineWeb2-HQ, FineMath, ProofPile-2, and Notebooks, each with its own number of examples and size.
Provided by:
ReactiveAI



