ibrahim2806/Arc-120B-PreTraining-Dataset
Source: Hugging Face · Updated: 2026-04-23 · Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/ibrahim2806/Arc-120B-PreTraining-Dataset
Description:
The Arc 120B Pre-Training Dataset is the complete pre-training data recipe for Arc, a 120B-parameter language model designed to be:

1. **Code-first**: 35% of the training data is high-quality code across 619 languages.
2. **Hackathon-ready**: unique training data teaches how to create winning hackathon presentations.
3. **Knowledgeable**: broad general knowledge drawn from open web, math, and academic datasets.
4. **Direct & Honest**: trained to answer directly, admit uncertainty, and avoid hallucination.

The dataset consists of 4.8 trillion tokens, divided into the following domains: Code (35%), Web/General (35%), Math/Reasoning (12%), Alignment/Honesty (8%), Presentations (5%), and Multilingual (5%). It also includes unique seed data such as hackathon presentation strategies, honesty/directness training data, and code examples in various programming languages.
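The domain percentages above can be converted into approximate per-domain token budgets. The following is a minimal sketch that uses only the figures stated in the description (4.8 trillion tokens and the six listed shares). The names `TOTAL_TOKENS`, `DOMAIN_SHARE_PCT`, and `token_budget` are hypothetical and are not part of the dataset or any library.

```python
# Illustrative arithmetic only: figures come from the description above,
# all identifiers here are hypothetical.
TOTAL_TOKENS = 4_800_000_000_000  # 4.8 trillion tokens, as stated

DOMAIN_SHARE_PCT = {
    "Code": 35,
    "Web/General": 35,
    "Math/Reasoning": 12,
    "Alignment/Honesty": 8,
    "Presentations": 5,
    "Multilingual": 5,
}


def token_budget(total: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split a total token count across domains by integer percentage."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("domain shares must sum to 100%")
    # Integer arithmetic avoids floating-point rounding on large counts.
    return {name: total * pct // 100 for name, pct in shares_pct.items()}


budgets = token_budget(TOTAL_TOKENS, DOMAIN_SHARE_PCT)
for name, tokens in budgets.items():
    print(f"{name:<18} {tokens / 1e12:.3f}T tokens")
```

Under these stated shares, the two largest domains (Code and Web/General) each account for about 1.68 trillion tokens, and the six shares together cover the full 4.8 trillion.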
Provider:
ibrahim2806



