MinimaML/cocktail-6b

Name: MinimaML/cocktail-6b
Creator: MinimaML
Published: 2025-12-23 20:59:47
License: No description available

Source: Hugging Face. Updated: 2025-12-23. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/MinimaML/cocktail-6b


Resource overview:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- py
size_categories:
- 1B<n<10B
tags:
- synthetic
- math
- code
- educational
---

# The Cocktail Dataset (6B Tokens)

A high-density, interleaved pre-training dataset designed for training 3B+ parameter models. It combines synthetic textbooks, advanced mathematical reasoning, and production-grade code into a single balanced stream.

### Composition (The Mix)

The dataset is pre-shuffled and interleaved to ensure optimal distribution of domains.

| Domain | Share | Sources | Description |
| :------------- | :------ | :--------------------------------- | :------------------------------------------------------------ |
| **Foundation** | **50%** | Cosmopedia v2, FineWeb-Edu | High-quality synthetic textbooks and educational web content. |
| **Logic** | **30%** | Orca-Math, MetaMathQA, OpenMath | Diverse mathematical reasoning (2.4M unique items). |
| **Code** | **20%** | The Stack v2 (Python), Glaive, SQL | Deduplicated, high-quality code and execution logic. |

### Technical Specifications

* **Total Size**: ~5.6 Billion Tokens (22.35 GB).
* **Format**: `uint32` binary files (Little Endian).
* **Tokenizer**: Llama-3 (TikToken).
* **Sequence Length**: Continuous stream (EOS tokens included).

### Usage instructions

The dataset is stored as raw binary memory maps for maximum I/O throughput.

**Loading in Python:**

```python
import numpy as np

# Path to file
file_path = "code_6B.bin"

# Load as memory-mapped array (Instant access)
# Note: dtype is uint32 to support Llama-3 vocabulary (>65k)
data = np.memmap(file_path, dtype=np.uint32, mode="r")

print(f"Loaded {len(data)} tokens.")
print(f"First 10 tokens: {data[:10]}")
```

### File Structure

* `foundation_6B.bin`: General knowledge and textbook data.
* `logic_6B.bin`: Mathematical and reasoning data.
* `code_6B.bin`: Programming language data.
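
**Sampling training windows (illustrative sketch):**

Because each file is a continuous stream of little-endian `uint32` tokens, its token count is the file size divided by 4 bytes, which matches the stated ~5.6 billion tokens at 22.35 GB. The sketch below shows one possible way to check that count and to draw fixed-length input/target windows from a memory-mapped stream. The helper names `count_tokens` and `sample_batch`, the `seq_len` and `batch_size` values, and the synthetic `demo.bin` stand-in file are assumptions made for illustration; they are not part of the dataset or its official tooling.

```python
import os
import tempfile

import numpy as np

# Little-endian uint32, following the format stated in the card.
TOKEN_DTYPE = np.dtype("<u4")


def count_tokens(path):
    """Return the number of tokens in a raw uint32 token file (hypothetical helper)."""
    size_bytes = os.path.getsize(path)
    if size_bytes % TOKEN_DTYPE.itemsize != 0:
        raise ValueError(f"{path}: size is not a multiple of 4 bytes")
    return size_bytes // TOKEN_DTYPE.itemsize


def sample_batch(data, batch_size, seq_len, rng):
    """Draw random windows; targets are the inputs shifted by one token (hypothetical helper)."""
    if len(data) <= seq_len:
        raise ValueError("token stream must contain at least seq_len + 1 tokens")
    # Each window needs seq_len + 1 tokens, so the exclusive upper bound is len - seq_len.
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)
    windows = np.stack([data[s : s + seq_len + 1] for s in starts]).astype(np.int64)
    return windows[:, :-1], windows[:, 1:]


if __name__ == "__main__":
    # Synthetic stand-in file so the sketch runs without downloading the dataset.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.bin")
        np.arange(10_000, dtype=TOKEN_DTYPE).tofile(demo_path)

        data = np.memmap(demo_path, dtype=TOKEN_DTYPE, mode="r")
        rng = np.random.default_rng(0)
        x, y = sample_batch(data, batch_size=4, seq_len=128, rng=rng)
        print(count_tokens(demo_path), x.shape, y.shape)

        del data  # release the memory map before the temp directory is removed
```

To try this on the real data, point `np.memmap` at one of the files listed under File Structure instead of the stand-in. The cast to `int64` is a convenience for frameworks that expect signed integer token ids; only the sampled windows are copied, so the full file is never loaded into memory.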

Provided by:

MinimaML
