
MinimaML/cocktail-6b

Source: Hugging Face. Updated 2025-12-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/MinimaML/cocktail-6b
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
- py
size_categories:
- 1B<n<10B
tags:
- synthetic
- math
- code
- educational
---

# The Cocktail Dataset (6B Tokens)

A high-density, interleaved pre-training dataset designed for training 3B+ parameter models. It combines synthetic textbooks, advanced mathematical reasoning, and production-grade code into a single balanced stream.

### Composition (The Mix)

The dataset is pre-shuffled and interleaved to ensure optimal distribution of domains.

| Domain         | Share   | Sources                            | Description                                                   |
| :------------- | :------ | :--------------------------------- | :------------------------------------------------------------ |
| **Foundation** | **50%** | Cosmopedia v2, FineWeb-Edu         | High-quality synthetic textbooks and educational web content. |
| **Logic**      | **30%** | Orca-Math, MetaMathQA, OpenMath    | Diverse mathematical reasoning (2.4M unique items).           |
| **Code**       | **20%** | The Stack v2 (Python), Glaive, SQL | Deduplicated, high-quality code and execution logic.          |

### Technical Specifications

* **Total Size**: ~5.6 Billion Tokens (22.35 GB).
* **Format**: `uint32` binary files (Little Endian).
* **Tokenizer**: Llama-3 (TikToken).
* **Sequence Length**: Continuous stream (EOS tokens included).

### Usage instructions

The dataset is stored as raw binary memory maps for maximum I/O throughput.

**Loading in Python:**

```python
import numpy as np

# Path to file
file_path = "code_6B.bin"

# Load as memory-mapped array (Instant access)
# Note: dtype is uint32 to support Llama-3 vocabulary (>65k)
data = np.memmap(file_path, dtype=np.uint32, mode="r")

print(f"Loaded {len(data)} tokens.")
print(f"First 10 tokens: {data[:10]}")
```

### File Structure

* `foundation_6B.bin`: General knowledge and textbook data.
* `logic_6B.bin`: Mathematical and reasoning data.
* `code_6B.bin`: Programming language data.
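
**Sampling training batches (illustrative sketch):**

The card describes a continuous `uint32` token stream split across three files with a 50/30/20 domain mix, but it does not say how batches should be drawn. The sketch below is one possible approach, not the authors' pipeline: it memory-maps each file and picks random fixed-length windows, choosing the source file for each window in proportion to the shares in the composition table. The helper names `open_streams` and `sample_batch`, the `SHARES` mapping, and the choice of random-window sampling are all assumptions made for illustration.

```python
import numpy as np

# Domain shares taken from the composition table; file names from the
# File Structure list. Pairing them this way is an assumption.
SHARES = {"foundation_6B.bin": 0.5, "logic_6B.bin": 0.3, "code_6B.bin": 0.2}


def open_streams(paths):
    """Memory-map each token file as a read-only uint32 array (hypothetical helper)."""
    return {p: np.memmap(p, dtype=np.uint32, mode="r") for p in paths}


def sample_batch(streams, shares, batch_size, seq_len, rng):
    """Draw `batch_size` random windows of `seq_len + 1` tokens.

    The source file for each window is picked with probability equal to its
    share. Returns (inputs, targets), each of shape (batch_size, seq_len),
    where targets are the inputs shifted left by one token.
    """
    names = list(shares)
    probs = np.array([shares[n] for n in names], dtype=np.float64)
    probs /= probs.sum()  # normalise in case the shares do not sum to 1
    picks = rng.choice(len(names), size=batch_size, p=probs)

    # Cast to int64 because many training frameworks expect signed indices.
    batch = np.empty((batch_size, seq_len + 1), dtype=np.int64)
    for row, idx in enumerate(picks):
        data = streams[names[idx]]
        # Upper bound is exclusive, so start + seq_len + 1 never exceeds len(data).
        start = rng.integers(0, len(data) - seq_len)
        batch[row] = data[start : start + seq_len + 1]
    return batch[:, :-1], batch[:, 1:]
```

With the three files in the working directory, `streams = open_streams(SHARES)` followed by `sample_batch(streams, SHARES, batch_size=8, seq_len=2048, rng=np.random.default_rng(0))` would yield one batch. Because windows start at arbitrary offsets, a sequence may span an EOS token; whether to mask across document boundaries is left to the training setup.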
Provider:
MinimaML