JoTeqtheFirstAI/fineweb-edu-dedup6m

Name: JoTeqtheFirstAI/fineweb-edu-dedup6m
Creator: JoTeqtheFirstAI
Published: 2026-02-01 13:14:21
License: apache-2.0

Source: Hugging Face. Updated 2026-02-01. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/JoTeqtheFirstAI/fineweb-edu-dedup6m


Dataset description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- fineweb-edu-dedup
- auditing
- iso-27001
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 6000000
    num_shards: 120
---

# Stage 1 (S1): General Knowledge Anchor - 6M FineWeb-Edu-Dedup

## 1. Project Overview

This dataset represents the **General Knowledge Acquisition Phase (S1)** for a research project focused on developing a Domain-Adaptive LLM for **ISO 27001 Information Security Auditing**.

S1 serves as the cognitive foundation. This corpus is designed to establish high-level linguistic proficiency and general reasoning before the introduction of specialized regulatory standards in Stage 2.

## 2. Dataset Summary

- **Total Samples:** 6,000,000
- **Primary Source:** [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/fineweb-edu-dedup)
- **Role:** General Reasoning & Scientific Logic Base.

## 3. Sourcing & Preprocessing (S1 Methodology)

The sourcing logic for this 6M slice prioritized **Knowledge Density** over raw volume:

* **Educational Filtering:** Only samples with a high "educational score" (classifier-based) were retained to ensure the model learns professional and structured language.
* **Sharding:** Organized into 120 Parquet shards to support high-throughput, multi-node training.

## 4. Technical Specifications

| Parameter | Value |
| :--- | :--- |
| **Format** | Parquet (Compressed) |
| **Average Sequence Length** | 600 - 4096 tokens |
| **Language** | English (High-Proficiency) |

## 5. Usage in Continual Pre-training

This dataset is intended to be interleaved with **Math/Code** and **Multilingual** streams to reach a Stage 1 target of 10B tokens.
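As a rough illustration of what interleaving several streams could look like, the sketch below mixes three record iterators by sampling a stream name according to fixed weights. The stream names, the toy records, and the 0.6/0.3/0.1 mix are all hypothetical; the dataset card does not state the actual mixing ratios. With Hugging Face `datasets`, the `interleave_datasets` function with its `probabilities` argument would likely be the more natural tool for the same job.

```python
import random
from typing import Dict, Iterable, Iterator


def interleave_streams(
    streams: Dict[str, Iterable[dict]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[dict]:
    """Yield records by sampling a stream name according to `weights`.

    Iteration stops the first time the sampled stream is exhausted.
    """
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    names = list(iterators)
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            record = next(iterators[name])
        except StopIteration:
            return
        # Tag each record with the stream it came from.
        yield {**record, "stream": name}


# Toy stand-ins for the three streams (hypothetical data, not real samples).
general = ({"text": f"general doc {i}"} for i in range(1000))
math_code = ({"text": f"math/code doc {i}"} for i in range(1000))
multilingual = ({"text": f"multilingual doc {i}"} for i in range(1000))

mixed = list(
    interleave_streams(
        {"general": general, "math_code": math_code, "multilingual": multilingual},
        # Hypothetical mixing weights; the card does not specify them.
        weights={"general": 0.6, "math_code": 0.3, "multilingual": 0.1},
        seed=42,
    )
)
```

Stopping at the first exhausted stream keeps the realized mix close to the requested weights; continuing past that point would skew the tail toward whichever streams still have data.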
### Loading for Training (Streaming)

```python
from datasets import load_dataset

dataset = load_dataset("JoTeqtheFirstAI/fineweb-edu-dedup6m", split="train", streaming=True)
```
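Because the card lists a `score` column next to `text` and `source`, a consumer might want to apply a stricter educational-score cut-off while streaming. The following is a minimal sketch under stated assumptions: the threshold of 3.0, the helper name `filter_by_score`, and the toy records are hypothetical and not taken from the dataset.

```python
from itertools import islice
from typing import Iterable, Iterator


def filter_by_score(records: Iterable[dict], min_score: float) -> Iterator[dict]:
    """Lazily keep only records whose `score` is at least `min_score`."""
    for record in records:
        if record["score"] >= min_score:
            yield record


# Toy records using the column names from the card (values are made up).
sample = [
    {"text": "Photosynthesis converts light into chemical energy.", "source": "example", "score": 4.1},
    {"text": "Click here to subscribe!", "source": "example", "score": 1.2},
    {"text": "A prime number has exactly two divisors.", "source": "example", "score": 3.0},
]

# Hypothetical threshold; take at most the first five matching records.
kept = list(islice(filter_by_score(sample, min_score=3.0), 5))
```

A streaming dataset yields one dict per row when iterated, so the `dataset` object from the loading snippet could presumably be passed in place of `sample`, with `islice` bounding how many rows are pulled.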

Provider:

JoTeqtheFirstAI
