
JoTeqtheFirstAI/fineweb-edu-dedup6m

Source: Hugging Face. Updated 2026-02-01. Indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/JoTeqtheFirstAI/fineweb-edu-dedup6m
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- fineweb-edu-dedup
- auditing
- iso-27001
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 6000000
    num_shards: 120
---

# Stage 1 (S1): General Knowledge Anchor (6M FineWeb-Edu-Dedup)

## 1. Project Overview

This dataset represents the **General Knowledge Acquisition Phase (S1)** for a research project focused on developing a Domain-Adaptive LLM for **ISO 27001 Information Security Auditing**.

S1 serves as the cognitive foundation. This corpus is designed to establish high-level linguistic proficiency and general reasoning before the introduction of specialized regulatory standards in Stage 2.

## 2. Dataset Summary

- **Total Samples:** 6,000,000
- **Primary Source:** [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/fineweb-edu-dedup)
- **Role:** General Reasoning & Scientific Logic Base.

## 3. Sourcing & Preprocessing (S1 Methodology)

The sourcing logic for this 6M slice prioritized **Knowledge Density** over raw volume:

* **Educational Filtering:** Only samples with a high "educational score" (classifier-based) were retained to ensure the model learns professional and structured language.
* **Sharding:** Organized into 120 Parquet shards to support high-throughput, multi-node training.

## 4. Technical Specifications

| Parameter | Value |
| :--- | :--- |
| **Format** | Parquet (Compressed) |
| **Average Sequence Length** | 600 - 4096 tokens |
| **Language** | English (High-Proficiency) |

## 5. Usage in Continual Pre-training

This dataset is intended to be interleaved with **Math/Code** and **Multilingual** streams to reach a Stage 1 target of 10B tokens.
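The card does not state how the three streams are mixed, so the following is only a minimal sketch of weighted interleaving in plain Python. The stream names, the sampling weights, and the helper `interleave_streams` are hypothetical and chosen for illustration. The `datasets` library also provides `interleave_datasets` for the same purpose.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def interleave_streams(
    streams: Dict[str, Iterable[dict]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, dict]]:
    """Yield (stream_name, record) pairs, picking a stream by weight each step.

    Iteration stops the first time the picked stream has no records left.
    """
    rng = random.Random(seed)
    names = list(streams)
    iterators = {name: iter(streams[name]) for name in names}
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(iterators[name])
        except StopIteration:
            return


# Toy in-memory streams standing in for the real corpora (assumed shapes).
demo_streams = {
    "general": [{"text": f"general {i}"} for i in range(50)],
    "math_code": [{"text": f"math/code {i}"} for i in range(50)],
    "multilingual": [{"text": f"multilingual {i}"} for i in range(50)],
}
# Hypothetical mixing ratios; the card does not specify them.
demo_weights = {"general": 0.6, "math_code": 0.3, "multilingual": 0.1}

demo_mix = list(interleave_streams(demo_streams, demo_weights, seed=42))
```

Because the helper accepts any iterable of records, a streaming dataset could be passed in place of the in-memory lists. Stopping at the first exhausted stream keeps the realized mix close to the requested weights rather than drifting toward whichever stream is largest.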
### Loading for Training (Streaming)

```python
from datasets import load_dataset

dataset = load_dataset("JoTeqtheFirstAI/fineweb-edu-dedup6m", split="train", streaming=True)
```
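Once the stream is open, each record is a dict with the `text`, `source`, and `score` fields listed in the card metadata. As a rough sketch of how one might sample only the higher-scoring records, the hypothetical helper `take_high_score` below filters lazily and stops after a fixed number of matches. The threshold is an assumption; the card does not state the cutoff used when the slice was built.

```python
from itertools import islice
from typing import Iterable, List


def take_high_score(
    records: Iterable[dict], min_score: float, limit: int
) -> List[dict]:
    """Return up to `limit` records whose `score` is at least `min_score`.

    Works lazily, so it never reads more of the stream than it needs.
    """
    kept = (r for r in records if r["score"] >= min_score)
    return list(islice(kept, limit))


# Small in-memory sample mirroring the card's schema (values are made up).
sample = [
    {"text": "Photosynthesis converts light ...", "source": "demo", "score": 4.1},
    {"text": "Click here to subscribe ...", "source": "demo", "score": 2.0},
    {"text": "A matrix is invertible when ...", "source": "demo", "score": 3.5},
]
top = take_high_score(sample, min_score=3.0, limit=2)
```

With the streaming dataset loaded as above, the same call would be `take_high_score(dataset, min_score=3.0, limit=5)`, which reads only as many records as it needs to find five matches.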
Provider: JoTeqtheFirstAI