JoTeqtheFirstAI/fineweb-edu-dedup6m
Hugging Face · Updated 2026-02-01 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JoTeqtheFirstAI/fineweb-edu-dedup6m
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- fineweb-edu-dedup
- auditing
- iso-27001
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 6000000
    num_shards: 120
---
# Stage 1 (S1): General Knowledge Anchor — 6M FineWeb-Edu-Dedup
## 1. Project Overview
This dataset represents the **General Knowledge Acquisition Phase (S1)** for a research project focused on developing a Domain-Adaptive LLM for **ISO 27001 Information Security Auditing**.
S1 serves as the cognitive foundation. This corpus is designed to establish high-level linguistic proficiency and general reasoning before the introduction of specialized regulatory standards in Stage 2.
## 2. Dataset Summary
- **Total Samples:** 6,000,000
- **Primary Source:** [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/viewer/fineweb-edu-dedup)
- **Role:** General Reasoning & Scientific Logic Base.
## 3. Sourcing & Preprocessing (S1 Methodology)
The sourcing logic for this 6M slice prioritized **Knowledge Density** over raw volume:
* **Educational Filtering:** Only samples that a classifier assigned a high "educational score" were retained, so that the model learns professional, structured language.
* **Sharding:** Organized into 120 Parquet shards to support high-throughput, multi-node training.
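
The two steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline actually used for this dataset: the cutoff `MIN_EDU_SCORE`, the helper names `keep_educational` and `shard_bounds`, and the sample records (including the `source` value) are all hypothetical. Only the `text`, `source`, and `score` fields, the 6,000,000 example count, and the 120 shards come from this card.

```python
from typing import Iterable, Iterator

# Hypothetical cutoff: the card does not state the threshold that was used.
MIN_EDU_SCORE = 3.0


def keep_educational(records: Iterable[dict], min_score: float = MIN_EDU_SCORE) -> Iterator[dict]:
    """Yield only records whose classifier score meets the cutoff."""
    for record in records:
        if record["score"] >= min_score:
            yield record


def shard_bounds(num_examples: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, num_examples) into contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(num_examples, num_shards)
    bounds = []
    start = 0
    for shard_index in range(num_shards):
        # The first `extra` shards take one additional example each.
        end = start + base + (1 if shard_index < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# Made-up records that mirror the card's schema (text, source, score).
sample = [
    {"text": "Photosynthesis converts light into chemical energy.", "source": "fineweb-edu-dedup", "score": 3.7},
    {"text": "click here to win", "source": "fineweb-edu-dedup", "score": 1.2},
]
kept = list(keep_educational(sample))
bounds = shard_bounds(6_000_000, 120)
print(len(kept), bounds[0], bounds[-1])  # 1 (0, 50000) (5950000, 6000000)
```

With 6,000,000 examples and 120 shards, an even split gives 50,000 examples per shard. Whether the published shards are exactly equal in size is not stated here.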
## 4. Technical Specifications
| Parameter | Value |
| :--- | :--- |
| **Format** | Parquet (Compressed) |
| **Sequence Length (range)** | 600 to 4096 tokens |
| **Language** | English (High-Proficiency) |
## 5. Usage in Continual Pre-training
This dataset is intended to be interleaved with **Math/Code** and **Multilingual** streams to reach a Stage 1 target of 10B tokens.
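
One way to picture the interleaving is weighted sampling across streams until a token budget is spent. The sketch below is a minimal pure-Python illustration, not the project's training loader. The function name `interleave`, the stream names, the mixing weights, the toy texts, and the whitespace token count are all assumptions, since this card does not specify the mixture ratios or the tokenizer. In practice, the `datasets` library provides `interleave_datasets` with a `probabilities` argument for the same purpose.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def interleave(
    streams: Dict[str, Iterable[str]],
    weights: Dict[str, float],
    token_budget: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (stream_name, text) pairs, sampling streams by weight until the budget is spent."""
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    spent = 0
    while iterators and spent < token_budget:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            text = next(iterators[name])
        except StopIteration:
            del iterators[name]  # exhausted stream: keep drawing from the rest
            continue
        spent += len(text.split())  # whitespace split stands in for a real tokenizer
        yield name, text


# Toy streams and hypothetical weights; the real target is 10B tokens.
streams = {
    "general": ["the cell is the basic unit of life"] * 50,
    "math_code": ["def add(a, b): return a + b"] * 50,
    "multilingual": ["la seguridad de la información"] * 50,
}
weights = {"general": 0.6, "math_code": 0.25, "multilingual": 0.15}
mixed = list(interleave(streams, weights, token_budget=200))
total_tokens = sum(len(text.split()) for _, text in mixed)
print(len(mixed), total_tokens)
```

Fixing the seed keeps the mixture reproducible across runs, which helps when comparing checkpoints trained on the same Stage 1 blend.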
### Loading for Training (Streaming)
```python
from datasets import load_dataset

# Stream the train split so records are read lazily instead of being downloaded up front.
dataset = load_dataset("JoTeqtheFirstAI/fineweb-edu-dedup6m", split="train", streaming=True)
```

Provider: JoTeqtheFirstAI



