mkd-chanwoo/filtered-datasets-for-koreanLLM
Source: Hugging Face · Updated: 2026-04-10 · Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/mkd-chanwoo/filtered-datasets-for-koreanLLM
Dataset description:
---
language:
- en
- ko
license: other
task_categories:
- text-generation
tags:
- pretraining
- nlp
- korean
- english
- code
- science
- corpus
- filtered
- quality-filtered
- jsonl
pretty_name: Filtered Datasets for Korean LLM (Stage 1)
size_categories:
- 100M<n<1B
---
# Filtered Datasets for Korean LLM
> **Stage 1 — Filtering output** of the Keural Korean LLM pretraining pipeline.
> Every document has passed a 3-stage filter: quality, language detection, and toxicity/PII removal.
> Input: ~334M documents → Output: **~293M documents (87.8% pass rate)**.
---
## Quick Stats
| Metric | Value |
|--------|-------|
| Total input documents | 334,283,705 |
| Total documents passed | 293,386,402 |
| Overall pass rate | 87.8% |
| Domains | English · Korean · Code · Science |
| Format | JSONL (one JSON object per line) |
| Filters applied | Quality · Language · Toxicity/PII |
| Tokenizer | keural SentencePiece (`mkd-ai/keural-tokenizer`) |
| Last updated | 2026-04-09 |
---
## Where This Fits in the Pipeline
```mermaid
flowchart LR
A["Stage 0\nRaw Download"] --> B["Stage 0.5\nNormalization\nmkd-chanwoo/normalized-datasets-for-koreanLLM"]
B --> C["Stage 1\nFiltering\n← YOU ARE HERE\n293M docs / 249B tokens"]
C --> D["Stage 2\nDedup + Shard\nmkd-chanwoo/keural-datasets\n329M docs / 220B tokens"]
style C fill:#f0a500,color:#000
```
---
## For Beginners: What Does Filtering Do?
Raw text from the internet contains a lot of noise:
- Documents in the wrong language (e.g., a French article in an English dataset)
- Very short or very long documents that are likely junk
- Documents with too many numbers, URLs, or repeated lines (often scraped boilerplate)
- Documents containing toxic content, hate speech, or personal information (phone numbers, resident IDs)
**Stage 1 filtering** removes or redacts these problems so that only clean, high-quality documents proceed to the next stage.
Three filters run **sequentially** on every document. A document that fails any one filter is discarded; phone numbers and email addresses are redacted in place rather than causing rejection.
---
## Filter 1 — Quality Filter
Removes structurally poor documents based on statistical text properties.
| Rule | Threshold | Action |
|------|-----------|--------|
| Minimum document length | < 200 characters | ❌ Reject |
| Maximum document length | > 1,000,000 characters | ❌ Reject |
| Digit character ratio | > 30% of all characters | ❌ Reject |
| Repeated line ratio | > 20% of lines are duplicates | ❌ Reject |
| Bullet point / list ratio | > 90% of lines start with a bullet | ❌ Reject |
| HTML tag ratio | > 10% of content is HTML tags | ❌ Reject |
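
The rules in the table can be sketched as a single function that returns the first rule a document violates. This is a minimal illustration, not the pipeline's actual code: the thresholds come from the table, but the function name, the bullet and HTML patterns, and the exact way each ratio is measured (for example, counting a line as a duplicate when an identical line appeared earlier) are assumptions.

```python
import re
from typing import Optional

# Thresholds copied from the rule table.
MIN_CHARS = 200
MAX_CHARS = 1_000_000
MAX_DIGIT_RATIO = 0.30
MAX_REPEATED_LINE_RATIO = 0.20
MAX_BULLET_LINE_RATIO = 0.90
MAX_HTML_RATIO = 0.10

# Assumed definitions of "bullet line" and "HTML tag" (hypothetical).
BULLET_RE = re.compile(r"^\s*(?:[-*•·]|\d+[.)])\s+")
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^<>]*>")


def quality_reject_reason(text: str) -> Optional[str]:
    """Return the name of the first failed rule, or None if the text passes."""
    n_chars = len(text)
    if n_chars < MIN_CHARS:
        return "too_short"
    if n_chars > MAX_CHARS:
        return "too_long"

    # Share of characters that are digits.
    if sum(ch.isdigit() for ch in text) / n_chars > MAX_DIGIT_RATIO:
        return "digit_ratio"

    # Line-level checks operate on non-empty, stripped lines.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        # Assumption: a line is a duplicate if an identical line came before it.
        repeated = len(lines) - len(set(lines))
        if repeated / len(lines) > MAX_REPEATED_LINE_RATIO:
            return "repeated_lines"
        bullets = sum(bool(BULLET_RE.match(ln)) for ln in lines)
        if bullets / len(lines) > MAX_BULLET_LINE_RATIO:
            return "bullet_ratio"

    # Assumption: HTML ratio is the share of characters inside tags.
    html_chars = sum(len(m.group(0)) for m in HTML_TAG_RE.finditer(text))
    if html_chars / n_chars > MAX_HTML_RATIO:
        return "html_ratio"
    return None
```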
---
## Filter 2 — Language Detection Filter
Ensures each document is actually in the language its domain declares (English or Korean).
Code datasets are **exempt** from language filtering (code is language-agnostic).
| Setting | Value |
|---------|-------|
| Primary detector | **FastText** `lid.176.bin` (supports 176 languages) |
| Fallback detector | `langdetect` (used if FastText is inconclusive) |
| Confidence threshold | **0.75** (75% minimum certainty) |
| Applied to | English domain, Korean domain |
| Not applied to | Code domain, Science domain |
**How it works:**
1. FastText analyzes the document text
2. Returns predicted language + confidence score
3. If predicted language matches domain AND confidence ≥ 0.75 → ✅ Pass
4. Otherwise → ❌ Reject
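
The decision in steps 3 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `passes_language_filter`, `Detector`, and `make_fasttext_detector` are hypothetical, the detector is passed in as a callable so the decision logic can be tested without a model file, and the `langdetect` fallback is omitted because the card does not say when FastText counts as inconclusive.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.75  # from the settings table

# Domains that are language-checked, mapped to the expected language code.
EXPECTED_LANG = {"english": "en", "korean": "ko"}

# A detector maps text to (language_code, confidence).
Detector = Callable[[str], Tuple[Optional[str], float]]


def passes_language_filter(text: str, domain: str, detect: Detector) -> bool:
    """Apply the match-and-threshold rule; exempt domains always pass."""
    expected = EXPECTED_LANG.get(domain)
    if expected is None:
        return True  # domain is not language-filtered
    lang, confidence = detect(text)
    return lang == expected and confidence >= CONFIDENCE_THRESHOLD


def make_fasttext_detector(model_path: str = "lid.176.bin") -> Detector:
    """Wrap a FastText language-ID model (requires the `fasttext` package)."""
    import fasttext  # third-party; imported lazily so the rest runs without it

    model = fasttext.load_model(model_path)

    def detect(text: str) -> Tuple[Optional[str], float]:
        # FastText predicts one line at a time, so newlines are replaced.
        labels, probs = model.predict(text.replace("\n", " "))
        return labels[0].replace("__label__", ""), float(probs[0])

    return detect
```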
---
## Filter 3 — Toxicity & PII Filter
Removes or redacts documents containing harmful or private content.
| Pattern | Action |
|---------|--------|
| Korean resident registration number (주민등록번호) | ❌ Entire document removed |
| Credit card number | ❌ Entire document removed |
| Profanity (English and Korean) | ❌ Entire document removed |
| Spam keyword patterns | ❌ Entire document removed |
| Phone numbers | ✏️ Redacted → `[PHONE]` |
| Email addresses | ✏️ Redacted → `[EMAIL]` |
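
The difference between removal and redaction can be illustrated with a small sketch. The regular expressions below are simplified assumptions for illustration (the pipeline's actual patterns are not published in this card), and the credit card, profanity, and spam checks are left out. The function name `filter_pii` is hypothetical.

```python
import re
from typing import Optional

# Illustrative patterns only (assumptions, not the pipeline's real rules).
# Resident registration number: six digits, a hyphen, seven digits.
RRN_RE = re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)")
# Korean mobile number such as 010-1234-5678.
PHONE_RE = re.compile(r"(?<!\d)01[016789]-\d{3,4}-\d{4}(?!\d)")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def filter_pii(text: str) -> Optional[str]:
    """Return the redacted text, or None if the whole document is removed."""
    # Removal rules: one match drops the entire document.
    if RRN_RE.search(text):
        return None
    # Redaction rules: the document is kept, matches are replaced.
    text = PHONE_RE.sub("[PHONE]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text
```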
---
## Filter Pipeline Flow
```mermaid
flowchart TD
IN["Normalized Document\n(from Stage 0.5)"] --> Q["Quality Filter\nchar count · digit ratio\nrepeat ratio · HTML ratio"]
Q -->|Fail| REJECT1["❌ Rejected\n(quality)"]
Q -->|Pass| L["Language Filter\nFastText lid.176.bin\nconfidence ≥ 0.75"]
L -->|Fail| REJECT2["❌ Rejected\n(language)"]
L -->|Pass| T["Toxicity / PII Filter\nresident ID · credit card\nprofanity · spam"]
T -->|Fail / Redact| REJECT3["❌ Removed or Redacted\n(toxicity / PII)"]
T -->|Pass| OUT["✅ Filtered Document\n(written to Stage 1 output)"]
```
---
## Per-Dataset Filter Statistics
| Dataset | Domain | Input | Passed | Pass Rate | Quality Failed | Lang Failed | Toxic Failed |
|---------|--------|-------|--------|-----------|---------------|-------------|--------------|
| gutenberg | english | 48,284 | 42,914 | 88.9% | 2,428 | 1,735 | 1,198 |
| openwebtext | english | 8,013,769 | 7,560,439 | 94.3% | 54,443 | 172,473 | 226,414 |
| ccnews | english | 71,629,440 | 67,512,723 | 94.3% | 2,084,287 | 1,188,980 | 843,450 |
| falcon-refinedweb | english | 59,839,870 | 50,720,322 | 84.8% | 2,783,522 | 4,781,279 | 1,537,584 |
| fineweb | english | 56,181,731 | 49,909,714 | 88.8% | 1,150,071 | 2,361,574 | 2,760,372 |
| wikipedia (en) | english | 6,407,814 | 5,488,464 | 85.7% | 342,285 | 435,857 | 141,061 |
| namuwiki | korean | 565,293 | 474,367 | 83.9% | 52,389 | 5,364 | 32,573 |
| wikipedia_ko | korean | 515,425 | 472,758 | 91.7% | 16,148 | 13,005 | 13,514 |
| oscar_ko_only | korean | 3,675,420 | 2,541,094 | 69.1% | 867,746 | 21,308 | 245,244 |
| korean_webtext | korean | 1,284,878 | 1,154,654 | 89.9% | 11,502 | 870 | 117,847 |
| aihub_modu | korean | 58,997 | 54,283 | 92.0% | 2,350 | 18 | 2,346 |
| aihub_books | korean | 5,823 | 2,782 | 47.8% | 2,532 | 2 | 507 |
| aihub_online_colloquial | korean | 22,859 | 19,775 | 86.5% | 63 | 46 | 2,975 |
| github-top-code | code | 1,121,474 | 735,300 | 65.6% | 384,617 | 0 | 0 |
| codeparrot_clean | code | 5,365,659 | 4,535,137 | 84.5% | 830,522 | 0 | 0 |
| starcoderdata | code | 42,191,832 | 31,833,073 | 75.4% | 10,358,341 | 0 | 0 |
| arxiv | science | 1,089,469 | 429,145 | 39.4% | 426,975 | 232,666 | 683 |
| open-web-math | science | 6,315,233 | 3,842,014 | 60.8% | 747,428 | 1,468,103 | 257,688 |
| peS2o | science | 69,950,435 | 66,057,444 | 94.4% | 1,665,868 | 2,075,245 | 151,878 |
| **TOTAL** | | **334,283,705** | **293,386,402** | **87.8%** | | | |
> **Note on arxiv (39.4% pass rate):** arXiv papers contain heavy LaTeX markup. Many documents failed the quality filter due to high symbol/digit density from mathematical notation. This is expected behavior.
> **Note on aihub_books (47.8% pass rate):** Most rejections came from the quality filter (2,532 of 3,041 rejected documents). The toxicity filter rejected a further 507, likely because older literary content contains dated language patterns flagged by the profanity filter.
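
The pass rates and failure breakdowns in the table can be recomputed from the raw counts. The short sketch below does this for the arxiv row; the `summarize` helper and the dictionary keys are hypothetical names chosen for illustration, while the numbers are taken from the table.

```python
def summarize(row: dict) -> dict:
    """Compute rejected count, pass rate (%), and each filter's share of rejections (%)."""
    rejected = row["input"] - row["passed"]
    return {
        "rejected": rejected,
        "pass_rate": round(100 * row["passed"] / row["input"], 1),
        "share": {
            name: round(100 * row[name] / rejected, 1)
            for name in ("quality", "lang", "toxic")
        },
    }


# Counts from the arxiv row of the table.
arxiv = {
    "input": 1_089_469,
    "passed": 429_145,
    "quality": 426_975,
    "lang": 232_666,
    "toxic": 683,
}

summary = summarize(arxiv)
print(summary["pass_rate"])  # pass rate in percent
print(summary["share"])      # which filter accounts for most rejections
```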
---
## Token Estimates by Domain (Post-Filter, Pre-Dedup)
Tokens are counted using the keural SentencePiece tokenizer.
| Domain | Tokens | Target | Progress |
|--------|--------|--------|----------|
| English | 123.63B | 175B | ██████████████░░░░░░ 70.6% |
| Science | 66.24B | 75B | █████████████████░░░ 88.3% |
| Code | 52.00B | 75B | █████████████░░░░░░░ 69.3% |
| Korean | 7.14B | 175B | █░░░░░░░░░░░░░░░░░░░ 4.1% |
| **Total** | **249.01B** | **500B** | **49.8%** |
---
## Token Counts by Dataset (Post-Filter)
| Dataset | Domain | Tokens |
|---------|--------|--------|
| peS2o | science | 51.27B |
| starcoderdata | code | 42.04B |
| ccnews | english | 40.00B |
| fineweb | english | 34.57B |
| falcon-refinedweb | english | 32.96B |
| codeparrot_clean | code | 8.81B |
| openwebtext | english | 8.30B |
| open-web-math | science | 8.05B |
| arxiv | science | 6.93B |
| wikipedia (en) | english | 4.09B |
| gutenberg | english | 3.71B |
| aihub_modu | korean | 1.92B |
| oscar_ko_only | korean | 1.77B |
| korean_webtext | korean | 1.30B |
| github-top-code | code | 1.14B |
| namuwiki | korean | 1.05B |
| aihub_books | korean | 728.26M |
| wikipedia_ko | korean | 310.71M |
| aihub_online_colloquial | korean | 66.77M |
---
## Document Schema
Each document in this repository follows this schema:
```json
{
  "dataset": "ccnews",
  "id": "ccnews_000012345",
  "domain": "english",
  "text": "The filtered document text...",
  "timestamp": "2024-11-01T10:30:00Z"
}
```
### Field Descriptions
| Field | Type | Description |
|-------|------|-------------|
| `dataset` | string | Source dataset name (e.g. `ccnews`, `peS2o`) |
| `id` | string | Unique document identifier (inherited from Stage 0.5 `doc_id`) |
| `domain` | string | One of: `english`, `korean`, `code`, `science` |
| `text` | string | Filtered document text (PII redacted where applicable) |
| `timestamp` | string\|null | Source timestamp if available |
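
Because the format is one JSON object per line, documents can be read and checked with the standard library alone. The following is a minimal sketch, assuming the field types listed above; the helper name `iter_documents` and the file name in the usage comment are hypothetical, since the card does not describe the repository's file layout.

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = {"dataset": str, "id": str, "domain": str, "text": str}
DOMAINS = {"english", "korean", "code", "science"}


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines and validate each document against the schema."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(doc.get(field), expected_type):
                raise ValueError(f"line {line_no}: bad or missing field {field!r}")
        if doc["domain"] not in DOMAINS:
            raise ValueError(f"line {line_no}: unknown domain {doc['domain']!r}")
        timestamp = doc.get("timestamp")
        if timestamp is not None and not isinstance(timestamp, str):
            raise ValueError(f"line {line_no}: timestamp must be a string or null")
        yield doc


# Usage with a local file (hypothetical file name):
#
#     with open("ccnews.jsonl", encoding="utf-8") as f:
#         for doc in iter_documents(f):
#             print(doc["id"], len(doc["text"]))
```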
---
## Processing Timestamps
| Event | Date (KST) |
|-------|------------|
| Filtering begins (first batch) | 2026-04-01 |
| Stage 1 active across all datasets | 2026-04-08 |
| All 19 datasets filtered | 2026-04-09 |
| Upload to this HuggingFace repo | 2026-04-09 |
| Last updated | 2026-04-09 22:45 KST |
---
## Licenses
This dataset inherits mixed licenses from source datasets. See [mkd-chanwoo/normalized-datasets-for-koreanLLM](https://huggingface.co/datasets/mkd-chanwoo/normalized-datasets-for-koreanLLM) for the full per-dataset license table.
> ⚠️ **License Notice**: Please review the license of each individual source dataset before use.
---
## Related Repositories
| Repo | Stage | Description |
|------|-------|-------------|
| [mkd-chanwoo/normalized-datasets-for-koreanLLM](https://huggingface.co/datasets/mkd-chanwoo/normalized-datasets-for-koreanLLM) | Stage 0.5 | Input to this stage — normalized raw data |
| **This repo** | Stage 1 | Quality + language + toxicity filtered |
| [mkd-chanwoo/keural-datasets](https://huggingface.co/datasets/mkd-chanwoo/keural-datasets) | Stage 2 | Final deduplicated + sharded production data |
| [mkd-chanwoo/simplemodel-270M](https://huggingface.co/mkd-chanwoo/simplemodel-270M) | Model | LLM trained on this pipeline's output |
Provider: mkd-chanwoo



