mkd-chanwoo/filtered-datasets-for-koreanLLM

Name: mkd-chanwoo/filtered-datasets-for-koreanLLM
Creator: mkd-chanwoo
Published: 2026-04-10 03:00:44
License: No description available

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/mkd-chanwoo/filtered-datasets-for-koreanLLM

Resource description:

---
language:
- en
- ko
license: other
task_categories:
- text-generation
tags:
- pretraining
- nlp
- korean
- english
- code
- science
- corpus
- filtered
- quality-filtered
- jsonl
pretty_name: Filtered Datasets for Korean LLM (Stage 1)
size_categories:
- 100M<n<1B
---

# Filtered Datasets for Korean LLM

> **Stage 1: Filtering output** of the Keural Korean LLM pretraining pipeline.
> Every document has passed a 3-stage filter: quality, language detection, and toxicity/PII removal.
> Input: ~334M documents → Output: **~293M documents (87.8% pass rate)**.

---

## Quick Stats

| Metric | Value |
|--------|-------|
| Total input documents | 334,283,705 |
| Total documents passed | 293,386,402 |
| Overall pass rate | 87.8% |
| Domains | English · Korean · Code · Science |
| Format | JSONL (one JSON object per line) |
| Filters applied | Quality · Language · Toxicity/PII |
| Tokenizer | keural SentencePiece (`mkd-ai/keural-tokenizer`) |
| Last updated | 2026-04-09 |

---

## Where This Fits in the Pipeline

```mermaid
flowchart LR
    A["Stage 0\nRaw Download"] --> B["Stage 0.5\nNormalization\nmkd-chanwoo/normalized-datasets-for-koreanLLM"]
    B --> C["Stage 1\nFiltering\n← YOU ARE HERE\n293M docs / 248B tokens"]
    C --> D["Stage 2\nDedup + Shard\nmkd-chanwoo/keural-datasets\n329M docs / 220B tokens"]
    style C fill:#f0a500,color:#000
```

---

## For Beginners: What Does Filtering Do?

Raw text from the internet contains a lot of noise:

- Documents in the wrong language (e.g., a French article in an English dataset)
- Very short or very long documents that are likely junk
- Documents with too many numbers, URLs, or repeated lines (often scraped boilerplate)
- Documents containing toxic content, hate speech, or personal information (phone numbers, resident IDs)

**Stage 1 filtering** removes or redacts these problems so that only clean, high-quality documents proceed to the next stage.

Three filters run **sequentially** on every document. If a document fails any one filter, it is discarded.

---

## Filter 1: Quality Filter

Removes structurally poor documents based on statistical text properties.

| Rule | Threshold | Action |
|------|-----------|--------|
| Minimum document length | < 200 characters | ❌ Reject |
| Maximum document length | > 1,000,000 characters | ❌ Reject |
| Digit character ratio | > 30% of all characters | ❌ Reject |
| Repeated line ratio | > 20% of lines are duplicates | ❌ Reject |
| Bullet point / list ratio | > 90% of lines start with bullet | ❌ Reject |
| HTML tag ratio | > 10% HTML tags in content | ❌ Reject |

---

## Filter 2: Language Detection Filter

Ensures each document is actually in the language its domain declares (English or Korean). Code datasets are **exempt** from language filtering (code is language-agnostic).

| Setting | Value |
|---------|-------|
| Primary detector | **FastText** `lid.176.bin` (supports 176 languages) |
| Fallback detector | `langdetect` (used if FastText is inconclusive) |
| Confidence threshold | **0.75** (75% minimum certainty) |
| Applied to | English domain, Korean domain |
| Not applied to | Code domain, Science domain |

**How it works:**

1. FastText analyzes the document text
2. Returns predicted language + confidence score
3. If predicted language matches domain AND confidence ≥ 0.75 → ✅ Pass
4. Otherwise → ❌ Reject

---

## Filter 3: Toxicity & PII Filter

Removes or redacts documents containing harmful or private content.

| Pattern | Action |
|---------|--------|
| Korean resident registration number (주민등록번호) | ❌ Entire document removed |
| Credit card number | ❌ Entire document removed |
| Profanity (English and Korean) | ❌ Entire document removed |
| Spam keyword patterns | ❌ Entire document removed |
| Phone numbers | ✏️ Redacted → `[PHONE]` |
| Email addresses | ✏️ Redacted → `[EMAIL]` |

---

## Filter Pipeline Flow

```mermaid
flowchart TD
    IN["Normalized Document\n(from Stage 0.5)"] --> Q["Quality Filter\nchar count · digit ratio\nrepeat ratio · HTML ratio"]
    Q -->|Fail| REJECT1["❌ Rejected\n(quality)"]
    Q -->|Pass| L["Language Filter\nFastText lid.176.bin\nconfidence ≥ 0.75"]
    L -->|Fail| REJECT2["❌ Rejected\n(language)"]
    L -->|Pass| T["Toxicity / PII Filter\nresident ID · credit card\nprofanity · spam"]
    T -->|Fail / Redact| REJECT3["❌ Removed or Redacted\n(toxicity / PII)"]
    T -->|Pass| OUT["✅ Filtered Document\n(written to Stage 1 output)"]
```

---

## Per-Dataset Filter Statistics

| Dataset | Domain | Input | Passed | Pass Rate | Quality Failed | Lang Failed | Toxic Failed |
|---------|--------|-------|--------|-----------|---------------|-------------|--------------|
| gutenberg | english | 48,284 | 42,914 | 88.9% | 2,428 | 1,735 | 1,198 |
| openwebtext | english | 8,013,769 | 7,560,439 | 94.3% | 54,443 | 172,473 | 226,414 |
| ccnews | english | 71,629,440 | 67,512,723 | 94.3% | 2,084,287 | 1,188,980 | 843,450 |
| falcon-refinedweb | english | 59,839,870 | 50,720,322 | 84.8% | 2,783,522 | 4,781,279 | 1,537,584 |
| fineweb | english | 56,181,731 | 49,909,714 | 88.8% | 1,150,071 | 2,361,574 | 2,760,372 |
| wikipedia (en) | english | 6,407,814 | 5,488,464 | 85.7% | 342,285 | 435,857 | 141,061 |
| namuwiki | korean | 565,293 | 474,367 | 83.9% | 52,389 | 5,364 | 32,573 |
| wikipedia_ko | korean | 515,425 | 472,758 | 91.7% | 16,148 | 13,005 | 13,514 |
| oscar_ko_only | korean | 3,675,420 | 2,541,094 | 69.1% | 867,746 | 21,308 | 245,244 |
| korean_webtext | korean | 1,284,878 | 1,154,654 | 89.9% | 11,502 | 870 | 117,847 |
| aihub_modu | korean | 58,997 | 54,283 | 92.0% | 2,350 | 18 | 2,346 |
| aihub_books | korean | 5,823 | 2,782 | 47.8% | 2,532 | 2 | 507 |
| aihub_online_colloquial | korean | 22,859 | 19,775 | 86.5% | 63 | 46 | 2,975 |
| github-top-code | code | 1,121,474 | 735,300 | 65.6% | 384,617 | 0 | 0 |
| codeparrot_clean | code | 5,365,659 | 4,535,137 | 84.5% | 830,522 | 0 | 0 |
| starcoderdata | code | 42,191,832 | 31,833,073 | 75.4% | 10,358,341 | 0 | 0 |
| arxiv | science | 1,089,469 | 429,145 | 39.4% | 426,975 | 232,666 | 683 |
| open-web-math | science | 6,315,233 | 3,842,014 | 60.8% | 747,428 | 1,468,103 | 257,688 |
| peS2o | science | 69,950,435 | 66,057,444 | 94.4% | 1,665,868 | 2,075,245 | 151,878 |
| **TOTAL** | | **334,283,705** | **293,386,402** | **87.8%** | | | |

> **Note on arxiv (39.4% pass rate):** arXiv papers contain heavy LaTeX markup. Many documents failed the quality filter due to high symbol/digit density from mathematical notation. This is expected behavior.

> **Note on aihub_books (47.8% pass rate):** Most rejections came from the quality filter (2,532 of the 3,041 rejected documents). A further 507 documents failed the toxicity filter, likely due to older literary content containing dated language patterns flagged by the profanity filter.

---

## Token Estimates by Domain (Post-Filter, Pre-Dedup)

Tokens are counted using the keural SentencePiece tokenizer.

| Domain | Tokens | Target | Progress |
|--------|--------|--------|----------|
| English | 123.63B | 175B | ██████████████░░░░░░ 70.6% |
| Science | 66.24B | 75B | █████████████████░░░ 88.3% |
| Code | 52.00B | 75B | █████████████░░░░░░░ 69.3% |
| Korean | 7.14B | 175B | █░░░░░░░░░░░░░░░░░░░ 4.1% |
| **Total** | **248.01B** | **500B** | **49.6%** |

---

## Token Counts by Dataset (Post-Filter)

| Dataset | Domain | Tokens |
|---------|--------|--------|
| peS2o | science | 51.27B |
| starcoderdata | code | 42.04B |
| ccnews | english | 40.00B |
| fineweb | english | 34.57B |
| falcon-refinedweb | english | 32.96B |
| codeparrot_clean | code | 8.81B |
| openwebtext | english | 8.30B |
| open-web-math | science | 8.05B |
| arxiv | science | 6.93B |
| wikipedia (en) | english | 4.09B |
| gutenberg | english | 3.71B |
| aihub_modu | korean | 1.92B |
| oscar_ko_only | korean | 1.77B |
| korean_webtext | korean | 1.30B |
| github-top-code | code | 1.14B |
| namuwiki | korean | 1.05B |
| aihub_books | korean | 728.26M |
| wikipedia_ko | korean | 310.71M |
| aihub_online_colloquial | korean | 66.77M |

---

## Document Schema

Each document in this repository follows this schema:

```json
{
  "dataset": "ccnews",
  "id": "ccnews_000012345",
  "domain": "english",
  "text": "The filtered document text...",
  "timestamp": "2024-11-01T10:30:00Z"
}
```

### Field Descriptions

| Field | Type | Description |
|-------|------|-------------|
| `dataset` | string | Source dataset name (e.g. `ccnews`, `peS2o`) |
| `id` | string | Unique document identifier (inherited from Stage 0.5 `doc_id`) |
| `domain` | string | One of: `english`, `korean`, `code`, `science` |
| `text` | string | Filtered document text (PII redacted where applicable) |
| `timestamp` | string\|null | Source timestamp if available |

---

## Processing Timestamps

| Event | Date (KST) |
|-------|------------|
| Filtering begins (first batch) | 2026-04-01 |
| Stage 1 active across all datasets | 2026-04-08 |
| All 19 datasets filtered | 2026-04-09 |
| Upload to this HuggingFace repo | 2026-04-09 |
| Last updated | 2026-04-09 22:45 KST |

---

## Licenses

This dataset inherits mixed licenses from source datasets. See [mkd-chanwoo/normalized-datasets-for-koreanLLM](https://huggingface.co/datasets/mkd-chanwoo/normalized-datasets-for-koreanLLM) for the full per-dataset license table.

> ⚠️ **License Notice**: Please review the license of each individual source dataset before use.

---

## Related Repositories

| Repo | Stage | Description |
|------|-------|-------------|
| [mkd-chanwoo/normalized-datasets-for-koreanLLM](https://huggingface.co/datasets/mkd-chanwoo/normalized-datasets-for-koreanLLM) | Stage 0.5 | Input to this stage: normalized raw data |
| **This repo** | Stage 1 | Quality + language + toxicity filtered |
| [mkd-chanwoo/keural-datasets](https://huggingface.co/datasets/mkd-chanwoo/keural-datasets) | Stage 2 | Final deduplicated + sharded production data |
| [mkd-chanwoo/simplemodel-270M](https://huggingface.co/mkd-chanwoo/simplemodel-270M) | Model | LLM trained on this pipeline's output |
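
The quality thresholds and the language pass rule listed in the dataset card can be illustrated with a short Python sketch. This is not the pipeline's actual code, which the card does not publish. The function names, the bullet prefixes, the HTML tag regex, and the exact way each ratio is computed are assumptions made for illustration; only the numeric thresholds come from the tables. The language check takes an already-predicted language code and confidence (for example from a FastText detector) rather than calling a detector itself.

```python
import re
from collections import Counter
from typing import Optional

# Thresholds taken from the Filter 1 and Filter 2 tables in the card.
MIN_CHARS = 200
MAX_CHARS = 1_000_000
MAX_DIGIT_RATIO = 0.30
MAX_REPEATED_LINE_RATIO = 0.20
MAX_BULLET_LINE_RATIO = 0.90
MAX_HTML_RATIO = 0.10
MIN_LANG_CONFIDENCE = 0.75

# Assumptions (not specified in the card): what counts as an HTML tag
# and which prefixes mark a bullet line.
HTML_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")
BULLET_PREFIXES = ("-", "*", "•")


def quality_reject_reason(text: str) -> Optional[str]:
    """Return the name of the first failed rule, or None if the text passes."""
    n = len(text)
    if n < MIN_CHARS:
        return "too_short"
    if n > MAX_CHARS:
        return "too_long"

    digits = sum(ch.isdigit() for ch in text)
    if digits / n > MAX_DIGIT_RATIO:
        return "digit_ratio"

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        # Count every copy of a line beyond its first occurrence as a duplicate.
        repeated = sum(count - 1 for count in Counter(lines).values())
        if repeated / len(lines) > MAX_REPEATED_LINE_RATIO:
            return "repeated_lines"
        bullets = sum(ln.startswith(BULLET_PREFIXES) for ln in lines)
        if bullets / len(lines) > MAX_BULLET_LINE_RATIO:
            return "bullet_ratio"

    # Share of characters that sit inside HTML tags.
    html_chars = sum(len(m.group(0)) for m in HTML_TAG.finditer(text))
    if html_chars / n > MAX_HTML_RATIO:
        return "html_ratio"
    return None


def passes_language(domain: str, predicted: str, confidence: float) -> bool:
    """Apply the pass rule: language matches the domain AND confidence >= 0.75."""
    expected = {"english": "en", "korean": "ko"}.get(domain)
    if expected is None:
        # Following the settings table: domains other than English/Korean are exempt.
        return True
    return predicted == expected and confidence >= MIN_LANG_CONFIDENCE
```

A document would go to the toxicity/PII stage only if `quality_reject_reason(text)` returns `None` and `passes_language(...)` returns `True`, mirroring the sequential order in the pipeline flow diagram.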

Provider:

mkd-chanwoo
