mkd-chanwoo/normalized-datasets-for-koreanLLM
Source: Hugging Face, updated 2026-04-21, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mkd-chanwoo/normalized-datasets-for-koreanLLM
Resource description:
---
language:
- en
- ko
license: other
task_categories:
- text-generation
tags:
- pretraining
- nlp
- korean
- english
- code
- science
- corpus
- normalized
- jsonl
pretty_name: Normalized Datasets for Korean LLM (Stage 0.5)
size_categories:
- 100M<n<1B
---
# Normalized Datasets for Korean LLM
Stage 0.5 — Normalization output of the Keural Korean LLM pretraining pipeline.
Raw text from 38 source datasets converted into a single unified JSONL schema.
No filtering has been applied. This is the clean, structured raw form.
---
## Quick Stats
| Metric | Value |
|--------|-------|
| Total documents normalized | 836,302,981 |
| Number of source datasets | 38 |
| Domains | English, Korean, Code, Science |
| Format | JSONL (one JSON object per line) |
| Schema version | v2 |
| Pipeline stage | Stage 0.5 (after download, before filtering) |
| Last updated | 2026-04-20 |
---
## Where This Fits in the Pipeline
```
Stage 0            Stage 0.5            Stage 1          Stage 2
Raw Download  -->  Normalization  -->   Filtering  -->   Dedup + Shard
~1.5 TB            ~836M docs           ~649M docs       ~549B tokens
                   (YOU ARE HERE)
```
---
## What Is Normalization?
Each source dataset stores its text in a different file format and under a different field name.
For example, Gutenberg uses the field name `TEXT`, but CCNews uses `plain_text`, and StarCoderData uses `content`.
Normalization solves this by:
1. Reading each source dataset in its original format
2. Extracting the text from its specific field
3. Writing everything into one unified schema with the same field names
4. Adding metadata like `domain`, `language`, `doc_id`, `license`
After normalization, all downstream stages (filtering, deduplication) work on one consistent format regardless of the original source.
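The four steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper `normalize_rows` is hypothetical, the field mapping is copied from the tables in this card, the 9-digit zero padding of `doc_id` is inferred from the schema example `gutenberg_000000042`, and skipping rows without usable text is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator

# Source text field per dataset key, copied from the mapping tables in this card.
SOURCE_FIELDS = {
    "gutenberg": "TEXT",
    "ccnews": "plain_text",
    "starcoderdata": "content",
    "openwebtext": "text",
}


def normalize_rows(
    rows: Iterable[Dict[str, Any]],
    source_name: str,
    domain: str,
    language: str,
    license_name: str,
) -> Iterator[Dict[str, Any]]:
    """Yield unified-schema documents for one source dataset (hypothetical helper)."""
    field = SOURCE_FIELDS[source_name]
    for index, row in enumerate(rows):
        text = row.get(field)
        if not isinstance(text, str) or not text:
            # Assumption: rows without usable text are skipped rather than written.
            continue
        yield {
            # Assumption: 9-digit zero padding, inferred from "gutenberg_000000042".
            "doc_id": f"{source_name}_{index:09d}",
            "source_name": source_name,
            "domain": domain,
            "language": language,
            "text": text,  # copied unchanged from the source field
            "license": license_name,
            "source_index": index,
            "processing_version": "v2",
        }
```

Because the source field is looked up per dataset key, downstream code only ever reads `text`, whichever of `TEXT`, `plain_text`, or `content` the source used.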
---
## Source Datasets & Field Mappings
### English Domain
| Dataset Key | HuggingFace Source | Source Field | Docs Normalized | Language |
|-------------|-------------------|-------------|-----------------|----------|
| `gutenberg` | [sedthh/gutenberg_english](https://huggingface.co/datasets/sedthh/gutenberg_english) | `TEXT` | 48,284 | en |
| `openwebtext` | [Skylion007/openwebtext](https://huggingface.co/datasets/Skylion007/openwebtext) | `text` | 8,013,769 | en |
| `ccnews` | [stanford-oval/ccnews](https://huggingface.co/datasets/stanford-oval/ccnews) | `plain_text` | 71,629,440 | en |
| `falcon-refinedweb` | [tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | `content` | 59,839,870 | en |
| `fineweb` | [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | `text` | 56,181,731 | en |
| `fineweb_edu` | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | `text` | 57,121,167 | en |
| `hplt_english_1` | [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) | `text` | 24,520,357 | en |
| `hplt_english_2` | [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) | `text` | 16,085,779 | en |
| `hplt_english_3` | [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) | `text` | 22,951,515 | en |
| `wikipedia` | [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) config: `20231101.en` | `text` | 6,407,814 | en |
### Korean Domain
| Dataset Key | Source | Source Field | Docs Normalized | Language |
|-------------|--------|-------------|-----------------|----------|
| `namuwiki` | [heegyu/namuwiki-extracted](https://huggingface.co/datasets/heegyu/namuwiki-extracted) | `text` | 565,293 | ko |
| `wikipedia_ko` | [lcw99/wikipedia-korean-20240501](https://huggingface.co/datasets/lcw99/wikipedia-korean-20240501) | `text` | 515,425 | ko |
| `oscar_ko_only` | [lcw99/oscar-ko-only](https://huggingface.co/datasets/lcw99/oscar-ko-only) | `text` | 3,675,420 | ko |
| `korean_webtext` | [HAERAE-HUB/KOREAN-WEBTEXT](https://huggingface.co/datasets/HAERAE-HUB/KOREAN-WEBTEXT) | `text` | 1,284,878 | ko |
| `hplt_korean` | [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) | `text` | 38,866,835 | ko |
| `culturax_ko` | [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | `text` | 17,072,188 | ko |
| `c4_korean` | [allenai/c4](https://huggingface.co/datasets/allenai/c4) | `text` | 15,618,718 | ko |
| `cc100_documents_korean` | [singletongue/cc100-documents](https://huggingface.co/datasets/singletongue/cc100-documents) | `text` | 35,678,358 | ko |
| `fineweb2_korean` | [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) | `text` | 59,423,122 | ko |
| `wanjuan_korean` | [opendatalab/WanJuan-Korean](https://huggingface.co/datasets/opendatalab/WanJuan-Korean) | `content` | 68,894,937 | ko |
| `aihub_modu` | AIHub — Korean government open data (local) | parsed from AIHub format | 58,997 | ko |
| `aihub_books` | AIHub — Korean government open data (local) | parsed from AIHub format | 5,823 | ko |
| `aihub_online_colloquial` | AIHub — Korean government open data (local) | parsed from AIHub format | 22,859 | ko |
| `aihub_korean_corpus_literature` | AIHub — Korean government open data (local) | parsed from AIHub format | 908 | ko |
### Code Domain
| Dataset Key | HuggingFace Source | Source Field | Docs Normalized | Language |
|-------------|-------------------|-------------|-----------------|----------|
| `github-top-code` | [ronantakizawa/github-top-code](https://huggingface.co/datasets/ronantakizawa/github-top-code) | `content` | 1,121,474 | en (code) |
| `codeparrot_clean` | [codeparrot/codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean) | `content` | 5,365,659 | en (code) |
| `starcoderdata` | [bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) | `content` | 42,191,832 | en (code) |
| `the_stack_c` | [bigcode/the-stack](https://huggingface.co/datasets/bigcode/the-stack) | `content` | 21,383,810 | en (code) |
| `the_stack_java` | [bigcode/the-stack](https://huggingface.co/datasets/bigcode/the-stack) | `content` | 31,222,087 | en (code) |
| `the_stack_python` | [bigcode/the-stack](https://huggingface.co/datasets/bigcode/the-stack) | `content` | 24,214,204 | en (code) |
### Science Domain
| Dataset Key | HuggingFace Source | Source Field | Docs Normalized | Language |
|-------------|-------------------|-------------|-----------------|----------|
| `arxiv` | [KiteFishAI/arxiv-tex-corpus-full](https://huggingface.co/datasets/KiteFishAI/arxiv-tex-corpus-full) | `text` | 1,089,469 | en |
| `open-web-math` | [open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) | `text` | 6,315,233 | en |
| `peS2o` | [allenai/peS2o](https://huggingface.co/datasets/allenai/peS2o) | `text` | 69,950,435 | en |
| `s2orc` | [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) | `abstract` | 43,180,281 | en |
| `algebraic_stack` | [typeof/algebraic-stack](https://huggingface.co/datasets/typeof/algebraic-stack) | `text` | 3,440,226 | en |
| `fiwebmath_4plus` | [HuggingFaceTB/finemath](https://huggingface.co/datasets/HuggingFaceTB/finemath) | `text` | 6,699,493 | en |
| `pubmed_abstracts` | [casinca/PUBMED_title_abstracts_2019_baseline](https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline) | `text` | 15,517,555 | en |
| `open_med_text` | [ywchoi/OpenMedText](https://huggingface.co/datasets/ywchoi/OpenMedText) | `text` | 127,736 | en |
---
## Total Documents by Domain
| Domain | Docs Normalized | % of Total |
|--------|-----------------|------------|
| English | 322,799,726 | 38.6% |
| Korean | 241,683,761 | 28.9% |
| Code | 125,499,066 | 15.0% |
| Science | 146,320,428 | 17.5% |
| **Total** | **836,302,981** | **100%** |
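The percentages follow directly from the per-domain counts. A short check, using only the numbers in the table above:

```python
# Per-domain document counts, copied from the table above.
DOMAIN_DOCS = {
    "english": 322_799_726,
    "korean": 241_683_761,
    "code": 125_499_066,
    "science": 146_320_428,
}

total = sum(DOMAIN_DOCS.values())

# Share of each domain, in percent, rounded to one decimal place.
shares = {domain: round(100 * count / total, 1) for domain, count in DOMAIN_DOCS.items()}

print(f"total = {total:,}")
for domain, share in shares.items():
    print(f"{domain}: {share}%")
```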
---
## Normalization Process
```
Source File                 Read original format          Extract text from
(HuggingFace / AIHub)  -->  (JSONL / Parquet / TXT)  -->  dataset-specific field
                                                          (TEXT / text / content / plain_text)
                                                                    |
                                                                    v
Assign doc_id               Write unified JSONL           Update checkpoint
= source_name + index  -->  with full metadata       -->  (resumable mid-stream)
```
Normalization is resumable. Each dataset tracks `line_index` and `doc_id` in a checkpoint file, so if the process is interrupted, it resumes exactly where it left off.
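A minimal sketch of how such a checkpoint could work. The checkpoint file layout (a small JSON file holding `line_index` and `doc_id`) and the helper names `load_checkpoint`, `save_checkpoint`, and `resume` are assumptions made for illustration; the card does not describe the actual on-disk format.

```python
import json
import os
from typing import Callable, Iterable


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"line_index": 0, "doc_id": None}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path: str, line_index: int, doc_id: str) -> None:
    """Write the checkpoint via a temp file so a crash never leaves it half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"line_index": line_index, "doc_id": doc_id}, f)
    os.replace(tmp_path, path)


def resume(lines: Iterable[str], checkpoint_path: str,
           process: Callable[[int, str], str]) -> None:
    """Process lines, skipping those already covered by the checkpoint."""
    state = load_checkpoint(checkpoint_path)
    for line_index, line in enumerate(lines):
        if line_index < state["line_index"]:
            continue  # already handled in an earlier run
        doc_id = process(line_index, line)
        # Store the index of the NEXT line to process.
        save_checkpoint(checkpoint_path, line_index + 1, doc_id)
```

Saving after every line keeps the sketch simple; a real run over hundreds of millions of rows would more plausibly checkpoint every N lines to limit I/O.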
What normalization does NOT do:
- Does not modify or clean text content
- Does not filter documents
- Does not deduplicate
- Does not re-encode or translate
- Only reshapes structure and adds metadata
---
## Unified Document Schema
Every document in this repository follows this exact schema:
```json
{
  "doc_id": "gutenberg_000000042",
  "source_name": "gutenberg",
  "domain": "english",
  "language": "en",
  "text": "The full original text of the document...",
  "url": "https://source-url.com/page (if available, else null)",
  "license": "Public Domain",
  "source_file": "data/raw/gutenberg/train-00000-of-00001.parquet",
  "source_index": 42,
  "timestamp": "2026-03-15T08:22:11Z",
  "processing_version": "v2"
}
```
### Field Descriptions
| Field | Type | Description |
|-------|------|-------------|
| `doc_id` | string | Unique ID: `{source_name}_{source_index}`, with the index zero-padded as in the example (`gutenberg_000000042`) |
| `source_name` | string | Dataset key (e.g. `gutenberg`, `ccnews`) |
| `domain` | string | One of: `english`, `korean`, `code`, `science` |
| `language` | string | ISO 639-1 code: `en` or `ko` |
| `text` | string | Raw document text (unmodified from source) |
| `url` | string or null | Original URL if provided by source dataset |
| `license` | string | Source dataset license |
| `source_file` | string | Local path to the source file it was read from |
| `source_index` | int | Row index within that source file |
| `timestamp` | string or null | Publication date/time if available in source |
| `processing_version` | string | Pipeline version (`v2`) |
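A record can be checked against this table with a few lines of standard-library Python. The sketch below is an illustration only: `schema_errors` is a hypothetical helper, and the rules it enforces are exactly those listed in the field descriptions (types, nullable `url` and `timestamp`, and the allowed `domain` and `language` values).

```python
from typing import Any, Dict, List

# Required fields and their Python types, following the field table above.
REQUIRED_FIELDS = {
    "doc_id": str,
    "source_name": str,
    "domain": str,
    "language": str,
    "text": str,
    "license": str,
    "source_file": str,
    "source_index": int,
    "processing_version": str,
}
# Fields that may be a string or null.
NULLABLE_STRING_FIELDS = ("url", "timestamp")
DOMAINS = {"english", "korean", "code", "science"}
LANGUAGES = {"en", "ko"}


def schema_errors(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema problems (empty if the doc is valid)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        value = doc.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name in NULLABLE_STRING_FIELDS:
        value = doc.get(name)
        if value is not None and not isinstance(value, str):
            errors.append(f"{name}: expected str or null")
    domain = doc.get("domain")
    if isinstance(domain, str) and domain not in DOMAINS:
        errors.append(f"domain: unexpected value {domain!r}")
    language = doc.get("language")
    if isinstance(language, str) and language not in LANGUAGES:
        errors.append(f"language: unexpected value {language!r}")
    return errors
```

Returning a list of problems rather than raising on the first one makes it easier to tally failure types when scanning a large shard.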
---
## Normalization Statistics (Seen vs Written)
"Seen" = total rows read from source. "Written" = successfully normalized.
Rows not written are rows that failed to parse (malformed JSON, empty text, encoding errors).
| Dataset | Seen | Written | Normalized Rate |
|---------|------|---------|----------------|
| aihub_books | 5,974 | 5,823 | 97.5% |
| aihub_korean_corpus_literature | 3,276,057 | 908 | <0.1% (field mismatch) |
| aihub_modu | 117,994 | 58,997 | 50.0% (deduped at read) |
| aihub_online_colloquial | 45,894 | 22,859 | 49.8% (deduped at read) |
| algebraic_stack | 3,440,694 | 3,440,226 | ~100% |
| arxiv | 1,089,469 | 1,089,469 | 100% |
| c4_korean | 15,618,718 | 15,618,718 | 100% |
| cc100_documents_korean | 35,678,358 | 35,678,358 | 100% |
| ccnews | 71,629,440 | 71,629,440 | 100% |
| codeparrot_clean | 5,365,659 | 5,365,659 | 100% |
| culturax_ko | 20,557,310 | 20,557,309 | ~100% |
| falcon-refinedweb | 59,839,870 | 59,839,870 | 100% |
| fineweb | 81,595,324 | 81,595,324 | 100% |
| fineweb2_korean | 60,904,429 | 60,904,429 | 100% |
| fineweb_edu | 57,121,167 | 57,121,167 | 100% |
| fiwebmath_4plus | 6,699,493 | 6,699,493 | 100% |
| github-top-code | 1,122,139 | 1,121,474 | ~100% |
| gutenberg | 48,285 | 48,284 | ~100% |
| hplt_english_1 | 37,128,244 | 37,128,244 | 100% |
| hplt_english_2 | 16,085,779 | 16,085,779 | 100% |
| hplt_english_3 | 22,951,515 | 22,951,515 | 100% |
| hplt_korean | 38,866,835 | 38,866,835 | 100% |
| korean_webtext | 1,284,879 | 1,284,878 | ~100% |
| namuwiki | 565,293 | 565,293 | 100% |
| open-web-math | 6,315,233 | 6,315,233 | 100% |
| open_med_text | 127,736 | 127,736 | 100% |
| openwebtext | 8,013,769 | 8,013,769 | 100% |
| oscar_ko_only | 3,675,421 | 3,675,420 | ~100% |
| peS2o | 69,950,435 | 69,950,435 | 100% |
| pubmed_abstracts | 15,518,009 | 15,517,555 | ~100% |
| s2orc | 89,455,939 | 89,455,939 | 100% |
| starcoderdata | 42,191,832 | 42,191,832 | 100% |
| the_stack_c | 21,383,832 | 21,383,810 | ~100% |
| the_stack_java | 42,429,211 | 42,429,204 | ~100% |
| the_stack_python | 24,214,270 | 24,214,204 | ~100% |
| wanjuan_korean | 68,894,938 | 68,894,937 | ~100% |
| wikipedia (en) | 6,407,814 | 6,407,814 | 100% |
| wikipedia_ko | 515,425 | 515,425 | 100% |
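The "Normalized Rate" column is simply written divided by seen. A small check against a few rows of the table, with `normalized_rate` as a hypothetical helper name:

```python
def normalized_rate(seen: int, written: int) -> float:
    """Return the share of seen rows that were written, in percent."""
    if seen <= 0:
        raise ValueError("seen must be positive")
    return 100.0 * written / seen


# (dataset, seen, written) rows copied from the table above.
ROWS = [
    ("aihub_books", 5_974, 5_823),
    ("aihub_korean_corpus_literature", 3_276_057, 908),
    ("aihub_modu", 117_994, 58_997),
    ("aihub_online_colloquial", 45_894, 22_859),
]

for name, seen, written in ROWS:
    print(f"{name}: {normalized_rate(seen, written):.2f}%")
```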
---
## Tokenizer Used for Token Counting
All `tokens_count` values in this dataset are computed using the Keural SentencePiece tokenizer:
- **Model**: `mkd-ai/keural-tokenizer`
- **Type**: SentencePiece (Unigram)
- **Vocabulary file**: `keural_tokenizer.vocab`
- **Model file**: `keural_tokenizer.model`
This is the same tokenizer used by the Keural LLM.
---
## Download & Processing Timeline
| Event | Date (KST) |
|-------|------------|
| Download of first datasets begins | 2026-04-01 |
| Normalization of first batch complete | 2026-04-08 |
| Initial 19 datasets normalized | 2026-04-09 |
| Additional datasets added (19 more) | 2026-04-10 through 2026-04-19 |
| All 38 datasets normalized | 2026-04-20 |
---
## Licenses
This dataset contains content from multiple sources with mixed licenses. Each source retains its original license.
| Dataset | License | Commercial Use |
|---------|---------|---------------|
| gutenberg | Public Domain | Yes |
| openwebtext | MIT | Yes |
| ccnews | Unknown | Caution |
| falcon-refinedweb | ODC-By 1.0 | Caution (copyright unclear) |
| fineweb | ODC-By | Yes |
| fineweb_edu | ODC-By | Yes |
| fineweb2_korean | ODC-By | Yes |
| hplt_english_1/2/3 | CC0-1.0 | Yes |
| wikipedia (en) | CC-BY-SA 3.0 | Yes (with attribution) |
| namuwiki | CC BY-NC-SA 2.0 KR | No (non-commercial only) |
| wikipedia_ko | Apache-2.0 | Yes |
| oscar_ko_only | Open (Common Crawl) | Yes |
| korean_webtext | Unspecified | Caution |
| hplt_korean | CC0-1.0 | Yes |
| culturax_ko | Mixed (mC4 / OSCAR / CC) | Caution |
| c4_korean | ODC-By | Yes |
| cc100_documents_korean | MIT | Yes |
| wanjuan_korean | Apache-2.0 | Yes |
| aihub_modu | AIHub Terms of Use | No (research only) |
| aihub_books | AIHub Terms of Use | No (research only) |
| aihub_online_colloquial | AIHub Terms of Use | No (research only) |
| aihub_korean_corpus_literature | AIHub Terms of Use | No (research only) |
| github-top-code | MIT | Yes (permissive only repos) |
| codeparrot_clean | Permissive (filtered GitHub) | Yes |
| starcoderdata | BigCode OpenRAIL-M | Restricted |
| the_stack_c | BigCode OpenRAIL-M | Restricted |
| the_stack_java | BigCode OpenRAIL-M | Restricted |
| the_stack_python | BigCode OpenRAIL-M | Restricted |
| arxiv | MIT | Caution (individual paper copyright) |
| open-web-math | ODC-By 1.0 | Yes |
| peS2o | ODC-By | Yes |
| s2orc | ODC-By | Yes |
| algebraic_stack | MIT | Yes |
| fiwebmath_4plus | ODC-By | Yes |
| pubmed_abstracts | CC0 (NLM/PubMed) | Yes |
| open_med_text | Various (medical open access) | Caution |
License Notice: This repository inherits mixed licenses from its source datasets.
Please review the license of each individual source before commercial or research use.
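Because every document carries its `source_name`, a downstream user could select documents by the "Commercial Use" column. The sketch below is one possible approach, not part of the pipeline: `filter_by_commercial_use` is a hypothetical helper, the mapping holds only a few rows copied from the table above, and the table itself is a summary that does not replace reading each source's license.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# A few rows of the "Commercial Use" column, copied from the license table above.
COMMERCIAL_USE = {
    "gutenberg": "Yes",
    "ccnews": "Caution",
    "wikipedia": "Yes (with attribution)",
    "namuwiki": "No (non-commercial only)",
    "starcoderdata": "Restricted",
}


def filter_by_commercial_use(
    docs: Iterable[Dict[str, Any]],
    allowed: Tuple[str, ...] = ("Yes",),
) -> Iterator[Dict[str, Any]]:
    """Yield only documents whose source has an allowed commercial-use flag."""
    for doc in docs:
        # Sources missing from the mapping return None and are dropped.
        if COMMERCIAL_USE.get(doc["source_name"]) in allowed:
            yield doc
```

Dropping unknown sources by default is the conservative choice here; passing a wider `allowed` tuple, such as `("Yes", "Yes (with attribution)")`, opts in to more sources explicitly.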
---
## Related Repositories
| Repo | Stage | Description |
|------|-------|-------------|
| **This repo** | Stage 0.5 | Normalized raw data |
| [mkd-chanwoo/filtered-datasets-for-koreanLLM](https://huggingface.co/datasets/mkd-chanwoo/filtered-datasets-for-koreanLLM) | Stage 1 | Quality + language + toxicity filtered |
| [mkd-chanwoo/keural-datasets](https://huggingface.co/datasets/mkd-chanwoo/keural-datasets) | Stage 2 | Final deduplicated + sharded production data |
| [mkd-chanwoo/simplemodel-270M](https://huggingface.co/mkd-chanwoo/simplemodel-270M) | Model | LLM trained on this pipeline's output |
Provided by:
mkd-chanwoo
Collected Summary
Dataset Introduction

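Construction
This dataset is an intermediate product of the Keural Korean LLM preprocessing pipeline, built to support large-scale pretraining corpus construction, and its core step is data normalization. Text is extracted from 38 heterogeneous source datasets that span English, Korean, code, and scientific literature and are stored in formats such as JSONL, Parquet, and TXT. Normalization resolves the inconsistent formats and field names of the raw data: each source is read, its raw text is extracted from a dataset-specific field (such as `TEXT`, `text`, `content`, or `plain_text`), and the result is written to one standardized JSONL format. Each document gains metadata including domain, language, document ID, and license, while the text itself is preserved as-is, with no cleaning, filtering, or deduplication. The result is about 836 million structurally consistent, normalized documents ready for downstream processing.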
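Features
As a key step in the Korean LLM pretraining workflow, this dataset is distinguished by being a pure structural normalization. It strictly follows a unified JSONL document schema in which every document carries standardized fields such as `doc_id`, `source_name`, `domain`, `language`, and `text`, which keeps data from different sources consistent. Its content is diverse: English (38.6%) and Korean (28.9%) form the largest shares, followed by science (17.5%) and code (15.0%), giving the model a cross-domain knowledge base. Importantly, the dataset preserves the original text and license information of every source without introducing modification or selection bias, so later filtering and deduplication stages start from clean, transparent data. The build process is resumable: a checkpoint mechanism keeps large-scale processing robust.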
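Usage
As an intermediate product of the preprocessing pipeline, this dataset mainly serves as standardized input for downstream data processing. Researchers and developers can load it directly from the HuggingFace platform and use its unified JSONL format and rich metadata for corpus quality filtering, language identification, toxicity screening, and similar operations. Because the data is already normalized, downstream logic no longer has to adapt to multiple raw formats, which greatly simplifies the workflow. The dataset can also serve as a reference for research on building and analyzing multilingual, multi-domain corpora, since its detailed source mappings and statistics give a reliable basis for analyzing corpus composition. Note that the dataset contains mixed licenses: check the license terms of each source before commercial or specific research use.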
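Background and Challenges
Background Overview
In large language model (LLM) pretraining, a high-quality, multilingual pretraining corpus with broad domain coverage is a key foundation for model performance. Normalized Datasets for Korean LLM (Stage 0.5 of the Keural Korean LLM pretraining pipeline) was created and published by mkd-chanwoo in April 2026 to provide uniformly structured raw data for Korean and multilingual LLMs. It combines 836,302,981 documents from 38 sources across four domains (English, Korean, code, and science), and normalization converts these heterogeneous sources into a unified JSONL format that forms a standardized base for the later filtering and deduplication stages. The core problem it addresses is the structural inconsistency of multi-source, multi-format text data in pretraining workflows. Solving it improves the efficiency and reproducibility of the data pipeline, which gives the dataset infrastructure value for Korean and multilingual LLM development.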
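Current Challenges
This dataset targets the structuring and standardization challenges of corpus construction for multilingual LLM pretraining. At the domain level, it mainly deals with integrating heterogeneous multi-source data, for example differences between datasets in text field names, storage formats, and license terms, which requires the normalization process to be highly compatible and to manage metadata well. During construction, the challenge is the reliability and consistency of large-scale processing: although normalization has a resumable checkpoint mechanism to handle interruptions, some sources (such as the AIHub datasets) reach a write rate of only about 50% because of deduplication at read time, which reflects the loss caused by raw data quality and structural complexity. The data also spans many license types, so compliance needs careful handling in later use. Korean text makes up 28.9% of all documents against 38.6% for English, which shows that balancing multilingual resources remains a challenge.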
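Common Scenarios
Classic Use Cases
In LLM pretraining, data preprocessing is a key step in building a high-quality corpus. As the normalization output of the Keural Korean LLM pretraining workflow, this dataset is typically used as the unified, structured base for the downstream filtering, deduplication, and sharding stages. Because raw text from 38 different sources has been converted to a standardized JSONL format, researchers can process very large multilingual, multi-domain document collections within one consistent framework, which makes the data pipeline markedly easier to operate and scale.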
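Derived Related Work
This dataset is a core component of the Keural data preprocessing pipeline and leads directly to a series of related works. After normalization, the data enters the Stage 1 filtering stage, which produces the `mkd-chanwoo/filtered-datasets-for-koreanLLM` dataset, focused on quality, language, and toxicity filtering. Stage 2, `mkd-chanwoo/keural-datasets`, then performs the final deduplication and sharding and yields production data that can be used directly for model training. Finally, the processed corpus is used to train language models such as `mkd-chanwoo/simplemodel-270M`, closing the loop from data preparation to model training.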
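Recent Research on the Dataset
Latest Research Directions
In Korean LLM pretraining, dataset normalization and multilingual integration are becoming a focus of current research. Normalized Datasets for Korean LLM, an intermediate product of the Keural pretraining workflow, unifies the formats of 38 source datasets into a standardized corpus covering English, Korean, code, and scientific text. This resolves the technical challenges caused by heterogeneous multi-source data and provides a structured base for later filtering, deduplication, and model training. Current research interest centers on how such normalized data can improve the cross-lingual understanding of Korean LLMs, especially the balanced representation of low-resource and high-resource languages. As global demand for multilingual models grows, the dataset supports exploring Korean in specialized areas such as code generation and scientific literature processing, and contributes to a more diverse ecosystem of Asian language models.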
The content above was collected and summarized by 遇见数据集.



