mkd-chanwoo/keural-SFT
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/mkd-chanwoo/keural-SFT
---
license: other
language:
- ko
- en
tags:
- sft
- instruction-tuning
- chatml
- korean
- math
- code
size_categories:
- 1M<n<10M
---
# Keural SFT Dataset
Bilingual (Korean/English) instruction-tuning dataset for the Keural LLM project.
Built from 14 curated sources and formatted in ChatML after multi-stage filtering.
---
## Dataset Summary
| Field | Value |
|-------|-------|
| Total samples | 1,144,119 |
| Total tokens | 710,280,675 (~710M) |
| Average tokens/sample | 620.8 |
| Max sequence length | 8,192 tokens |
| Language ratio | Korean 45.3% / English 54.7% |
| Format | ChatML |
| Number of shards | 115 (up to 10,000 samples/shard; the last shard holds 4,119) |
| Tokenizer | Keural SentencePiece |
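
The shard count follows from the sample total: 1,144,119 samples at 10,000 per shard fill 114 shards and leave a partial final shard. The sketch below shows that arithmetic together with a SHA256 digest per shard, as mentioned in the processing pipeline. The function names and the in-memory JSONL serialization are illustrative assumptions, not the project's packaging code.

```python
import hashlib
import json
import math

SHARD_SIZE = 10_000  # samples per shard, from the summary table


def shard_count(n_samples, shard_size=SHARD_SIZE):
    """Number of shards needed; the last shard may be partial."""
    return math.ceil(n_samples / shard_size)


def iter_shards(records, shard_size=SHARD_SIZE):
    """Yield (shard_bytes, sha256_hex) for consecutive groups of records."""
    for start in range(0, len(records), shard_size):
        chunk = records[start:start + shard_size]
        # One JSON object per line; ensure_ascii=False keeps Korean text readable.
        payload = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in chunk)
        data = payload.encode("utf-8")
        yield data, hashlib.sha256(data).hexdigest()
```

Calling `shard_count(1_144_119)` gives 115, which matches the table.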
---
## Format
Each sample contains a `text` field in ChatML format.
```
<|im_start|>system
{system_prompt}
<|im_end|>
<|im_start|>user
{user_message}
<|im_end|>
<|im_start|>assistant
{response}
<|im_end|>
```
The `system` turn is included only when present in the source data.
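
A minimal sketch of how a conversation could be rendered into this layout. The `render_chatml` helper and the `role`/`content` message keys are assumptions for illustration, and the exact whitespace follows the serialized example in the Record Schema section (content immediately followed by `<|im_end|>`, turns joined by a newline) rather than a confirmed template.

```python
def render_chatml(messages, system_prompt=None):
    """Render a list of {"role", "content"} dicts into one ChatML string.

    The system turn is emitted only when a system prompt is supplied,
    mirroring the rule that it appears only when present in the source.
    """
    turns = []
    if system_prompt:
        turns.append(("system", system_prompt))
    turns.extend((m["role"], m["content"]) for m in messages)
    return "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in turns
    )
```

For a one-turn exchange, `render_chatml([{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}])` produces two turns and no system block.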
### Record Schema
```json
{
  "text": "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n...<|im_end|>",
  "source_name": "alpaca",
  "license": "cc-by-4.0",
  "n_tokens": 312
}
```
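
To work with individual turns, the `text` field can be split back into messages. The sketch below assumes each shard line holds one JSON object with the four fields shown above; `parse_record` and the regular expression are illustrative, not part of the dataset tooling.

```python
import json
import re

REQUIRED_FIELDS = {"text", "source_name", "license", "n_tokens"}

# One turn: start marker, role, newline, content, end marker.
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_record(line):
    """Parse one JSONL line and split its ChatML text into turns."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    turns = [
        {"role": role, "content": content.strip()}
        for role, content in TURN_RE.findall(record["text"])
    ]
    return record, turns
```

Stripping the content makes the parser tolerant of a newline before `<|im_end|>`, since the template and the serialized example differ on that point.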
### Special Tokens
The Keural tokenizer includes the following reserved tokens:
| Token | ID | Description |
|-------|----|-------------|
| `<pad>` | 0 | Padding |
| `<bos>` | 1 | Beginning of sequence |
| `<eos>` | 2 | End of sequence |
| `<unk>` | 3 | Unknown token |
| `<\|im_start\|>` | — | Turn start marker (**must be added before training/serving**) |
| `<\|im_end\|>` | — | Turn end marker (**must be added before training/serving**) |
`<|im_start|>` and `<|im_end|>` are not in the base vocabulary and must be added as user-defined tokens prior to fine-tuning.
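
The table lists no IDs for the two ChatML markers because they are absent from the base vocabulary. The toy sketch below illustrates what adding them involves, assuming new tokens are appended after the highest existing ID. The four-entry vocabulary stands in for the real one, whose size is not stated here, and `extend_vocab` is a hypothetical helper rather than a tokenizer API.

```python
# Reserved tokens and IDs from the table above; the real vocabulary is larger.
BASE_VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
CHATML_TOKENS = ["<|im_start|>", "<|im_end|>"]


def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab with new tokens appended at the next free IDs."""
    extended = dict(vocab)
    next_id = max(extended.values()) + 1
    for token in new_tokens:
        if token not in extended:  # keep the operation idempotent
            extended[token] = next_id
            next_id += 1
    return extended
```

Any model trained with the extended vocabulary also needs its embedding table grown by the same number of rows, otherwise the new IDs point outside it.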
---
## Dataset Composition
### Per-Source Statistics
| Dataset | Samples | Total Tokens | Avg Tokens | Task | Language | License |
|---------|--------:|-------------:|----------:|------|----------|---------|
| ultrachat | 193,212 | 257,127,675 | 1,330.8 | general | en | MIT |
| orca_math_korean | 185,362 | 71,152,470 | 383.9 | math | ko | MIT |
| openorca | 138,639 | 57,813,277 | 417.0 | general | en | MIT |
| aihub_multisession_sci | 127,868 | 92,753,664 | 725.4 | general | ko | AI Hub Terms |
| mathinstruct | 153,921 | 48,697,451 | 316.4 | math | en | MIT |
| aihub_multisession_social | 85,346 | 69,374,613 | 812.9 | general | ko | AI Hub Terms |
| magicoder | 71,953 | 43,425,338 | 603.5 | code | en | MIT |
| koinstruct_qa | 45,299 | 20,179,572 | 445.5 | general | ko | Apache 2.0 |
| koinstruct_base | 42,276 | 22,795,596 | 539.2 | general | ko | Apache 2.0 |
| alpaca | 46,303 | 5,390,721 | 116.4 | general | en | CC BY 4.0 |
| koalpaca | 21,091 | 4,278,003 | 202.8 | general | ko | CC BY-SA 4.0 |
| competition_math | 12,040 | 4,042,968 | 335.8 | math | en | MIT |
| aihub_expert_qa | 10,778 | 10,921,351 | 1,013.3 | general | ko | AI Hub Terms |
| gsm8k | 10,031 | 2,327,976 | 232.1 | math | en | MIT |
| **Total** | **1,144,119** | **710,280,675** | **620.8** | | | |
### Domain Distribution
| Domain | Samples | Share | Total Tokens | Avg Tokens |
|--------|--------:|------:|-------------:|-----------:|
| general/en | 378,154 | 33.1% | 320,331,673 | 847.1 |
| general/ko | 332,658 | 29.1% | 220,302,799 | 662.3 |
| math/ko | 185,362 | 16.2% | 71,152,470 | 383.9 |
| math/en | 175,992 | 15.4% | 55,068,395 | 312.9 |
| code/en | 71,953 | 6.3% | 43,425,338 | 603.5 |
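
The domain rows can be reproduced by grouping the per-source table on task and language. A short sketch of that aggregation, with the figures copied from the per-source table; the `SOURCES` layout and `aggregate_by_domain` are illustrative names.

```python
from collections import defaultdict

# (source, samples, total tokens, task, language) from the per-source table
SOURCES = [
    ("ultrachat", 193_212, 257_127_675, "general", "en"),
    ("orca_math_korean", 185_362, 71_152_470, "math", "ko"),
    ("openorca", 138_639, 57_813_277, "general", "en"),
    ("aihub_multisession_sci", 127_868, 92_753_664, "general", "ko"),
    ("mathinstruct", 153_921, 48_697_451, "math", "en"),
    ("aihub_multisession_social", 85_346, 69_374_613, "general", "ko"),
    ("magicoder", 71_953, 43_425_338, "code", "en"),
    ("koinstruct_qa", 45_299, 20_179_572, "general", "ko"),
    ("koinstruct_base", 42_276, 22_795_596, "general", "ko"),
    ("alpaca", 46_303, 5_390_721, "general", "en"),
    ("koalpaca", 21_091, 4_278_003, "general", "ko"),
    ("competition_math", 12_040, 4_042_968, "math", "en"),
    ("aihub_expert_qa", 10_778, 10_921_351, "general", "ko"),
    ("gsm8k", 10_031, 2_327_976, "math", "en"),
]


def aggregate_by_domain(sources):
    """Sum samples and tokens per "task/language" key and derive the average."""
    totals = defaultdict(lambda: [0, 0])
    for _name, samples, tokens, task, lang in sources:
        totals[f"{task}/{lang}"][0] += samples
        totals[f"{task}/{lang}"][1] += tokens
    return {key: (s, t, round(t / s, 1)) for key, (s, t) in totals.items()}
```

Summing the sample column of the result gives 1,144,119, the dataset total.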
### Language Split
| Language | Samples | Share |
|----------|--------:|------:|
| English | 626,099 | 54.7% |
| Korean | 518,020 | 45.3% |
### Token Length Distribution
| Range | Samples | Share |
|-------|--------:|------:|
| 0–512 | 563,375 | 49.2% |
| 512–1,024 | 425,489 | 37.2% |
| 1,024–2,048 | 129,878 | 11.4% |
| 2,048–4,096 | 24,942 | 2.2% |
| 4,096+ | 435 | 0.04% |
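
A histogram like this can be rebuilt from the `n_tokens` field of each record. The table does not say which bucket a boundary value such as 512 falls into, so the sketch below assumes upper bounds are inclusive; `bucket` and `length_histogram` are illustrative names.

```python
import bisect
from collections import Counter

EDGES = [512, 1024, 2048, 4096]
LABELS = ["0–512", "512–1,024", "1,024–2,048", "2,048–4,096", "4,096+"]


def bucket(n_tokens):
    """Map a token count to its range label.

    bisect_left places a value equal to an edge in the lower bucket,
    which is the assumed (upper-inclusive) convention.
    """
    return LABELS[bisect.bisect_left(EDGES, n_tokens)]


def length_histogram(token_counts):
    """Count samples per range, keeping empty ranges in the output."""
    counts = Counter(bucket(n) for n in token_counts)
    return {label: counts.get(label, 0) for label in LABELS}
```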
---
## Processing Pipeline
```
A. collect   → Download from HuggingFace Hub / AI Hub
B. structure → Normalize to standard messages schema
C. clean     → Remove empty responses, role errors, control characters
D. quality   → Quality score filter (min threshold: 0.6)
E. safety    → Tag and remove harmful content
F. dedup     → Exact hash + MinHash LSH deduplication (similarity threshold: 0.8)
G. format    → Apply ChatML template
H. tokenize  → Measure token length, truncate at 8,192
I. package   → Split into 10,000-sample shards with SHA256 checksums
```
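
Step F can be illustrated with a simplified sketch: an exact-hash pass followed by a MinHash similarity estimate at the 0.8 threshold. This is not the project's implementation. It compares signatures pairwise instead of using LSH banding, so it only suits small inputs, and the shingle size, signature length, and hash choice are all assumptions.

```python
import hashlib

NUM_PERM = 64      # signature length (assumed)
THRESHOLD = 0.8    # similarity threshold from step F


def shingles(text, k=5):
    """Character k-grams of whitespace-normalized text."""
    norm = " ".join(text.split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash(text, num_perm=NUM_PERM):
    """For each seed, keep the minimum salted hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(texts, threshold=THRESHOLD):
    """Keep the first occurrence of each exact or near-duplicate text."""
    kept, seen_hashes, kept_sigs = [], set(), []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        sig = minhash(text)
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue  # near duplicate
        kept.append(text)
        kept_sigs.append(sig)
    return kept
```

At dataset scale the pairwise loop would be replaced by LSH banding, which buckets signatures so that only likely matches are compared.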
---
## Licenses
This dataset is a mixture of multiple sources with different licenses.
Please verify individual source licenses before any commercial use.
| License | Sources |
|---------|---------|
| MIT | OpenOrca, UltraChat, Magicoder, MathInstruct, competition_math, GSM8K, orca_math_korean |
| CC BY 4.0 | Stanford Alpaca |
| CC BY-SA 4.0 | KoAlpaca |
| Apache 2.0 | KoInstruct (base, qa) |
| AI Hub Terms (non-commercial research only) | AI Hub 71304, 71674, 71675 |
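
Because each record carries a `source_name`, the sources under AI Hub Terms can be separated from the rest before training. The sketch below assumes `source_name` values match the dataset names in the per-source table; `drop_non_commercial_sources` is an illustrative helper and does not replace reading each license.

```python
# Sources listed under "AI Hub Terms" in the per-source table.
NON_COMMERCIAL_SOURCES = {
    "aihub_multisession_sci",
    "aihub_multisession_social",
    "aihub_expert_qa",
}


def drop_non_commercial_sources(records):
    """Return only records whose source is not under AI Hub Terms."""
    return [r for r in records if r["source_name"] not in NON_COMMERCIAL_SOURCES]
```

By the per-source counts, removing these three sources would leave 920,127 of the 1,144,119 samples. The remaining licenses still carry their own obligations, such as attribution and share-alike for CC BY-SA 4.0.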
---
Provider: mkd-chanwoo



