Kyoto-Corpus
Source: ModelScope community (updated 2026-01-09, indexed 2025-09-13)
Download link: https://modelscope.cn/datasets/Nikity/Kyoto-Corpus
# Kyoto-Corpus
**Kyoto-Corpus** is a high-quality, small-scale dataset designed for the instruction tuning of Small Language Models (SLMs).

The philosophy behind Kyoto-Corpus is "quality over quantity." Instead of being an entirely new dataset, it is a carefully curated, filtered, and unified collection of some of the best publicly available instruction and chat datasets. This process creates a clean, diverse, and effective corpus for training capable models like **Lille-130M-Instruct**.

---
## ✨ Features
* **Diverse & High-Quality Sources:** The corpus is built by aggregating well-regarded datasets covering general chat, instruction following, mathematics, and knowledge-based Q&A.
* **Unified Chat Format:** All data is standardized into a consistent chat format using special tokens (`<|startoftext|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`), making it ready to use with the **[Hastings](https://github.com/Nikityyy/Hastings)** tokenizer.
* **Careful Filtering & Deduplication:** The creation pipeline applies strict quality controls: it truncates or skips conversations longer than 512 tokens, enforces a valid turn structure, and removes duplicate entries within and across the source datasets.
* **Optimized for Small Models:** The token limit and curated nature make this dataset particularly well-suited for training and fine-tuning SLMs without requiring massive computational resources.
* **Multiple Formats:** The dataset is available in two formats:
  * **Parquet**
  * **Plain Text**
* **Transparent & Reproducible:** The scripts used to generate the entire corpus from the source datasets are included in this repository, ensuring full transparency.
## 📊 Dataset Composition
Kyoto-Corpus is a blend of the following open-source datasets. The creation script processes, filters, and deduplicates the combined data to form the final corpus.
| Source Dataset | Type | Original Hugging Face Link |
| :--- | :--- | :--- |
| **ultrachat_200k** | General Purpose | [`HuggingFaceH4/ultrachat_200k`](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| **smoltalk2** | General Purpose | [`HuggingFaceTB/smoltalk2`](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2) |
| **smol-smoltalk** | General Purpose | [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **WildChat-1M** | General Purpose | [`allenai/WildChat-1M`](https://huggingface.co/datasets/allenai/WildChat-1M) |
| **WizardLM_evol_instruct_V2** | General Purpose | [`WizardLMTeam/WizardLM_evol_instruct_V2_196k`](https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_V2_196k) |
| **ifeval-like-data** | Instruction | [`argilla/ifeval-like-data`](https://huggingface.co/datasets/argilla/ifeval-like-data) |
| **tulu-3-sft-personas** | Instruction | [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) |
| **mmlu** | Knowledge | [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) |
| **gsm8k** | Math | [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) |
| **math_qa** | Math | [`allenai/math_qa`](https://huggingface.co/datasets/allenai/math_qa) |
| **MetaMathQA** | Math | [`meta-math/MetaMathQA`](https://huggingface.co/datasets/meta-math/MetaMathQA) |
## 📝 Data Format
Each entry in the dataset follows a strict conversational structure.
#### Parquet Format (Structured)
The Parquet file contains a `messages` column, which holds a list of dictionaries, and a `hf_dataset` column indicating the original source.
```json
{
"messages": [
{"role": "user", "content": "What is the capital of Japan?"},
{"role": "assistant", "content": "The capital of Japan is Tokyo."}
],
"hf_dataset": "Username/Repository"
}
```
#### Plain Text Format
The `train.txt` file contains the fully formatted string for each conversation, ready for tokenization.
```
<|startoftext|><|user|>What is the capital of Japan?<|assistant|>The capital of Japan is Tokyo.<|endoftext|>
```
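The mapping from a `messages` list to the plain-text line can be sketched as follows. This is a minimal illustration inferred from the single-turn example above, not the repository's actual code: the function name `format_conversation` is hypothetical, and the sketch assumes that every turn is written as its role token followed directly by the content and that only `user` and `assistant` roles occur.

```python
# Special tokens as listed in the Features section.
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"
ROLE_TOKENS = {"user": "<|user|>", "assistant": "<|assistant|>"}


def format_conversation(messages):
    """Render a list of {"role", "content"} dicts as one plain-text line.

    Hypothetical helper: the layout is an assumption based on the
    single-turn example in this README.
    """
    parts = [BOS]
    for message in messages:
        # A KeyError here signals a role this sketch does not handle.
        parts.append(ROLE_TOKENS[message["role"]] + message["content"])
    parts.append(EOS)
    return "".join(parts)


example = [
    {"role": "user", "content": "What is the capital of Japan?"},
    {"role": "assistant", "content": "The capital of Japan is Tokyo."},
]
print(format_conversation(example))
```

For the example conversation, this reproduces the line shown in the plain-text sample above. How the real scripts lay out multi-turn conversations is not documented here, so check `train.txt` before relying on this layout.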
## 🚀 Usage
You can easily load Kyoto-Corpus from the Hugging Face Hub using the `datasets` library.
```python
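from datasets import load_dataset

# Fetch the train split from the Hugging Face Hub (cached locally after the first call).
ds_parquet = load_dataset("Nikityyy/Kyoto-Corpus", split="train")
# Each row is a dict with a `messages` list and an `hf_dataset` source label.
print(ds_parquet[0])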
```
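Because every row records its origin in the `hf_dataset` column, the corpus can be inspected or subset per source. The sketch below works on any iterable of row dictionaries, so it runs without downloading anything. The helper names `count_by_source` and `only_source` are hypothetical, the sample rows are made up, and the assumption that `hf_dataset` holds Hub repository IDs such as `openai/gsm8k` is inferred from the `"Username/Repository"` placeholder above.

```python
from collections import Counter


def count_by_source(rows):
    """Count rows per value of the `hf_dataset` column."""
    return Counter(row["hf_dataset"] for row in rows)


def only_source(rows, source):
    """Keep only the rows whose `hf_dataset` equals `source`."""
    return [row for row in rows if row["hf_dataset"] == source]


# Made-up rows that mimic the documented schema. With the real dataset,
# pass the object returned by load_dataset(...) instead.
sample_rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"},
                  {"role": "assistant", "content": "4"}],
     "hf_dataset": "openai/gsm8k"},
    {"messages": [{"role": "user", "content": "Say hello."},
                  {"role": "assistant", "content": "Hello!"}],
     "hf_dataset": "HuggingFaceTB/smol-smoltalk"},
    {"messages": [{"role": "user", "content": "What is 3 * 3?"},
                  {"role": "assistant", "content": "9"}],
     "hf_dataset": "openai/gsm8k"},
]

print(count_by_source(sample_rows))
print(len(only_source(sample_rows, "openai/gsm8k")))
```

On a loaded `datasets.Dataset`, the same subset can be taken with its built-in `filter` method, for example `ds_parquet.filter(lambda row: row["hf_dataset"] == "openai/gsm8k")`.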
## 🛠️ How It Was Created
The entire corpus was generated using the scripts in this repository (`script_parquet.py` and `script_small.py`). The process is as follows:
1. **Stream Data:** The script streams each source dataset from the Hugging Face Hub to minimize local storage requirements.
2. **Process in Parallel:** Data is processed in batches using Python's `multiprocessing` to leverage all available CPU cores.
3. **Format Unification:** Each entry is converted from its original format (e.g., `flat`, `mcq`, conversational) into the standardized chat structure.
4. **Filter & Truncate:** Conversations are validated for correctness (e.g., must start with a user turn). They are truncated or skipped if their tokenized length exceeds the `MAX_TOKENS` limit (512).
5. **Deduplicate:** A hash of each processed entry is generated (using `xxhash` for speed), and only unique entries are kept, ensuring no duplicates exist within or across datasets.
6. **Save Output:** The final, clean entries are saved to the Parquet and plain text files, along with a `data.json` file containing detailed statistics about the creation process.
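Steps 3 to 5 above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `script_parquet.py` or `script_small.py`. All function names are hypothetical. Token counting uses a whitespace split as a stand-in for the real tokenizer. Hashing uses the standard library's `hashlib` as a stand-in for `xxhash`, so that the example has no third-party dependencies. The turn-structure rule (strictly alternating turns ending on an assistant turn) is an assumption beyond the documented "must start with a user turn". Over-long conversations are only skipped here, whereas the real pipeline may also truncate them.

```python
import hashlib

MAX_TOKENS = 512  # limit stated in this README


def count_tokens(text):
    """Stand-in token counter: whitespace split, NOT the real tokenizer."""
    return len(text.split())


def from_flat(prompt, response):
    """Step 3: convert a flat prompt/response pair into the chat structure."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def is_valid(messages):
    """Step 4 (assumed rule): user first, alternating roles, non-empty turns."""
    if not messages or len(messages) % 2 != 0:
        return False
    expected = ("user", "assistant")
    return all(
        m["role"] == expected[i % 2] and bool(m["content"].strip())
        for i, m in enumerate(messages)
    )


def entry_hash(messages):
    """Step 5: hash the whole conversation (sha256 stand-in for xxhash)."""
    joined = "\x1f".join(f'{m["role"]}\x1e{m["content"]}' for m in messages)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def build_corpus(conversations, max_tokens=MAX_TOKENS):
    """Validate, length-filter, and deduplicate an iterable of conversations."""
    seen = set()
    kept = []
    for messages in conversations:
        if not is_valid(messages):
            continue
        total = sum(count_tokens(m["content"]) for m in messages)
        if total > max_tokens:
            continue  # this sketch skips; it does not truncate
        digest = entry_hash(messages)
        if digest in seen:
            continue  # duplicate within or across sources
        seen.add(digest)
        kept.append(messages)
    return kept


raw = [
    from_flat("What is 2 + 2?", "4"),
    from_flat("What is 2 + 2?", "4"),            # exact duplicate
    [{"role": "assistant", "content": "Hi"}],    # does not start with a user turn
    from_flat("Tell me a story.", "word " * 600),  # exceeds the token limit
]
corpus = build_corpus(raw)
print(len(corpus))
```

Of the four made-up inputs, only the first survives: the second is a duplicate, the third fails validation, and the fourth exceeds the limit. Keeping a set of digests rather than full entries keeps memory use low when deduplicating across many source datasets.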
## 🛠️ The Truly Open-Source Stack
Kyoto-Corpus is a key component of my initiative to build and release a complete, truly open-source stack for language modeling. All components are designed to work together seamlessly.
* **Tokenizer:** **[Hastings](https://github.com/Nikityyy/Hastings)** - A modern, efficient tokenizer with a 32k vocabulary.
* **Dataset:** **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)** (this repository) - A high-quality, small-scale dataset for instruction tuning.
* **Model:** **[lille](https://github.com/Nikityyy/lille)** - A powerful 130-million-parameter model trained from scratch using the Hastings tokenizer.
* **Optimizer:** **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** - A memory-efficient, Triton-based implementation of the SophiaG optimizer.
* **Evaluations:** **[simple-eval](https://github.com/Nikityyy/simple-eval)** - A straightforward framework for evaluating model performance using an LLM as a Judge.
---
## 📜 License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/Nikityyy/Kyoto-Corpus/blob/main/LICENSE) file for details.
Provider: maas
Created: 2025-09-03



