Kyoto-Corpus

Name: Kyoto-Corpus
Creator: maas
Published: 2026-01-09 19:22:54
License: no description provided

ModelScope community: updated 2026-01-09, indexed 2025-09-13

Download link:

https://modelscope.cn/datasets/Nikity/Kyoto-Corpus

Description:

# Kyoto-Corpus

**Kyoto-Corpus** is a high-quality, small-scale dataset designed for the instruction tuning of Small Language Models (SLMs).

![Lille-Header](assets/lille-header.png)

The philosophy behind Kyoto-Corpus is "quality over quantity." Instead of being an entirely new dataset, it is a carefully curated, filtered, and unified collection of some of the best publicly available instruction and chat datasets. This process creates a clean, diverse, and effective corpus for training capable models like **Lille-130M-Instruct**.

---

## ✨ Features

* **Diverse & High-Quality Sources:** The corpus is built by aggregating well-regarded datasets covering general chat, instruction following, mathematics, and knowledge-based Q&A.
* **Unified Chat Format:** All data is standardized into a consistent chat format using special tokens (`<|startoftext|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`), making it ready to use with the **[Hastings](https://github.com/Nikityyy/Hastings)** tokenizer.
* **Careful Filtering & Deduplication:** The creation pipeline applies strict quality controls, including filtering out conversations that are too long (max 512 tokens), ensuring proper turn structure, and removing duplicate entries across all source datasets.
* **Optimized for Small Models:** The token limit and curated nature make this dataset particularly well-suited for training and fine-tuning SLMs without requiring massive computational resources.
* **Multiple Formats:** The dataset is available in two formats:
  * **Parquet**
  * **Plain Text**
* **Transparent & Reproducible:** The scripts used to generate the entire corpus from the source datasets are included in this repository, ensuring full transparency.

## 📊 Dataset Composition

Kyoto-Corpus is a blend of the following open-source datasets. The creation script processes, filters, and deduplicates the combined data to form the final corpus.

| Source Dataset | Type | Original Hugging Face Link |
| :--- | :--- | :--- |
| **ultrachat_200k** | General Purpose | [`HuggingFaceH4/ultrachat_200k`](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| **smoltalk2** | General Purpose | [`HuggingFaceTB/smoltalk2`](https://huggingface.co/datasets/HuggingFaceTB/smoltalk2) |
| **smol-smoltalk** | General Purpose | [`HuggingFaceTB/smol-smoltalk`](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **WildChat-1M** | General Purpose | [`allenai/WildChat-1M`](https://huggingface.co/datasets/allenai/WildChat-1M) |
| **WizardLM_evol_instruct_V2** | General Purpose | [`WizardLMTeam/WizardLM_evol_instruct_V2_196k`](https://huggingface.co/datasets/WizardLMTeam/WizardLM_evol_instruct_V2_196k) |
| **ifeval-like-data** | Instruction | [`argilla/ifeval-like-data`](https://huggingface.co/datasets/argilla/ifeval-like-data) |
| **tulu-3-sft-personas** | Instruction | [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following) |
| **mmlu** | Knowledge | [`cais/mmlu`](https://huggingface.co/datasets/cais/mmlu) |
| **gsm8k** | Math | [`openai/gsm8k`](https://huggingface.co/datasets/openai/gsm8k) |
| **math_qa** | Math | [`allenai/math_qa`](https://huggingface.co/datasets/allenai/math_qa) |
| **MetaMathQA** | Math | [`meta-math/MetaMathQA`](https://huggingface.co/datasets/meta-math/MetaMathQA) |

## 📝 Data Format

Each entry in the dataset follows a strict conversational structure.

#### Parquet Format (Structured)

The Parquet file contains a `messages` column, which holds a list of dictionaries, and a `hf_dataset` column indicating the original source.

```json
{
  "messages": [
    {"role": "user", "content": "What is the capital of Japan?"},
    {"role": "assistant", "content": "The capital of Japan is Tokyo."}
  ],
  "hf_dataset": "Username/Repository"
}
```

#### Plain Text Format

The `train.txt` file contains the fully formatted string for each conversation, ready for tokenization.

```
<|startoftext|><|user|>What is the capital of Japan?<|assistant|>The capital of Japan is Tokyo.<|endoftext|>
```

## 🚀 Usage

You can easily load Kyoto-Corpus from the Hugging Face Hub using the `datasets` library.

```python
from datasets import load_dataset

ds_parquet = load_dataset("Nikityyy/Kyoto-Corpus", split="train")
print(ds_parquet[0])
```

## 🛠️ How It Was Created

The entire corpus was generated using the scripts in this repository (`script_parquet.py` and `script_small.py`). The process is as follows:

1. **Stream Data:** The script streams each source dataset from the Hugging Face Hub to minimize local storage requirements.
2. **Process in Parallel:** Data is processed in batches using Python's `multiprocessing` to leverage all available CPU cores.
3. **Format Unification:** Each entry is converted from its original format (e.g., `flat`, `mcq`, conversational) into the standardized chat structure.
4. **Filter & Truncate:** Conversations are validated for correctness (e.g., must start with a user turn). They are truncated or skipped if their tokenized length exceeds the `MAX_TOKENS` limit (512).
5. **Deduplicate:** A hash of each processed entry is generated (using `xxhash` for speed), and only unique entries are kept, ensuring no duplicates exist within or across datasets.
6. **Save Output:** The final, clean entries are saved to the Parquet and plain text files, along with a `data.json` file containing detailed statistics about the creation process.

## 🛠️ The truly open-source repos

Kyoto-Corpus is a key component of my initiative to build and release a complete, truly open-source stack for language modeling. All components are designed to work together seamlessly.

* **Tokenizer:** **[Hastings](https://github.com/Nikityyy/Hastings)** - A modern, efficient tokenizer with a 32k vocabulary.
* **Dataset:** **[Kyoto-Corpus](https://github.com/Nikityyy/Kyoto-Corpus)** (this repository) - A high-quality, small-scale dataset for instruction tuning.
* **Model:** **[lille](https://github.com/Nikityyy/lille)** - A powerful 130-million-parameter model trained from scratch using the Hastings tokenizer.
* **Optimizer:** **[Sophia-Triton](https://github.com/Nikityyy/Sophia-Triton)** - A memory-efficient, Triton-based implementation of the SophiaG optimizer.
* **Evaluations:** **[simple-eval](https://github.com/Nikityyy/simple-eval)** - A straightforward framework for evaluating model performance using an LLM as a Judge.

---

## 📜 License

This project is licensed under the MIT License. See the [LICENSE](https://github.com/Nikityyy/Kyoto-Corpus/blob/main/LICENSE) file for details.
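To illustrate how the two formats described under "Data Format" relate, here is a minimal sketch that renders a Parquet-style `messages` list into the plain-text layout. The function name `render_conversation` is hypothetical and not taken from the repository scripts. The sketch also assumes that, in multi-turn conversations, `<|endoftext|>` appears only once at the very end; the README shows a single-turn example only, so that detail is a guess.

```python
# Special tokens as listed in the README.
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"
ROLE_TOKENS = {"user": "<|user|>", "assistant": "<|assistant|>"}


def render_conversation(messages):
    """Render a list of {"role", "content"} dicts as one plain-text line.

    Hypothetical helper: the repository's own scripts may differ, and the
    placement of EOS only at the end of the conversation is an assumption.
    """
    parts = [BOS]
    for message in messages:
        # Unknown roles raise KeyError, which surfaces malformed entries early.
        parts.append(ROLE_TOKENS[message["role"]] + message["content"])
    parts.append(EOS)
    return "".join(parts)


example = [
    {"role": "user", "content": "What is the capital of Japan?"},
    {"role": "assistant", "content": "The capital of Japan is Tokyo."},
]
rendered = render_conversation(example)
```

For the README's own example entry, `rendered` matches the line shown in the "Plain Text Format" section.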

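The filter and deduplicate steps listed under "How It Was Created" can be sketched roughly as follows. This is not the code from `script_parquet.py`; it is a simplified illustration under several assumptions. Token counts use a whitespace split as a stand-in for the Hastings tokenizer, hashing uses the standard library's `hashlib.blake2b` instead of `xxhash` so the example has no third-party dependencies, over-length conversations are always skipped rather than truncated, and strict user/assistant alternation is assumed as the turn-structure rule. All function names are hypothetical.

```python
import hashlib

MAX_TOKENS = 512  # limit stated in the README


def count_tokens(text):
    # Stand-in only: whitespace split, NOT the Hastings tokenizer.
    return len(text.split())


def has_valid_turns(messages):
    # The README requires a user turn first; strict alternation is an assumption.
    expected = "user"
    for message in messages:
        if message["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return len(messages) > 0


def entry_hash(messages):
    # blake2b is swapped in for xxhash here; any fast stable hash would do.
    hasher = hashlib.blake2b(digest_size=8)
    for message in messages:
        hasher.update(message["role"].encode("utf-8") + b"\x00")
        hasher.update(message["content"].encode("utf-8") + b"\x00")
    return hasher.digest()


def filter_and_dedup(entries):
    """Keep entries with valid turns, within the token limit, and not seen before."""
    seen = set()
    kept = []
    for entry in entries:
        messages = entry["messages"]
        if not has_valid_turns(messages):
            continue
        if sum(count_tokens(m["content"]) for m in messages) > MAX_TOKENS:
            continue  # this sketch skips; the real pipeline may truncate instead
        key = entry_hash(messages)
        if key in seen:
            continue  # duplicate within or across source datasets
        seen.add(key)
        kept.append(entry)
    return kept
```

Because the hash covers only the `messages` content and not `hf_dataset`, the same conversation arriving from two different sources is kept once, which matches the README's description of deduplication across datasets.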

Provider: maas

Created: 2025-09-03
