tsuki-team/Tsuki-dataset

Name: tsuki-team/Tsuki-dataset
Creator: tsuki-team
Published: 2026-03-31 18:09:57
License: No description provided

Hugging Face, updated 2026-03-31, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/tsuki-team/Tsuki-dataset

Resource overview:

---
license: apache-2.0
language:
- es
- en
task_categories:
- text-generation
- summarization
task_ids:
- text-simplification
- abstractive-qa
tags:
- token-compression
- prompt-optimization
- llm-cost-reduction
- flan-t5
- instruction-following
- bilingual
- tsuki
pretty_name: Tsuki Token Compression Dataset
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: tsuki_train_15k.jsonl
dataset_info:
  features:
  - name: inputs
    dtype: string
  - name: targets
    dtype: string
  - name: task_type
    dtype: string
  splits:
  - name: train
    num_examples: 15000
description: >
  Bilingual (es/en) dataset for training token compression models.
  Contains pairs of verbose text and compressed versions across 6 task types:
  compression_chat, compression_markdown, compression_technical,
  reasoning_yes, reasoning_no, and compare_compression.
  Designed to train models that reduce LLM API costs by compressing prompts
  without losing critical information.
---

<div align="center">

<br>

<img src="https://img.shields.io/badge/%E2%9C%A6-TSUKI-000000?style=for-the-badge&labelColor=000000" alt="Tsuki" height="50">

<br><br>

# Token Compression Training Dataset

**15,000 bilingual compression pairs.
Prompts, markdown, and technical instructions.**<br>
**Trains models to reduce LLM API costs without losing critical information.**

<br>

<a href="#task-types"><img src="https://img.shields.io/badge/TASK_TYPES-000000?style=for-the-badge" alt="Task Types"></a>
<a href="#examples"><img src="https://img.shields.io/badge/EXAMPLES-000000?style=for-the-badge" alt="Examples"></a>
<a href="#dataset-info"><img src="https://img.shields.io/badge/DATASET_INFO-000000?style=for-the-badge" alt="Dataset Info"></a>

<br><br>

[![License](https://img.shields.io/badge/Apache_2.0-222222?style=flat-square&logo=apache&logoColor=white)](LICENSE)
[![Bilingual](https://img.shields.io/badge/ES%20%7C%20EN-222222?style=flat-square&logoColor=white)](#dataset-info)
[![Flan-T5](https://img.shields.io/badge/Flan--T5_Ready-222222?style=flat-square&logo=google&logoColor=white)](#usage)
[![15K pairs](https://img.shields.io/badge/15K_pairs-222222?style=flat-square&logoColor=white)](#dataset-info)

<br>

---

<br>

<table>
<tr>
<td width="50%" valign="top">

**Real compression pairs.**<br><br>
Verbose prompts → minimum tokens.<br>
Markdown sections → essential content.<br>
Technical instructions → direct commands.<br>
Reasoning examples included.

</td>
<td width="50%" valign="top">

**Knows when not to compress.**<br><br>
Legal text preserved intact.<br>
Medical instructions unchanged.<br>
Safety-critical content protected.<br>
Security protocols left untouched.

</td>
</tr>
</table>

<br>

</div>

---

<br>

<div align="center">

## What is this dataset?

</div>

<br>

This is the official training dataset for the **Tsuki token compression model**, maintained by the [Tsuki-team](https://huggingface.co/tsuki-team). It contains 15,000 curated input-output pairs designed to teach a model the difference between verbose text and its semantic minimum, without losing any critical information.
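
As a rough illustration of what one input-output pair looks like, the sketch below parses a single JSONL line and checks it against the three string fields and six task types declared in the dataset card. The helper name `validate_record` is hypothetical, and the sample record is made up for illustration; it is not an actual row from the dataset.

```python
import json

# Field names and task types as declared in the dataset card.
REQUIRED_FIELDS = ("inputs", "targets", "task_type")
TASK_TYPES = {
    "compression_chat",
    "compression_markdown",
    "compression_technical",
    "reasoning_yes",
    "reasoning_no",
    "compare_compression",
}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it against the declared schema.

    Hypothetical helper for illustration; not part of any Tsuki tooling.
    """
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    if record["task_type"] not in TASK_TYPES:
        raise ValueError(f"unknown task_type: {record['task_type']!r}")
    return record


# Illustrative record only (assumed content, not copied from the dataset).
sample_line = json.dumps({
    "inputs": "Hello! I hope you are well. Could you explain how JWT authentication works?",
    "targets": "How does JWT authentication work?",
    "task_type": "compression_chat",
})

record = validate_record(sample_line)
print(record["task_type"])
```

A check like this can be run over every line of the JSONL file before training to catch malformed rows early.
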
The dataset covers three real-world text types: **chat prompts** filled with pleasantries and filler, **markdown documents** with redundant introductions and verbose section headers, and **technical instructions** padded with unnecessary context. Each pair is accompanied by task metadata that enables multi-task fine-tuning. Trained on this dataset, Tsuki reduces LLM API token usage by compressing verbose inputs before they reach GPT, Claude, or any other paid inference endpoint.

<br>

---

<br>

<div align="center">

## Task Types

</div>

<br>

| Task | Language | Description |
|:-----|:---------|:------------|
| `compression_chat` | ES / EN | Strip pleasantries, greetings, and filler from chat prompts |
| `compression_markdown` | ES / EN | Remove verbose introductions and redundant markdown sections |
| `compression_technical` | ES / EN | Reduce technical instructions to their essential commands |
| `reasoning_yes` | ES / EN | Analyze verbosity and compress with explicit justification |
| `reasoning_no` | ES / EN | Detect when compression would cause critical information loss |
| `compare_compression` | ES / EN | Evaluate two compressed versions and select the optimal one |

<br>

---

<br>

<div align="center">

## Examples

</div>

<br>

**compression_chat**

```
Input:  "Hola! Espero que estés muy bien. Quería preguntarte, si no es mucha molestia, si podrías explicarme cómo funciona exactamente el proceso de autenticación con JWT en una API REST"
Output: "¿Cómo funciona autenticación JWT en API REST?"
```

<br>

**compression_technical**

```
Input:  "Para reiniciar completamente la base de datos de desarrollo y volver a aplicar todas las migraciones desde cero, lo que te dará una base de datos limpia en su estado inicial, ejecuta el siguiente comando de Prisma: npx prisma migrate reset --force"
Output: "Ejecuta: npx prisma migrate reset --force"
```

<br>

**compression_markdown**

```
Input:  "## TypeScript generics
         En esta sección se explica de manera detallada y completa cómo funciona el uso de genéricos en TypeScript para crear funciones y tipos reutilizables, incluyendo todos los aspectos importantes que el desarrollador necesita conocer para implementarlo correctamente en su proyecto."
Output: "## TypeScript generics
         Explica TypeScript generics: funciones y tipos reutilizables."
```

<br>

**reasoning_no**

```
Input:  "El producto debe almacenarse a temperatura entre 2°C y 8°C, alejado de la luz directa, en posición vertical y fuera del alcance de niños menores de 3 años."
Output: "No comprimir. Instrucciones de almacenamiento farmacéutico con rangos de temperatura y condiciones específicas. Comprimir resultaría en pérdida crítica de información. Riesgo: Alta."
```

<br>

**compare_compression**

```
Input:
  Texto: "Si encuentras algún error durante el proceso de instalación, lo primero que deberías hacer es revisar los logs de error, verificar que todos los puertos requeridos estén disponibles y asegurarte de que las dependencias del sistema estén instaladas."
  A: "Error en instalación: revisa logs, verifica puertos disponibles, confirma dependencias."
  B: "Si hay error en instalación: revisa logs de error, verifica puertos requeridos y dependencias del sistema."
Output: "Opción A es mejor. Elimina conectores redundantes y usa formato de lista implícita. Opción B mantiene algunas palabras innecesarias."
```

<br>

---

<br>

<div align="center">

## Usage

</div>

<br>

### Load with datasets

```python
from datasets import load_dataset

dataset = load_dataset("tsuki-team/Tsuki-dataset")
train = dataset["train"]

print(train[0])
# {
#   "inputs": "Convierte este prompt al mínimo de tokens...",
#   "targets": "...",
#   "task_type": "compression_chat"
# }
```

<br>

### Fine-tune with Flan-T5

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
dataset = load_dataset("tsuki-team/Tsuki-dataset")

def tokenize(batch):
    # Tokenize the verbose inputs and the compressed targets to fixed lengths.
    inputs = tokenizer(batch["inputs"], truncation=True, padding="max_length", max_length=256)
    targets = tokenizer(batch["targets"], truncation=True, padding="max_length", max_length=128)
    # Replace padding ids with -100 so padded positions are ignored by the loss.
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in targets["input_ids"]
    ]
    return inputs

tokenized = dataset.map(tokenize, batched=True)
```

<br>

### Filter by task type

```python
# Only chat compression examples
chat_only = dataset["train"].filter(lambda x: x["task_type"] == "compression_chat")

# Only reasoning examples (yes + no)
reasoning = dataset["train"].filter(lambda x: "reasoning" in x["task_type"])

# Rough English-only heuristic: keep inputs made entirely of ASCII characters.
# Unaccented Spanish text also passes, so treat the result as approximate.
english = dataset["train"].filter(lambda x: x["inputs"].isascii())
```

<br>

---

<br>

<div align="center">

## Dataset Info

</div>

<br>

| Field | Value |
|:------|:------|
| Total examples | 15,000 |
| Languages | Spanish (ES), English (EN) |
| Split | Train only |
| Format | JSONL |
| Input field | `inputs` |
| Target field | `targets` |
| Metadata field | `task_type` |

<br>

### Design decisions

<details>
<summary><strong>Why real text only?</strong></summary>
<br>

Early dataset versions used synthetic surrealist text ("the crystal dragon played cards with light caves"). Models trained on that data learned to compress fiction, not prompts.
Every example in this dataset is grounded in real developer workflows: git commands, API documentation, chat messages, markdown READMEs.

</details>

<details>
<summary><strong>Why include reasoning_no examples?</strong></summary>
<br>

A compression model that always compresses is dangerous. Legal clauses, pharmaceutical storage instructions, security protocols, and lab safety warnings must never be shortened. The `reasoning_no` task type teaches the model to recognize when information density is already at its minimum and compression would cause critical loss.

</details>

<details>
<summary><strong>Why bilingual?</strong></summary>
<br>

Real-world prompts mix languages. Developer documentation, error messages, and API calls often appear in English even when the surrounding conversation is in Spanish. A bilingual model handles both without needing language detection preprocessing.

</details>

<details>
<summary><strong>Why compare_compression?</strong></summary>
<br>

Compression quality is not binary. Two compressed versions of the same text can both be shorter than the original while one preserves meaning and one loses it. The `compare_compression` task trains the model to evaluate quality, not just apply reduction.

</details>

<br>

---

<br>

<div align="center">

## Related Projects

</div>

<br>

| Project | Description |
|:--------|:------------|
| Tsuki Model | Fine-tuned Flan-T5-Base for token compression *(coming soon)* |
| Tsuki CLI | Command-line interface for compressing prompts locally *(coming soon)* |

<br>

---

<br>

<div align="center">

## License

</div>

<br>

```
Apache License 2.0

Copyright (c) 2026 Tsuki-team

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

<br>

---

<br>

<div align="center">

**Built with real prompts, real markdown, and zero surrealist dragons.**

<br>

[![Tsuki Project](https://img.shields.io/badge/Tsuki_Project-2026-000000?style=for-the-badge)](https://huggingface.co/tsuki-team)

<br>

</div>
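
To get a feel for how much a single pair saves, the sketch below estimates the reduction for the `compression_technical` example from this card. It counts whitespace-separated words as a crude stand-in for tokens, which is an assumption: real tokenizers count differently, so the figure is only indicative. `approx_tokens` and `compression_ratio` are hypothetical helper names, not part of any Tsuki tooling.

```python
def approx_tokens(text: str) -> int:
    """Crude token proxy: number of whitespace-separated words (assumption)."""
    return len(text.split())


def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of approximate tokens removed, e.g. 0.5 means half as long."""
    original_count = approx_tokens(original)
    if original_count == 0:
        return 0.0  # nothing to compress
    return 1.0 - approx_tokens(compressed) / original_count


# The compression_technical pair shown in the Examples section.
original = (
    "Para reiniciar completamente la base de datos de desarrollo y volver a "
    "aplicar todas las migraciones desde cero, lo que te dará una base de datos "
    "limpia en su estado inicial, ejecuta el siguiente comando de Prisma: "
    "npx prisma migrate reset --force"
)
compressed = "Ejecuta: npx prisma migrate reset --force"

ratio = compression_ratio(original, compressed)
print(f"approximate reduction: {ratio:.0%}")
```

For cost estimates against a specific API, swapping `approx_tokens` for that provider's own tokenizer would give more realistic numbers.
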

Provider:

tsuki-team
