tsuki-team/Tsuki-dataset
Hugging Face, updated 2026-03-31, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/tsuki-team/Tsuki-dataset
Description:
---
license: apache-2.0
language:
- es
- en
task_categories:
- text-generation
- summarization
task_ids:
- text-simplification
- abstractive-qa
tags:
- token-compression
- prompt-optimization
- llm-cost-reduction
- flan-t5
- instruction-following
- bilingual
- tsuki
pretty_name: Tsuki Token Compression Dataset
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: tsuki_train_15k.jsonl
dataset_info:
  features:
  - name: inputs
    dtype: string
  - name: targets
    dtype: string
  - name: task_type
    dtype: string
  splits:
  - name: train
    num_examples: 15000
description: >
  Bilingual (es/en) dataset for training token compression models.
  Contains pairs of verbose text and compressed versions across 6 task types:
  compression_chat, compression_markdown, compression_technical,
  reasoning_yes, reasoning_no, and compare_compression.
  Designed to train models that reduce LLM API costs by compressing
  prompts without losing critical information.
---
<div align="center">
<br>
<img src="https://img.shields.io/badge/%E2%9C%A6-TSUKI-000000?style=for-the-badge&labelColor=000000" alt="Tsuki" height="50">
<br><br>
# Token Compression Training Dataset
**15,000 bilingual compression pairs. Prompts, markdown, and technical instructions.**<br>
**Trains models to reduce LLM API costs without losing critical information.**
<br>
<a href="#task-types"><img src="https://img.shields.io/badge/TASK_TYPES-000000?style=for-the-badge" alt="Task Types"></a>
<a href="#examples"><img src="https://img.shields.io/badge/EXAMPLES-000000?style=for-the-badge" alt="Examples"></a>
<a href="#dataset-info"><img src="https://img.shields.io/badge/DATASET_INFO-000000?style=for-the-badge" alt="Dataset Info"></a>
<br><br>
<br>
---
<br>
<table>
<tr>
<td width="50%" valign="top">
**Real compression pairs.**<br><br>
Verbose prompts → minimum tokens.<br>
Markdown sections → essential content.<br>
Technical instructions → direct commands.<br>
Reasoning examples included.
</td>
<td width="50%" valign="top">
**Knows when not to compress.**<br><br>
Legal text preserved intact.<br>
Medical instructions unchanged.<br>
Safety-critical content protected.<br>
Security protocols left untouched.
</td>
</tr>
</table>
<br>
</div>
---
<br>
<div align="center">
## What is this dataset?
</div>
<br>
This is the official training dataset for the **Tsuki token compression model**, maintained by the [Tsuki-team](https://huggingface.co/tsuki-team). It contains 15,000 curated input-output pairs designed to teach a model the difference between verbose text and its semantic minimum, without losing any critical information.
The dataset covers three real-world text types: **chat prompts** filled with pleasantries and filler, **markdown documents** with redundant introductions and verbose section headers, and **technical instructions** padded with unnecessary context. Each pair is accompanied by task metadata that enables multi-task fine-tuning.
Trained on this dataset, Tsuki reduces LLM API token usage by compressing verbose inputs before they reach GPT, Claude, or any other paid inference endpoint.
<br>
---
<br>
<div align="center">
## Task Types
</div>
<br>
| Task | Language | Description |
|:-----|:---------|:------------|
| `compression_chat` | ES / EN | Strip pleasantries, greetings, and filler from chat prompts |
| `compression_markdown` | ES / EN | Remove verbose introductions and redundant markdown sections |
| `compression_technical` | ES / EN | Reduce technical instructions to their essential commands |
| `reasoning_yes` | ES / EN | Analyze verbosity and compress with explicit justification |
| `reasoning_no` | ES / EN | Detect when compression would cause critical information loss |
| `compare_compression` | ES / EN | Evaluate two compressed versions and select the optimal one |
<br>
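
Because every row carries a `task_type` label, the mix of tasks can be checked before training. The sketch below is a minimal illustration, assuming rows are plain dictionaries with the three fields described on this card. The helper `task_distribution` and the sample rows are hypothetical and not part of any Tsuki tooling.

```python
from collections import Counter

# The six task types listed in the table above.
TASK_TYPES = {
    "compression_chat", "compression_markdown", "compression_technical",
    "reasoning_yes", "reasoning_no", "compare_compression",
}

def task_distribution(records):
    """Count records per task_type and reject labels outside the documented set."""
    counts = Counter(r["task_type"] for r in records)
    unknown = set(counts) - TASK_TYPES
    if unknown:
        raise ValueError(f"unknown task types: {sorted(unknown)}")
    return counts

# Hypothetical rows, shaped like dataset records.
sample = [
    {"inputs": "...", "targets": "...", "task_type": "compression_chat"},
    {"inputs": "...", "targets": "...", "task_type": "compression_chat"},
    {"inputs": "...", "targets": "...", "task_type": "reasoning_no"},
]

for task, count in sorted(task_distribution(sample).items()):
    print(f"{task}: {count}")
```

The same function accepts a loaded `datasets` split, since iterating over a split yields dictionaries.
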
---
<br>
<div align="center">
## Examples
</div>
<br>
**compression_chat**
```
Input:
"Hola! Espero que estés muy bien. Quería preguntarte, si no es mucha molestia,
si podrías explicarme cómo funciona exactamente el proceso de autenticación
con JWT en una API REST"
Output:
"¿Cómo funciona autenticación JWT en API REST?"
```
<br>
**compression_technical**
```
Input:
"Para reiniciar completamente la base de datos de desarrollo y volver a aplicar
todas las migraciones desde cero, lo que te dará una base de datos limpia en su
estado inicial, ejecuta el siguiente comando de Prisma:
npx prisma migrate reset --force"
Output:
"Ejecuta: npx prisma migrate reset --force"
```
<br>
**compression_markdown**
```
Input:
"## TypeScript generics
En esta sección se explica de manera detallada y completa cómo funciona el uso
de genéricos en TypeScript para crear funciones y tipos reutilizables, incluyendo
todos los aspectos importantes que el desarrollador necesita conocer para
implementarlo correctamente en su proyecto."
Output:
"## TypeScript generics
Explica TypeScript generics: funciones y tipos reutilizables."
```
<br>
**reasoning_no**
```
Input:
"El producto debe almacenarse a temperatura entre 2°C y 8°C, alejado de la luz
directa, en posición vertical y fuera del alcance de niños menores de 3 años."
Output:
"No comprimir. Instrucciones de almacenamiento farmacéutico con rangos de
temperatura y condiciones específicas. Comprimir resultaría en pérdida crítica
de información. Riesgo: Alta."
```
<br>
**compare_compression**
```
Input:
Texto: "Si encuentras algún error durante el proceso de instalación, lo primero
que deberías hacer es revisar los logs de error, verificar que todos los puertos
requeridos estén disponibles y asegurarte de que las dependencias del sistema
estén instaladas."
A: "Error en instalación: revisa logs, verifica puertos disponibles, confirma dependencias."
B: "Si hay error en instalación: revisa logs de error, verifica puertos requeridos
y dependencias del sistema."
Output:
"Opción A es mejor. Elimina conectores redundantes y usa formato de lista implícita.
Opción B mantiene algunas palabras innecesarias."
```
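
To get a rough sense of how much a pair saves, the `compression_chat` example above can be measured directly. This is a minimal sketch that assumes whitespace-separated words as a stand-in for tokens; real tokenizers split text differently, so the figure is indicative only. The helpers `word_count` and `savings` are hypothetical names.

```python
def word_count(text):
    """Count whitespace-separated words, a crude proxy for tokens."""
    return len(text.split())

def savings(original, compressed):
    """Fraction of words removed by compression (0.0 means no reduction)."""
    return 1 - word_count(compressed) / word_count(original)

# The compression_chat pair shown above.
verbose = (
    "Hola! Espero que estés muy bien. Quería preguntarte, si no es mucha molestia, "
    "si podrías explicarme cómo funciona exactamente el proceso de autenticación "
    "con JWT en una API REST"
)
compressed = "¿Cómo funciona autenticación JWT en API REST?"

print(f"{word_count(verbose)} words -> {word_count(compressed)} words")
print(f"reduction: {savings(verbose, compressed):.0%}")
```

For billing estimates, swap `word_count` for the tokenizer of the target endpoint.
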
<br>
---
<br>
<div align="center">
## Usage
</div>
<br>
### Load with datasets
```python
from datasets import load_dataset

dataset = load_dataset("tsuki-team/Tsuki-dataset")
train = dataset["train"]
print(train[0])
# {
#   "inputs": "Convierte este prompt al mínimo de tokens...",
#   "targets": "...",
#   "task_type": "compression_chat"
# }
```
<br>
### Fine-tune with Flan-T5
```python
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
dataset = load_dataset("tsuki-team/Tsuki-dataset")

def tokenize(batch):
    # Tokenize source and target text to fixed lengths.
    inputs = tokenizer(batch["inputs"], truncation=True, padding="max_length", max_length=256)
    targets = tokenizer(batch["targets"], truncation=True, padding="max_length", max_length=128)
    # Replace padding ids with -100 so the loss ignores padded label positions.
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in targets["input_ids"]
    ]
    return inputs

tokenized = dataset.map(tokenize, batched=True)
```
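
The dataset ships with a train split only, so a hold-out set has to be carved out before fine-tuning if you want to evaluate. The following is a minimal sketch of a split stratified by `task_type`, assuming rows are plain dictionaries. The function `stratified_split`, its 10% default, and the sample rows are illustrative choices, not something this card prescribes.

```python
import random
from collections import defaultdict

def stratified_split(records, holdout_fraction=0.1, seed=0):
    """Split records into (train, holdout), keeping each task_type represented."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for record in records:
        by_task[record["task_type"]].append(record)

    train, holdout = [], []
    for task in sorted(by_task):
        rows = by_task[task][:]
        rng.shuffle(rows)
        # Hold out at least one row per task, unless the task has a single row.
        n_holdout = max(1, round(len(rows) * holdout_fraction)) if len(rows) > 1 else 0
        holdout.extend(rows[:n_holdout])
        train.extend(rows[n_holdout:])
    return train, holdout

# Hypothetical rows: 20 chat examples and 10 reasoning_no examples.
rows = (
    [{"inputs": f"chat {i}", "targets": "...", "task_type": "compression_chat"} for i in range(20)]
    + [{"inputs": f"keep {i}", "targets": "...", "task_type": "reasoning_no"} for i in range(10)]
)
train_rows, holdout_rows = stratified_split(rows)
print(len(train_rows), len(holdout_rows))
```

Stratifying matters here because a purely random split could leave a small task such as `reasoning_no` out of the evaluation set.
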
<br>
### Filter by task type
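```python
# Only chat compression examples
chat_only = dataset["train"].filter(lambda x: x["task_type"] == "compression_chat")

# Only reasoning examples (yes + no)
reasoning = dataset["train"].filter(lambda x: "reasoning" in x["task_type"])

# Likely-English examples. The dataset has no language field, so this is a rough
# heuristic: keep rows whose input contains no Spanish-specific characters.
SPANISH_CHARS = set("áéíóúüñ¿¡")
english = dataset["train"].filter(lambda x: not SPANISH_CHARS & set(x["inputs"].lower()))
```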
<br>
---
<br>
<div align="center">
## Dataset Info
</div>
<br>
| Field | Value |
|:------|:------|
| Total examples | 15,000 |
| Languages | Spanish (ES), English (EN) |
| Split | Train only |
| Format | JSONL |
| Input field | `inputs` |
| Target field | `targets` |
| Metadata field | `task_type` |
<br>
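
The JSONL file can also be read without the `datasets` library. Here is a minimal sketch, assuming one JSON object per line with the three string fields from the table above. The function `read_jsonl` is a hypothetical helper, and the in-memory stream stands in for `tsuki_train_15k.jsonl`.

```python
import io
import json

REQUIRED_FIELDS = ("inputs", "targets", "task_type")

def read_jsonl(stream):
    """Yield one validated record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                raise ValueError(f"line {line_number}: '{field}' must be a string")
        yield record

# Hypothetical two-line file standing in for tsuki_train_15k.jsonl.
fake_file = io.StringIO(
    '{"inputs": "a", "targets": "b", "task_type": "compression_chat"}\n'
    '{"inputs": "c", "targets": "d", "task_type": "reasoning_no"}\n'
)
records = list(read_jsonl(fake_file))
print(len(records))
```

With the real file, pass `open("tsuki_train_15k.jsonl", encoding="utf-8")` instead of the `StringIO` object.
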
### Design decisions
<details>
<summary><strong>Why real text only?</strong></summary>
<br>
Early dataset versions used synthetic surrealist text ("the crystal dragon played cards with light caves"). Models trained on that data learned to compress fiction, not prompts. Every example in this dataset is grounded in real developer workflows: git commands, API documentation, chat messages, markdown READMEs.
</details>
<details>
<summary><strong>Why include reasoning_no examples?</strong></summary>
<br>
A compression model that always compresses is dangerous. Legal clauses, pharmaceutical storage instructions, security protocols, and lab safety warnings must never be shortened. The `reasoning_no` task type teaches the model to recognize when information density is already at its minimum and compression would cause critical loss.
</details>
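
At inference time, a refusal of this kind has to be turned back into the untouched original before the prompt is forwarded. The sketch below is one way to do that, assuming the model signals a refusal by starting its output with "No comprimir", as in the `reasoning_no` example on this card. The function `apply_compression` and the marker tuple are hypothetical, and English refusals may use a different phrase.

```python
# Marker taken from the reasoning_no example; extend it for other phrasings.
REFUSAL_MARKERS = ("no comprimir",)

def apply_compression(original, model_output, markers=REFUSAL_MARKERS):
    """Return the text to forward: the original if the model declined, else its output."""
    if model_output.strip().lower().startswith(markers):
        return original
    return model_output

storage_note = "El producto debe almacenarse a temperatura entre 2°C y 8°C."
print(apply_compression(storage_note, "No comprimir. Riesgo: Alta."))
print(apply_compression("Hola! ¿Podrías explicarme JWT?", "¿Cómo funciona JWT?"))
```

Falling back to the original keeps the safe path as the default: a refusal costs tokens but never drops information.
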
<details>
<summary><strong>Why bilingual?</strong></summary>
<br>
Real-world prompts mix languages. Developer documentation, error messages, and API calls often appear in English even when the surrounding conversation is in Spanish. A bilingual model handles both without needing language detection preprocessing.
</details>
<details>
<summary><strong>Why compare_compression?</strong></summary>
<br>
Compression quality is not binary. Two compressed versions of the same text can both be shorter than the original while one preserves meaning and one loses it. The `compare_compression` task trains the model to evaluate quality, not just apply reduction.
</details>
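
To use a `compare_compression` output programmatically, the verdict has to be extracted from the free-text answer. The following is a minimal sketch, assuming the verdict names the winning option first, as in the example on this card ("Opción A es mejor. ..."). The function `pick_winner` is hypothetical, and the English spelling "Option" is an assumed variant that the card does not show.

```python
import re

# "Opción" comes from the example on this card; "Option" is an assumed English variant.
VERDICT = re.compile(r"\b(?:Opción|Option)\s+([AB])\b", re.IGNORECASE)

def pick_winner(model_output):
    """Return "A" or "B" for the first option named in the output, or None."""
    match = VERDICT.search(model_output)
    return match.group(1).upper() if match else None

answer = (
    "Opción A es mejor. Elimina conectores redundantes y usa formato de lista implícita. "
    "Opción B mantiene algunas palabras innecesarias."
)
print(pick_winner(answer))
```

Returning `None` when no verdict is found lets the caller fall back to the longer candidate instead of guessing.
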
<br>
---
<br>
<div align="center">
## Related Projects
</div>
<br>
| Project | Description |
|:--------|:------------|
| Tsuki Model | Fine-tuned Flan-T5-Base for token compression *(coming soon)* |
| Tsuki CLI | Command-line interface for compressing prompts locally *(coming soon)* |
<br>
---
<br>
<div align="center">
## License
</div>
<br>
```
Apache License 2.0
Copyright (c) 2026 Tsuki-team
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
<br>
---
<br>
<div align="center">
**Built with real prompts, real markdown, and zero surrealist dragons.**
<br>
<br>
</div>
Provided by:
tsuki-team



