wannaphong/typhoon-s-sovereign-capability-dataset
Hugging Face | Updated 2026-03-19 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/wannaphong/typhoon-s-sovereign-capability-dataset
Resource summary:
---
pretty_name: Typhoon-S Instruct Post-Training
tags:
- reinforcement fine-tuning
- tool-use
- thai
- english
- sovereign-ai
task_categories:
- text-generation
language:
- th
- en
license: odc-by
---
# Typhoon-S Training Assets
Training and evaluation datasets for the Thai language models used in the Typhoon-S project (Section 3).
## Datasets
**NitiBench (Legal Domain)**
- `nitibench_train_rl.parquet` - RL training set (8,211 examples)
- `nitibench_train_pretrain.parquet` - Pretrain set (3,648 examples)
- `nitibench_train_sft.parquet` - SFT set (3,648 examples)
- `nitibench_test.parquet` - Test set (373 examples; 10% of the `ccl` split of https://huggingface.co/datasets/VISAI-AI/nitibench)
- `nitibench_train_rl_agent.parquet` - Agent RL training (8,211 examples)
Original sources:
- https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG
- https://huggingface.co/datasets/VISAI-AI/nitibench
**MIRAGE (General Domain)**
- `mirage_train_rl.parquet` - RL training set
- `mirage_train_pretrain.parquet` - Pretrain set
- `mirage_test.parquet` - Test set
Original sources:
- https://huggingface.co/datasets/nthakur/mirage-bench-instruct
- https://huggingface.co/datasets/nthakur/mirage-bench
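The example counts listed for NitiBench can double as a sanity check after download. The following is a minimal sketch and not part of the original dataset card: `EXPECTED_ROWS` copies the counts from the NitiBench list above, and `find_count_mismatches` is a hypothetical helper that compares them against row counts you have measured yourself (for instance with `len()` on a loaded split).

```python
# Expected NitiBench row counts, copied from the file list above.
EXPECTED_ROWS = {
    "nitibench_train_rl.parquet": 8211,
    "nitibench_train_pretrain.parquet": 3648,
    "nitibench_train_sft.parquet": 3648,
    "nitibench_test.parquet": 373,
    "nitibench_train_rl_agent.parquet": 8211,
}


def find_count_mismatches(actual_rows):
    """Return {file: (expected, actual)} for every file whose count differs.

    `actual_rows` maps file names to measured row counts. Files missing
    from it are reported with an actual count of None.
    Hypothetical helper, for illustration only.
    """
    mismatches = {}
    for name, expected in EXPECTED_ROWS.items():
        actual = actual_rows.get(name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

An empty result means every measured count agrees with the numbers stated in this card.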
## Usage
```python
from datasets import load_dataset

# Load a single file (a lone data file is exposed as the "train" split)
dataset = load_dataset(
    "typhoon-ai/typhoon-s-sovereign-capability-dataset",
    data_files="nitibench_train_rl.parquet",
)

# Load multiple files, mapping each one to a named split
dataset = load_dataset(
    "typhoon-ai/typhoon-s-sovereign-capability-dataset",
    data_files={
        "train": "nitibench_train_rl.parquet",
        "test": "nitibench_test.parquet",
    },
)
```
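The same `data_files` pattern extends to the MIRAGE files. The sketch below is an illustration rather than an official API: `build_data_files` and `load_benchmark` are hypothetical helpers, the split labels are arbitrary names chosen here, and the repository ID is simply the one used in the example above.

```python
# Repository ID as used in the example above.
REPO_ID = "typhoon-ai/typhoon-s-sovereign-capability-dataset"

# Split label -> file name. The file names come from the Datasets section;
# the labels are our own choice (assumption).
MIRAGE_FILES = {
    "train_rl": "mirage_train_rl.parquet",
    "train_pretrain": "mirage_train_pretrain.parquet",
    "test": "mirage_test.parquet",
}


def build_data_files(files, splits):
    """Select the requested split labels and return a data_files mapping."""
    unknown = [s for s in splits if s not in files]
    if unknown:
        raise KeyError(f"unknown split label(s): {unknown}")
    return {s: files[s] for s in splits}


def load_benchmark(files, splits):
    """Download the selected files from the Hub (requires network access)."""
    # Imported lazily so build_data_files stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, data_files=build_data_files(files, splits))
```

Calling `load_benchmark(MIRAGE_FILES, ["train_rl", "test"])` should return a `DatasetDict` keyed by those two labels.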
## More Information
For more details, see https://github.com/scb-10x/typhoon-s
## Citation
If you use this dataset, please cite the dataset repository and the associated Typhoon-S technical report:
```bibtex
@misc{pipatanakul2026typhoonsminimalopenposttraining,
      title={Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models},
      author={Kunat Pipatanakul and Pittawat Taveekitworachai},
      year={2026},
      eprint={2601.18129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18129},
}
```
Provided by:
wannaphong



