
wannaphong/typhoon-s-sovereign-capability-dataset

Source: Hugging Face. Updated: 2026-03-19. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/wannaphong/typhoon-s-sovereign-capability-dataset
Resource description:
---
pretty_name: Typhoon-S Instruct Post-Training
tags:
- reinforcement fine-tuning
- tool-use
- thai
- english
- sovereign-ai
task_categories:
- text-generation
language:
- th
- en
license: odc-by
---

# Typhoon-S Training Assets

Training and evaluation datasets for Section 3, Thai language models used in the Typhoon-S project.

## Datasets

**NitiBench (Legal Domain)**

- `nitibench_train_rl.parquet` - RL training set (8,211 examples)
- `nitibench_train_pretrain.parquet` - Pretrain set (3,648 examples)
- `nitibench_train_sft.parquet` - SFT set (3,648 examples)
- `nitibench_test.parquet` - Test set (373 examples) (10% of the https://huggingface.co/datasets/VISAI-AI/nitibench ccl split)
- `nitibench_train_rl_agent.parquet` - Agent RL training (8,211 examples)

Original source

- https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG
- https://huggingface.co/datasets/VISAI-AI/nitibench

**MIRAGE (General Domain)**

- `mirage_train_rl.parquet` - RL training set
- `mirage_train_pretrain.parquet` - Pretrain set
- `mirage_test.parquet` - Test set

Original source

- https://huggingface.co/datasets/nthakur/mirage-bench-instruct
- https://huggingface.co/datasets/nthakur/mirage-bench

## Usage

```python
from datasets import load_dataset

# Load a single file
dataset = load_dataset(
    "typhoon-ai/typhoon-s-sovereign-capability-dataset",
    data_files="nitibench_train_rl.parquet",
)

# Load multiple files
dataset = load_dataset(
    "typhoon-ai/typhoon-s-sovereign-capability-dataset",
    data_files={
        "train": "nitibench_train_rl.parquet",
        "test": "nitibench_test.parquet",
    },
)
```

## More Information

For more details, see https://github.com/scb-10x/typhoon-s

## Citation

If you use this dataset, please cite the dataset repository and the associated Typhoon-S technical report:

```bibtex
@misc{pipatanakul2026typhoonsminimalopenposttraining,
  title={Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models},
  author={Kunat Pipatanakul and Pittawat Taveekitworachai},
  year={2026},
  eprint={2601.18129},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.18129},
}
```
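
The file names in the dataset list follow a `<benchmark>_<split>.parquet` pattern, so the `data_files` mapping passed to `load_dataset` can be assembled programmatically. The sketch below is illustrative only: `LISTED_FILES`, `build_data_files`, and `load_benchmark` are hypothetical names that are not part of the dataset repository, and the repository ID is the one used in the Usage example. Example counts are copied from the list above; MIRAGE counts are not stated there and are left as `None`.

```python
# File variants and example counts copied from the dataset list in this README.
# MIRAGE counts are not stated in the README, so they are left as None.
LISTED_FILES = {
    "nitibench": {
        "train_rl": 8211,
        "train_pretrain": 3648,
        "train_sft": 3648,
        "test": 373,
        "train_rl_agent": 8211,
    },
    "mirage": {
        "train_rl": None,
        "train_pretrain": None,
        "test": None,
    },
}

# Repository ID as written in the Usage example (assumed to be reachable).
REPO_ID = "typhoon-ai/typhoon-s-sovereign-capability-dataset"


def build_data_files(benchmark, train_variant="train_rl"):
    """Hypothetical helper: map split names to the listed parquet file names."""
    variants = LISTED_FILES[benchmark]
    if train_variant == "test" or train_variant not in variants:
        raise ValueError(f"unknown training variant: {train_variant!r}")
    return {
        "train": f"{benchmark}_{train_variant}.parquet",
        "test": f"{benchmark}_test.parquet",
    }


def load_benchmark(benchmark, train_variant="train_rl"):
    """Load one train/test pair; requires the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(REPO_ID, data_files=build_data_files(benchmark, train_variant))
```

For example, `load_benchmark("nitibench", "train_sft")` would request `nitibench_train_sft.parquet` as the training split and `nitibench_test.parquet` as the test split. The counts in `LISTED_FILES` can serve as a sanity check against the number of rows actually loaded.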
Provider:
wannaphong