typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset
收藏魔搭社区2025-09-01 更新2025-05-24 收录
下载链接:
https://modelscope.cn/datasets/scb10x/typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset
下载链接
链接失效反馈官方服务:
资源简介:
# Typhoon T1 3B ICLR 2025 SCI-FM Workshop Dataset
**Paper Title**: Typhoon T1: An Open Thai Reasoning Model
**Venue**: Open Science for Foundation Models (SCI-FM), ICLR 2025
**Paper Link**: [https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042)
**Authors**: Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, and Kunat Pipatanakul
## Dataset Details
This dataset is part of the experiments in the paper [Typhoon T1: An Open Thai Reasoning Model](https://arxiv.org/abs/2502.09042), accepted at SCI-FM, ICLR 2025. Please refer to the paper for more details.
It's available in Alpaca format (`{instruction, input, output}`), although `input` for all records is null.
## Data Splits
- `train_structured`: This split contains a structured thinking training set used for the experiments in Sections 3.1–3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`.
- `train_unstructured`: This split contains an unstructured thinking training set used for the experiment in Section 3.1.
- `train_semi_structured`: This split contains a semi-structured thinking training set used for the experiment in Section 3.1.
- `train_structured_thai`: This split contains 1.5K Thai-translated structured thinking training examples used for the experiments in Section 3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`.
## Data Mixture
This dataset consists of 55,677 records for SFT training with the following distribution:

## Attributes
- `instruction`: An instruction.
- `input`: All inputs are null in this dataset, but included for compatibility with trainers.
- `output`: Long-form reasoning generated using the approach described in our paper.
## Citation
```
@misc{taveekitworachai2025typhoont1openthai,
title={Typhoon T1: An Open Thai Reasoning Model},
author={Pittawat Taveekitworachai and Potsawee Manakul and Kasima Tharnpipitchai and Kunat Pipatanakul},
year={2025},
eprint={2502.09042},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.09042},
}
```
# 台风T1 3B ICLR 2025 SCI-FM研讨会数据集
**论文标题**:Typhoon T1:一款开源泰语推理模型
**发表会议**:国际学习表征会议(ICLR 2025)旗下开放基础模型科学(SCI-FM)研讨会
**论文链接**:[https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042)
**作者**:Pittawat Taveekitworachai、Potsawee Manakul、Kasima Tharnpipitchai 与 Kunat Pipatanakul
## 数据集详情
本数据集为收录于ICLR 2025 SCI-FM研讨会论文《Typhoon T1:一款开源泰语推理模型》([https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042))的配套实验数据集,更多细节请参阅原文。
该数据集采用Alpaca格式存储,格式为`{"instruction", "input", "output"}`,不过所有条目的`input`字段均为null。
## 数据划分
- `train_structured`:该划分包含结构化思维训练集,用于论文3.1至3.4节的实验。若需对该划分进行下采样,我们采用了`.shuffle(seed=2024).select(n)`的操作流程。
- `train_unstructured`:该划分包含非结构化思维训练集,用于论文3.1节的实验。
- `train_semi_structured`:该划分包含半结构化思维训练集,用于论文3.1节的实验。
- `train_structured_thai`:该划分包含1500条经泰语翻译的结构化思维训练样本,用于论文3.4节的实验。若需对该划分进行下采样,我们采用了`.shuffle(seed=2024).select(n)`的操作流程。
## 数据混合方案
本数据集共包含55677条用于监督微调(Supervised Fine-Tuning, SFT)训练的样本,其分布如下:
## 数据字段说明
- `instruction`:指令文本。
- `input`:本数据集所有条目的该字段均为null,但为兼容各类训练器仍保留该字段。
- `output`:基于本文所述方法生成的长文本推理内容。
## 引用格式
@misc{taveekitworachai2025typhoont1openthai,
title={Typhoon T1: An Open Thai Reasoning Model},
author={Pittawat Taveekitworachai and Potsawee Manakul and Kasima Tharnpipitchai and Kunat Pipatanakul},
year={2025},
eprint={2502.09042},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.09042},
}
提供机构:
maas
创建时间:
2025-05-23



