typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset

Name: typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset
Creator: maas
Published: 2025-09-01 16:32:50
License: not specified

Source: ModelScope community. Updated: 2025-09-01. Indexed: 2025-05-24.

Download link:

https://modelscope.cn/datasets/scb10x/typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset

Resource overview:

# Typhoon T1 3B ICLR 2025 SCI-FM Workshop Dataset

**Paper Title**: Typhoon T1: An Open Thai Reasoning Model

**Venue**: Open Science for Foundation Models (SCI-FM), ICLR 2025

**Paper Link**: [https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042)

**Authors**: Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, and Kunat Pipatanakul

## Dataset Details

This dataset is part of the experiments in the paper [Typhoon T1: An Open Thai Reasoning Model](https://arxiv.org/abs/2502.09042), accepted at SCI-FM, ICLR 2025. Please refer to the paper for more details. It's available in Alpaca format (`{instruction, input, output}`), although `input` for all records is null.

## Data Splits

- `train_structured`: This split contains a structured thinking training set used for the experiments in Sections 3.1–3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`.
- `train_unstructured`: This split contains an unstructured thinking training set used for the experiment in Section 3.1.
- `train_semi_structured`: This split contains a semi-structured thinking training set used for the experiment in Section 3.1.
- `train_structured_thai`: This split contains 1.5K Thai-translated structured thinking training examples used for the experiments in Section 3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`.

## Data Mixture

This dataset consists of 55,677 records for SFT training with the following distribution:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/615313b0793ef66b3324da1f/xi6q1nydpQnzKNUGo2ITx.png)

## Attributes

- `instruction`: An instruction.
- `input`: All inputs are null in this dataset, but included for compatibility with trainers.
- `output`: Long-form reasoning generated using the approach described in our paper.

## Citation

```
@misc{taveekitworachai2025typhoont1openthai,
  title={Typhoon T1: An Open Thai Reasoning Model},
  author={Pittawat Taveekitworachai and Potsawee Manakul and Kasima Tharnpipitchai and Kunat Pipatanakul},
  year={2025},
  eprint={2502.09042},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.09042},
}
```
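## Usage Sketch

The split names, the fixed-seed subsampling recipe, and the null `input` field described in this card can be put together roughly as follows. This is a minimal sketch, not the authors' code. It assumes the Hugging Face `datasets` library, and it assumes the Hub repository ID matches the ModelScope path and that each split name can be passed to `load_dataset`. The helper names (`subsample`, `to_prompt`, `load_split`) and the prompt layout are hypothetical. Because `Dataset.select` takes an iterable of row indices, the card's `.select(n)` shorthand is written here as `.select(range(n))`.

```python
from typing import Any, Dict, Optional

# Split names as listed in the dataset card.
SPLITS = (
    "train_structured",
    "train_unstructured",
    "train_semi_structured",
    "train_structured_thai",
)


def subsample(split: Any, n: int, seed: int = 2024) -> Any:
    """Shuffle with a fixed seed, then keep the first n rows.

    Works on any object exposing `shuffle(seed=...)` and `select(indices)`,
    such as a Hugging Face `datasets.Dataset`.
    """
    return split.shuffle(seed=seed).select(range(n))


def to_prompt(record: Dict[str, Optional[str]]) -> str:
    """Build prompt text from an Alpaca-style record.

    The layout (instruction, blank line, input) is an assumption made for
    illustration; a null or empty `input` is simply skipped.
    """
    instruction = record["instruction"] or ""
    extra = record.get("input")
    return instruction if not extra else f"{instruction}\n\n{extra}"


def load_split(
    name: str,
    # Assumed Hub repository ID, mirroring the ModelScope path.
    repo_id: str = "scb10x/typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset",
) -> Any:
    """Load one split from the Hub (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # lazy import so the helpers above work without it

    return load_dataset(repo_id, split=name)
```

For example, `subsample(load_split("train_structured"), 1000)` would draw a reproducible 1,000-row subset under these assumptions. Whether it matches the exact subsets used in the paper depends on the library version and the value of `n`, neither of which this card states.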


Provider: maas

Created: 2025-05-23
