five

typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset

收藏
魔搭社区2025-09-01 更新2025-05-24 收录
下载链接:
https://modelscope.cn/datasets/scb10x/typhoon-t1-3b-sci-fm-iclr-2025-exp-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
# Typhoon T1 3B ICLR 2025 SCI-FM Workshop Dataset **Paper Title**: Typhoon T1: An Open Thai Reasoning Model **Venue**: Open Science for Foundation Models (SCI-FM), ICLR 2025 **Paper Link**: [https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042) **Authors**: Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, and Kunat Pipatanakul ## Dataset Details This dataset is part of the experiments in the paper [Typhoon T1: An Open Thai Reasoning Model](https://arxiv.org/abs/2502.09042), accepted at SCI-FM, ICLR 2025. Please refer to the paper for more details. It's available in Alpaca format (`{instruction, input, output}`), although `input` for all records is null. ## Data Splits - `train_structured`: This split contains a structured thinking training set used for the experiments in Sections 3.1–3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`. - `train_unstructured`: This split contains an unstructured thinking training set used for the experiment in Section 3.1. - `train_semi_structured`: This split contains a semi-structured thinking training set used for the experiment in Section 3.1. - `train_structured_thai`: This split contains 1.5K Thai-translated structured thinking training examples used for the experiments in Section 3.4. For subsampling this split, we used `.shuffle(seed=2024).select(n)`. ## Data Mixture This dataset consists of 55,677 records for SFT training with the following distribution: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/615313b0793ef66b3324da1f/xi6q1nydpQnzKNUGo2ITx.png) ## Attributes - `instruction`: An instruction. - `input`: All inputs are null in this dataset, but included for compatibility with trainers. - `output`: Long-form reasoning generated using the approach described in our paper. ## Citation ``` @misc{taveekitworachai2025typhoont1openthai, title={Typhoon T1: An Open Thai Reasoning Model}, author={Pittawat Taveekitworachai and Potsawee Manakul and Kasima Tharnpipitchai and Kunat Pipatanakul}, year={2025}, eprint={2502.09042}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2502.09042}, } ```

# 台风T1 3B ICLR 2025 SCI-FM研讨会数据集 **论文标题**:Typhoon T1:一款开源泰语推理模型 **发表会议**:国际学习表征会议(ICLR 2025)旗下开放基础模型科学(SCI-FM)研讨会 **论文链接**:[https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042) **作者**:Pittawat Taveekitworachai、Potsawee Manakul、Kasima Tharnpipitchai 与 Kunat Pipatanakul ## 数据集详情 本数据集为收录于ICLR 2025 SCI-FM研讨会论文《Typhoon T1:一款开源泰语推理模型》([https://arxiv.org/abs/2502.09042](https://arxiv.org/abs/2502.09042))的配套实验数据集,更多细节请参阅原文。 该数据集采用Alpaca格式存储,格式为`{"instruction", "input", "output"}`,不过所有条目的`input`字段均为null。 ## 数据划分 - `train_structured`:该划分包含结构化思维训练集,用于论文3.1至3.4节的实验。若需对该划分进行下采样,我们采用了`.shuffle(seed=2024).select(n)`的操作流程。 - `train_unstructured`:该划分包含非结构化思维训练集,用于论文3.1节的实验。 - `train_semi_structured`:该划分包含半结构化思维训练集,用于论文3.1节的实验。 - `train_structured_thai`:该划分包含1500条经泰语翻译的结构化思维训练样本,用于论文3.4节的实验。若需对该划分进行下采样,我们采用了`.shuffle(seed=2024).select(n)`的操作流程。 ## 数据混合方案 本数据集共包含55677条用于监督微调(Supervised Fine-Tuning, SFT)训练的样本,其分布如下:![image/png](https://cdn-uploads.huggingface.co/production/uploads/615313b0793ef66b3324da1f/xi6q1nydpQnzKNUGo2ITx.png) ## 数据字段说明 - `instruction`:指令文本。 - `input`:本数据集所有条目的该字段均为null,但为兼容各类训练器仍保留该字段。 - `output`:基于本文所述方法生成的长文本推理内容。 ## 引用格式 @misc{taveekitworachai2025typhoont1openthai, title={Typhoon T1: An Open Thai Reasoning Model}, author={Pittawat Taveekitworachai and Potsawee Manakul and Kasima Tharnpipitchai and Kunat Pipatanakul}, year={2025}, eprint={2502.09042}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2502.09042}, }
提供机构:
maas
创建时间:
2025-05-23
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作