corre-social/s1_dataset_ptbr_1k_tokenized

Name: corre-social/s1_dataset_ptbr_1k_tokenized
Creator: corre-social
Published: 2025-12-10 14:47:36
License: MIT

Source: Hugging Face. Updated 2025-12-10. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/corre-social/s1_dataset_ptbr_1k_tokenized

Description:

---
dataset_info:
  features:
  - name: solution
    dtype: string
  - name: question
    dtype: string
  - name: cot_type
    dtype: string
  - name: source_type
    dtype: string
  - name: metadata
    dtype: string
  - name: cot
    dtype: 'null'
  - name: thinking_trajectories
    list: string
  - name: attempt
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 30842229
    num_examples: 1000
  download_size: 12612548
  dataset_size: 30842229
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- pt
tags:
- tokenized
- sft
- pre-processed
- tucano
- llama
size_categories:
- 1K<n<10K
source_datasets:
- corre-social/s1_dataset_ptbr_1k
---

# s1_dataset_ptbr_1k_tokenized

## Dataset Summary

This is the **s1_dataset_ptbr_1k** dataset, pre-processed and tokenized, ready for fine-tuning models based on the Llama/Tucano architecture. It was generated by applying a specific instruction-and-reasoning ("Thinking Process") template and then converted to token IDs with the **Tucano-1b1-Instruct** tokenizer.

- **Original Dataset:** [corre-social/s1_dataset_ptbr_1k](https://huggingface.co/datasets/corre-social/s1_dataset_ptbr_1k)
- **Base Tokenizer:** [TucanoBR/Tucano-1b1-Instruct](https://huggingface.co/TucanoBR/Tucano-1b1-Instruct)
- **Block Size (Context):** 2048 tokens

## Format and Template

The data was formatted with a specific structure intended to elicit *Chain of Thought* reasoning. The format applied before tokenization was:

```text
<instruction>
{pergunta_original}
</instruction>
<|im_start|>think
{thinking_trajectories}
{solution}
```
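As an illustration of the template described above, here is a minimal Python sketch that renders one record into the visible part of the format. It is not the authors' preprocessing script. The function name `format_example`, the sample record, the placement of line breaks, and the choice to join `thinking_trajectories` with newlines are all assumptions. The template in the card is cut off after `{solution}`, so whatever follows it in the real pipeline is not reproduced here.

```python
from typing import Any

# Assumed layout: the line breaks are a guess, and the template in the card
# is truncated after {solution}, so any trailing markers are not included.
TEMPLATE = (
    "<instruction>\n"
    "{question}\n"
    "</instruction>\n"
    "<|im_start|>think\n"
    "{thinking}\n"
    "{solution}"
)


def format_example(record: dict[str, Any]) -> str:
    """Render one record into the visible part of the prompt template."""
    # dataset_info declares thinking_trajectories as a list of strings;
    # joining the entries with a newline is an assumption.
    thinking = "\n".join(record["thinking_trajectories"])
    return TEMPLATE.format(
        question=record["question"],
        thinking=thinking,
        solution=record["solution"],
    )


# Hypothetical record that reuses the column names from dataset_info.
example = {
    "question": "Quanto é 2 + 2?",
    "thinking_trajectories": ["Somar 2 e 2 dá 4."],
    "solution": "4",
}
formatted = format_example(example)
print(formatted)
```

To apply the same function to real rows, the dataset should be loadable with `datasets.load_dataset("corre-social/s1_dataset_ptbr_1k_tokenized", split="train")`, assuming the repository is publicly accessible. Comparing the output with the dataset's `text` column would show how closely this sketch matches the actual formatting.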

Provided by:

corre-social
