Polygl0t/gigaverbo-v2-sft

Name: Polygl0t/gigaverbo-v2-sft
Creator: Polygl0t
Published: 2026-03-05 08:52:09
License: apache-2.0

Source: Hugging Face. Updated: 2026-03-05. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Polygl0t/gigaverbo-v2-sft

Resource description:

---
dataset_info:
- config_name: code
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 160138633
    num_examples: 80774
  download_size: 160138633
  dataset_size: 160138633
- config_name: function_call
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 21139767
    num_examples: 45891
  download_size: 21139767
  dataset_size: 21139767
- config_name: general
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 1602583173
    num_examples: 1235976
  download_size: 1602583173
  dataset_size: 1602583173
- config_name: math
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 101517517
    num_examples: 220042
  download_size: 101517517
  dataset_size: 101517517
- config_name: math_cot
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 31966848
    num_examples: 63413
  download_size: 31966848
  dataset_size: 31966848
- config_name: reasoning
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 51113482
    num_examples: 78249
  download_size: 51113482
  dataset_size: 51113482
- config_name: retrieval
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 2341386536
    num_examples: 1977667
  download_size: 2341386536
  dataset_size: 2341386536
- config_name: rewriting
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 6169877
    num_examples: 29150
  download_size: 6169877
  dataset_size: 6169877
- config_name: structured
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 138260820
    num_examples: 163542
  download_size: 138260820
  dataset_size: 138260820
- config_name: summarization
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 245532429
    num_examples: 128669
  download_size: 245532429
  dataset_size: 245532429
- config_name: system_prompts
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 18434429
    num_examples: 20512
  download_size: 18434429
  dataset_size: 18434429
- config_name: translation
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 10878806
    num_examples: 45204
  download_size: 10878806
  dataset_size: 10878806
configs:
- config_name: code
  default: true
  data_files:
  - split: train
    path: code/train-*
- config_name: function_call
  data_files:
  - split: train
    path: function_call/train-*
- config_name: general
  data_files:
  - split: train
    path: general/train-*
- config_name: math
  data_files:
  - split: train
    path: math/train-*
- config_name: math_cot
  data_files:
  - split: train
    path: math_cot/train-*
- config_name: reasoning
  data_files:
  - split: train
    path: reasoning/train-*
- config_name: retrieval
  data_files:
  - split: train
    path: retrieval/train-*
- config_name: rewriting
  data_files:
  - split: train
    path: rewriting/train-*
- config_name: structured
  data_files:
  - split: train
    path: structured/train-*
- config_name: summarization
  data_files:
  - split: train
    path: summarization/train-*
- config_name: system_prompts
  data_files:
  - split: train
    path: system_prompts/train-*
- config_name: translation
  data_files:
  - split: train
    path: translation/train-*
license: apache-2.0
task_categories:
- text-generation
language:
- pt
tags:
- Portuguese
pretty_name: GigaVerbo-v2 SFT
size_categories:
- 1M<n<10M
---

# GigaVerbo-v2 SFT: A Large-Scale Portuguese Instruction-Tuning Dataset

<img src="./logo.png" height="200">

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft
- **Repository:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft
- **Point of Contact:** [Polygl0t](mailto:polyglot@uni-bonn.de)

### Dataset Summary

GigaVerbo-v2 SFT is a large-scale instruction-tuning dataset designed for supervised fine-tuning of language models in Portuguese. The dataset comprises approximately 2.1 billion tokens (~4.4 GB) across about 4.1 million instruction-following examples, organized into 12 distinct task categories. It is composed entirely of LLM-generated data that has been curated and filtered for instruction-following quality. The dataset is intended for developing language models capable of diverse downstream applications, from code generation and reasoning tasks to retrieval-augmented generation.

### Supported Tasks and Leaderboards

This dataset can be used for the following tasks:

- **Code Generation:** Instruction-guided code writing and synthesis.
- **Function Calling:** API and structured command invocation.
- **General Instruction Following:** Diverse tasks across multiple domains.
- **Mathematical Reasoning:** Problem-solving with step-by-step solutions (CoT).
- **Reasoning:** Samples with reasoning traces for complex problem-solving (`<think>`...`</think>` format).
- **Retrieval-Augmented Generation:** Context-aware question-answering.
- **Rewriting:** Paraphrasing and stylistic transformations.
- **Structured Output Generation:** Producing formatted data in JSON format.
- **Summarization:** Extractive and abstractive summary generation.
- **System Prompts:** Role-based instruction adherence.
- **Translation:** Bilingual transfer between Portuguese and English.

In general, it is a dataset for instruction-following supervised fine-tuning, and can be used to train models for a wide variety of downstream applications.

### Languages

Portuguese.

## Dataset Structure

### Data Instances

Each data instance in GigaVerbo-v2 SFT represents an instruction-following example in a conversational format. The dataset uses a standardized chat template adapted from the Qwen3 series models, which keeps all task types consistent and compatible with current conversational language models.

### Data Fields

The dataset includes the following fields per example:

- **messages:** a list of message objects, each containing:
  - **role:** the role of the message sender (e.g., "user", "assistant").
  - **content:** the text content of the message.
- **token_count:** the number of tokens in the example.
- **task_type:** the category of the task (e.g., "code", "reasoning").
- **instruct_score:** a continuous score (1-5) indicating the quality of instruction-following, as assessed by a fine-tuned quality assessment model ([Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier)).
- **instruct_int_score:** a discrete integer score (1-5) derived from the continuous instruct_score.

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Qual a capital do Brasil?"
    },
    {
      "role": "assistant",
      "content": "Brasília."
    }
  ],
  "token_count": 12,
  "task_type": "general",
  "instruct_score": 4.5,
  "instruct_int_score": 5
}
```

### Subsets and Splits

GigaVerbo-v2 SFT contains 12 task-based subsets. Each subset corresponds to a specific task category, allowing users to select data relevant to their needs. Each subset is provided as a single "train" split, as the dataset is intended for supervised fine-tuning rather than evaluation. Users can create custom splits for validation and testing as needed.

```python
from datasets import load_dataset

# Load the default subset. With no config name, only the subset marked
# `default: true` in the card metadata (`code`) is loaded.
ds = load_dataset("Polygl0t/gigaverbo-v2-sft", split="train")

# Load a specific task-based subset (e.g., reasoning tasks)
ds_reasoning = load_dataset("Polygl0t/gigaverbo-v2-sft", "reasoning", split="train")

# Streaming mode for limited bandwidth
ds_streaming = load_dataset("Polygl0t/gigaverbo-v2-sft", split="train", streaming=True)
```

#### Statistics

| Subset                            | Files  | Rows          | Size        | Tokens            |
| --------------------------------- | ------ | ------------- | ----------- | ----------------- |
| Code (`code`)                     | 1      | 80,774        | 0.15 GB     | 84,389,567        |
| Function Call (`function_call`)   | 1      | 45,891        | 0.02 GB     | 28,712,972        |
| General (`general`)               | 3      | 1,235,976     | 1.49 GB     | 700,483,545       |
| Math (`math`)                     | 1      | 220,042       | 0.09 GB     | 83,234,218        |
| Math CoT (`math_cot`)             | 1      | 63,413        | 0.03 GB     | 27,107,305        |
| Reasoning (`reasoning`)           | 1      | 78,249        | 0.05 GB     | 34,786,173        |
| Retrieval (`retrieval`)           | 4      | 1,977,667     | 2.18 GB     | 1,013,172,488     |
| Rewriting (`rewriting`)           | 1      | 29,150        | 0.01 GB     | 3,674,384         |
| Structured (`structured`)         | 1      | 163,542       | 0.13 GB     | 70,632,221        |
| Summarization (`summarization`)   | 1      | 128,669       | 0.23 GB     | 90,310,108        |
| System Prompts (`system_prompts`) | 1      | 20,512        | 0.02 GB     | 7,261,615         |
| Translation (`translation`)       | 1      | 45,204        | 0.01 GB     | 7,877,426         |
| **Total**                         | **17** | **4,089,089** | **4.40 GB** | **2,151,642,022** |

<details>
<summary><b>Tokens per Subset</b></summary>

**Tokens per Subset**

![Tokens per Subset](./.plots/gigaverbo_v2_sft_tokens_per_subset.png)

</details>

<details>
<summary><b>Quality Score Distribution</b></summary>

**Quality Score Distribution**

![Quality Score Distribution](./.plots/gigaverbo_v2_sft_instruction_scores.png)

</details>

## Dataset Creation

### Curation Rationale

To create GigaVerbo-v2 SFT, we began by defining a set of tasks that would prepare language models for a range of downstream applications. Our goal was a diverse mix that lets developers either target specific skills or build models with hybrid proficiencies. This task taxonomy served as the foundation for our generation efforts.

### Source Data

#### Data Collection Process

We collected an extensive set of prompts and instructions from multiple public sources, including existing instruction-tuning datasets and community contributions. These prompts originated in both English and Portuguese, and we standardized them by translating all English content to Portuguese using [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). The translation approach was designed to preserve the original intent and context while keeping the Portuguese fluent and natural. We paid particular attention to avoiding formatting inconsistencies that commonly arise during translation, such as problems with lists, code blocks, and LaTeX expressions.

#### Synthetic Data Generation

A significant portion of our dataset was generated synthetically using the Qwen2.5 series of models. These models were used either to generate instruction-following examples directly from the collected prompts or to augment existing examples with additional variations and reasoning traces.

#### Reasoning-Intensive Tasks

For reasoning-intensive tasks, we implemented a two-step generation process. First, we prompted the model to generate a high-quality response to the instruction.
Then, we used a follow-up prompt to elicit a detailed reasoning trace that explained the model's thought process in arriving at the answer. This approach allowed us to create examples that not only provided correct answers but also included rich reasoning steps, which are crucial for training models on complex problem-solving tasks.

#### Chat Template Standardization

All of the generated data was formatted to fit a chat template we adapted from the Qwen3 series models. You can find the exact template used by our models ([Instruct](https://huggingface.co/Polygl0t/Tucano2-qwen-3.7B-Instruct/blob/main/chat_template.jinja) and [Think](https://huggingface.co/Polygl0t/Tucano2-qwen-3.7B-Think/blob/main/chat_template.jinja) variants) in the hyperlinks.

### Annotations

#### Quality Assessment Framework

Given that GigaVerbo-v2 SFT is entirely composed of LLM-generated data, we implemented a filtering mechanism to ensure quality and relevance. We employed an LLM-as-a-judge approach to design a custom evaluation prompt that instructed the model to assess how well the assistant's response adhered to the user's instruction. The prompt included detailed scoring criteria and required the model to provide a numerical score (1-5) along with a brief justification for its rating.

#### Annotation Process

We used Qwen2.5-32B-Instruct to evaluate 500,000 randomly sampled interactions from our dataset, generating an annotation dataset for instruction-following quality assessment. These annotations are available at Hugging Face under the [Polygl0t/portuguese-instruct-quality-qwen-annotations](https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations) repository, released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

#### Quality Assessment Models

Using these annotations, we trained a dedicated quality assessment model that could efficiently filter the entire GigaVerbo-v2 SFT dataset.
Given the complexity of this evaluation task, we fine-tuned [Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) into two separate models:

**Classification Model (Quality Scoring):**

- Task: Regression-based continuous score prediction (1-5)
- Training: 2 epochs with batch size 64, context length 6,032 tokens
- Optimizer: AdamW with cosine learning rate scheduler, 100 warmup steps
- Learning rate: 5e-5
- Frozen layers: Embedding layer only
- Performance: F1-macro score of 0.80 on validation set
  - At threshold ≥3 (acceptable quality): F1-score of 0.98

**Conditional Generation Model (Quality Explanation):**

- Task: Generate JSON object with score and reasoning
- Training: 3 epochs with batch size 64, context length 6,032 tokens
- Optimizer: AdamW with cosine learning rate scheduler, 100 warmup steps
- Learning rate: 5e-5
- Frozen layers: Embedding layer only
- Custom chat template for standardized output format
- Performance: F1-macro score of 0.72 on validation set

**Model Selection:** The classification model outperformed the conditional generation model and was selected as the primary filter for GigaVerbo-v2 SFT. Samples receiving a quality score below 3 were filtered out. The conditional generation model remains available for generating explanations in future applications.

Both fine-tuned quality assessment models are available at Hugging Face:

- [Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier)
- [Polygl0t/portuguese-qwen3-4b-instruct-quality-judge](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-judge)

Both are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

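
The score-based filter mentioned under Model Selection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper `keep_by_score` and the sample records are hypothetical, and only the `instruct_score` field name and the threshold of 3 come from this card.

```python
from typing import Iterable


def keep_by_score(records: Iterable[dict], threshold: float = 3.0) -> list[dict]:
    """Keep records whose continuous quality score is at least `threshold`."""
    return [r for r in records if r["instruct_score"] >= threshold]


# Hypothetical records that mimic two fields of the dataset schema.
samples = [
    {"task_type": "general", "instruct_score": 4.5},
    {"task_type": "code", "instruct_score": 2.9},
    {"task_type": "math", "instruct_score": 3.0},
]

kept = keep_by_score(samples)
print([r["task_type"] for r in kept])  # prints ['general', 'math']
```

On a loaded subset, the same predicate can be passed to `Dataset.filter` from the `datasets` library, for example `ds.filter(lambda r: r["instruct_score"] >= 3.0)`, with a stricter threshold if higher-quality samples are preferred.
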
#### Final Filtering and Decontamination

We applied the same decontamination and language filtering procedures used in GigaVerbo-v2-Synth to ensure that the final dataset was free from benchmark contamination and from samples containing characters outside the permitted Latin Unicode range. We also applied several heuristic filters to remove low-quality examples (e.g., samples that end in incomplete sentences or contain ill-formatted code blocks). In addition, we removed all samples with a quality score below 3.5, as assessed by our quality filters. After all filtering steps, we were left with a final dataset of approximately 2.1 billion tokens across about 4.1 million examples.

### Personal and Sensitive Information

Since GigaVerbo-v2 SFT consists of synthetically generated instruction-following examples based on curated prompts, the dataset contains minimal personal and sensitive information. However, users should be aware that generated text may contain hypothetical scenarios or examples that reference sensitive topics.

## Considerations for Using the Data

### Social Impact of Dataset

The creation of a large-scale Portuguese instruction-tuning dataset has significant potential to advance the development of instruction-following language models for Portuguese speakers.

## Additional Information

### Dataset Maintainers

- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de)
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com)
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de)

### Licensing Information

GigaVerbo-v2 SFT is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```bibtex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!

Provided by:

Polygl0t
