Polygl0t/gigaverbo-v2-sft
Source: Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
---
dataset_info:
- config_name: code
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 160138633
    num_examples: 80774
  download_size: 160138633
  dataset_size: 160138633
- config_name: function_call
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 21139767
    num_examples: 45891
  download_size: 21139767
  dataset_size: 21139767
- config_name: general
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 1602583173
    num_examples: 1235976
  download_size: 1602583173
  dataset_size: 1602583173
- config_name: math
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 101517517
    num_examples: 220042
  download_size: 101517517
  dataset_size: 101517517
- config_name: math_cot
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 31966848
    num_examples: 63413
  download_size: 31966848
  dataset_size: 31966848
- config_name: reasoning
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 51113482
    num_examples: 78249
  download_size: 51113482
  dataset_size: 51113482
- config_name: retrieval
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 2341386536
    num_examples: 1977667
  download_size: 2341386536
  dataset_size: 2341386536
- config_name: rewriting
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 6169877
    num_examples: 29150
  download_size: 6169877
  dataset_size: 6169877
- config_name: structured
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 138260820
    num_examples: 163542
  download_size: 138260820
  dataset_size: 138260820
- config_name: summarization
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 245532429
    num_examples: 128669
  download_size: 245532429
  dataset_size: 245532429
- config_name: system_prompts
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 18434429
    num_examples: 20512
  download_size: 18434429
  dataset_size: 18434429
- config_name: translation
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: token_count
    dtype: int64
  - name: task_type
    dtype: string
  - name: instruct_score
    dtype: float64
  - name: instruct_int_score
    dtype: int64
  splits:
  - name: train
    num_bytes: 10878806
    num_examples: 45204
  download_size: 10878806
  dataset_size: 10878806
configs:
- config_name: code
  default: true
  data_files:
  - split: train
    path: code/train-*
- config_name: function_call
  data_files:
  - split: train
    path: function_call/train-*
- config_name: general
  data_files:
  - split: train
    path: general/train-*
- config_name: math
  data_files:
  - split: train
    path: math/train-*
- config_name: math_cot
  data_files:
  - split: train
    path: math_cot/train-*
- config_name: reasoning
  data_files:
  - split: train
    path: reasoning/train-*
- config_name: retrieval
  data_files:
  - split: train
    path: retrieval/train-*
- config_name: rewriting
  data_files:
  - split: train
    path: rewriting/train-*
- config_name: structured
  data_files:
  - split: train
    path: structured/train-*
- config_name: summarization
  data_files:
  - split: train
    path: summarization/train-*
- config_name: system_prompts
  data_files:
  - split: train
    path: system_prompts/train-*
- config_name: translation
  data_files:
  - split: train
    path: translation/train-*
license: apache-2.0
task_categories:
- text-generation
language:
- pt
tags:
- Portuguese
pretty_name: GigaVerbo-v2 SFT
size_categories:
- 1M<n<10M
---
# GigaVerbo-v2 SFT: A Large-Scale Portuguese Instruction-Tuning Dataset
<img src="./logo.png" height="200">
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft
- **Repository:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft
- **Point of Contact:** [Polygl0t](mailto:polyglot@uni-bonn.de)
### Dataset Summary
GigaVerbo-v2 SFT is a large-scale instruction-tuning dataset designed for supervised fine-tuning of language models in Portuguese. The dataset comprises approximately 2.1 billion tokens (~4.4 GB) across 4 million instruction-following examples, organized into 12 distinct task categories. It is entirely composed of high-quality, LLM-generated data that has been carefully curated and filtered to ensure instruction-following quality. The dataset is intended for developing language models capable of diverse downstream applications, from code generation and reasoning tasks to retrieval-augmented generation.
### Supported Tasks and Leaderboards
This dataset can be utilized for the following tasks:
- **Code Generation:** Instruction-guided code writing and synthesis.
- **Function Calling:** API and structured command invocation.
- **General Instruction Following:** Diverse tasks across multiple domains.
- **Mathematical Reasoning:** Problem-solving, including step-by-step chain-of-thought (CoT) solutions (`math` and `math_cot` subsets).
- **Reasoning:** Samples with reasoning traces for complex problem-solving (`<think>`...`</think>` format).
- **Retrieval-Augmented Generation:** Context-aware question-answering.
- **Rewriting:** Paraphrasing and stylistic transformations.
- **Structured Output Generation:** Producing JSON-formatted data.
- **Summarization:** Extractive and abstractive summary generation.
- **System Prompts:** Role-based instruction adherence.
- **Translation:** Bilingual transfer between Portuguese and English.
In general, it is a dataset for instruction-following supervised fine-tuning, and can be used to train models for a wide variety of downstream applications.
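For the reasoning samples, the `<think>`...`</think>` trace usually has to be separated from the final answer before inspection or loss masking. The sketch below assumes that each assistant message carries at most one such block and that it appears at the start of the message; `split_reasoning` is a hypothetical helper, not part of the dataset tooling.

```python
import re
from typing import Optional, Tuple

# Assumption: at most one <think>...</think> block, located at the start of the message.
THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(content: str) -> Tuple[Optional[str], str]:
    """Return (reasoning_trace, final_answer); the trace is None when no block is found."""
    match = THINK_RE.match(content)
    if match is None:
        return None, content.strip()
    return match.group(1).strip(), content[match.end():].strip()


if __name__ == "__main__":
    trace, answer = split_reasoning("<think>\n2 + 2 = 4\n</think>\n\nA resposta é 4.")
    print(repr(trace), repr(answer))
```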
### Languages
Portuguese.
## Dataset Structure
### Data Instances
Each data instance in GigaVerbo-v2 SFT represents an instruction-following example in a conversational format. The dataset uses a standardized chat template adapted from Qwen3 series models, ensuring consistency across all task types and compatibility with modern conversational language model paradigms.
### Data Fields
The dataset includes the following fields per example:
- **messages:** a list of message objects, each containing:
  - **role:** the role of the message sender (e.g., "user", "assistant").
  - **content:** the text content of the message.
- **token_count:** the number of tokens in the example.
- **task_type:** the category of the task (e.g., "code", "reasoning").
- **instruct_score:** a continuous score (1-5) indicating the quality of instruction-following, as assessed by a fine-tuned quality assessment model ([Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier)).
- **instruct_int_score:** a discrete integer score (1-5) derived from the continuous instruct_score.
```json
{
  "messages": [
    { "role": "user", "content": "Qual a capital do Brasil?" },
    { "role": "assistant", "content": "Brasília." }
  ],
  "token_count": 12,
  "task_type": "general",
  "instruct_score": 4.5,
  "instruct_int_score": 5
}
```
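When preparing custom training pipelines, it can help to check records against the schema described above before use. The following is a minimal sketch: `validate_example` and the `ALLOWED_ROLES` set are illustrative assumptions (the card only names "user" and "assistant" as example roles), so adjust the role set to what you actually observe in the data.

```python
from typing import Any, Dict, List

# Assumption: typical chat roles; the card only lists "user" and "assistant" as examples.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the record looks well-formed."""
    problems: List[str] = []

    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] must be an object")
                continue
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}].role is unexpected: {msg.get('role')!r}")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}].content must be a string")

    if not isinstance(example.get("token_count"), int):
        problems.append("token_count must be an integer")
    if not isinstance(example.get("task_type"), str):
        problems.append("task_type must be a string")

    score = example.get("instruct_score")
    if not isinstance(score, (int, float)) or not 1 <= score <= 5:
        problems.append("instruct_score must be a number in [1, 5]")

    int_score = example.get("instruct_int_score")
    if not isinstance(int_score, int) or not 1 <= int_score <= 5:
        problems.append("instruct_int_score must be an integer in [1, 5]")

    return problems
```

An empty return value means every documented field is present with the expected type and range.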
### Subsets and Splits
GigaVerbo-v2 SFT contains 12 task-based subsets. Each subset corresponds to a specific task category, allowing users to select data relevant to their needs. The dataset is provided as a single "train" split, as it is intended for supervised fine-tuning rather than evaluation. Users can create custom splits for validation and testing as needed.
```python
from datasets import load_dataset

# Without a config name, this loads the default subset (`code`), not all 12 subsets
ds = load_dataset("Polygl0t/gigaverbo-v2-sft", split="train")

# Load a specific task-based subset (e.g., reasoning tasks)
ds_reasoning = load_dataset("Polygl0t/gigaverbo-v2-sft", "reasoning", split="train")

# Streaming mode for limited bandwidth or disk space (also resolves to the default subset)
ds_streaming = load_dataset("Polygl0t/gigaverbo-v2-sft", split="train", streaming=True)
```
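Because the card metadata marks `code` as the default config, a call without a config name should resolve to that subset only. Combining all 12 subsets therefore means loading each one by name. The sketch below is one way to do that; `load_subsets` and its `loader` parameter are hypothetical names introduced here (the parameter exists so the function can be exercised without network access), while `load_dataset` is the real `datasets` entry point.

```python
from typing import Any, Callable, Dict, Optional, Sequence

REPO_ID = "Polygl0t/gigaverbo-v2-sft"

# Config names as listed in the card metadata
SUBSETS = [
    "code", "function_call", "general", "math", "math_cot", "reasoning",
    "retrieval", "rewriting", "structured", "summarization",
    "system_prompts", "translation",
]


def load_subsets(
    names: Sequence[str] = SUBSETS,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Load the train split of each named subset and return them keyed by subset name."""
    if loader is None:
        # Imported lazily so that the module can be imported without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return {name: loader(REPO_ID, name, split="train", **kwargs) for name in names}
```

Since all subsets share the same features, the non-streaming result can be merged with `datasets.concatenate_datasets(list(load_subsets().values()))`. Passing `names=["math", "math_cot"]` restricts the download to the subsets you need.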
#### Statistics
| Subset | Files | Rows | Size | Tokens |
| --------------------------------- | ------ | ------------- | ----------- | ----------------- |
| Code (`code`) | 1 | 80,774 | 0.15 GB | 84,389,567 |
| Function Call (`function_call`) | 1 | 45,891 | 0.02 GB | 28,712,972 |
| General (`general`) | 3 | 1,235,976 | 1.49 GB | 700,483,545 |
| Math (`math`) | 1 | 220,042 | 0.09 GB | 83,234,218 |
| Math CoT (`math_cot`) | 1 | 63,413 | 0.03 GB | 27,107,305 |
| Reasoning (`reasoning`) | 1 | 78,249 | 0.05 GB | 34,786,173 |
| Retrieval (`retrieval`) | 4 | 1,977,667 | 2.18 GB | 1,013,172,488 |
| Rewriting (`rewriting`) | 1 | 29,150 | 0.01 GB | 3,674,384 |
| Structured (`structured`) | 1 | 163,542 | 0.13 GB | 70,632,221 |
| Summarization (`summarization`) | 1 | 128,669 | 0.23 GB | 90,310,108 |
| System Prompts (`system_prompts`) | 1 | 20,512 | 0.02 GB | 7,261,615 |
| Translation (`translation`) | 1 | 45,204 | 0.01 GB | 7,877,426 |
| **Total** | **17** | **4,089,089** | **4.40 GB** | **2,151,642,022** |
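The table can also be used to reason about data mixtures. As an illustration, the sketch below recomputes the totals from the per-subset rows and derives each subset's share of the total token count, which is one possible starting point for sampling weights. The numbers are copied from the table; the idea of using token share as a mixture weight is an assumption of this example, not a recommendation from the dataset authors.

```python
# (rows, tokens) per subset, copied from the statistics table
STATS = {
    "code": (80_774, 84_389_567),
    "function_call": (45_891, 28_712_972),
    "general": (1_235_976, 700_483_545),
    "math": (220_042, 83_234_218),
    "math_cot": (63_413, 27_107_305),
    "reasoning": (78_249, 34_786_173),
    "retrieval": (1_977_667, 1_013_172_488),
    "rewriting": (29_150, 3_674_384),
    "structured": (163_542, 70_632_221),
    "summarization": (128_669, 90_310_108),
    "system_prompts": (20_512, 7_261_615),
    "translation": (45_204, 7_877_426),
}

total_rows = sum(rows for rows, _ in STATS.values())
total_tokens = sum(tokens for _, tokens in STATS.values())

# Fraction of all tokens contributed by each subset
token_share = {name: tokens / total_tokens for name, (_, tokens) in STATS.items()}

if __name__ == "__main__":
    for name, share in sorted(token_share.items(), key=lambda kv: kv[1], reverse=True):
        rows, tokens = STATS[name]
        print(f"{name:<15} {share:6.1%}  avg tokens/example: {tokens / rows:,.0f}")
```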
<details>
<summary><b>Tokens per Subset</b></summary>

[Figure: Tokens per Subset]

</details>
<details>
<summary><b>Quality Score Distribution</b></summary>

[Figure: Quality Score Distribution]

</details>
## Dataset Creation
### Curation Rationale
To create GigaVerbo-v2 SFT, we began by defining a comprehensive set of tasks that would adequately prepare language models for various downstream applications. Our goal was to establish a diverse mix, allowing developers to either target specific skills as needed or develop models with hybrid proficiencies. This task taxonomy served as the foundation for our generation efforts.
### Source Data
#### Data Collection Process
The dataset construction process involved collecting an extensive set of prompts and instructions from multiple public sources, including existing instruction-tuning datasets and community contributions. These prompts originated in both English and Portuguese, and we standardized them by translating all English content to Portuguese using [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). This translation approach was carefully designed to preserve the original intent and context while maintaining fluency and naturalness in Portuguese. We paid particular attention to avoiding formatting inconsistencies that commonly arise during translation, such as problems with lists, code blocks, and LaTeX expressions.
#### Synthetic Data Generation
A significant portion of our dataset was generated synthetically using the Qwen2.5 series of models. The Qwen2.5 series was used to either directly generate instruction-following examples from the collected prompts or to augment existing examples with additional variations and reasoning traces.
#### Reasoning-Intensive Tasks
For reasoning-intensive tasks, we implemented a two-step generation process. First, we prompted the model to generate a high-quality response to the instruction. Then, we used a follow-up prompt to elicit a detailed reasoning trace that explained the model's thought process in arriving at the answer. This approach allowed us to create examples that not only provided correct answers but also included rich reasoning steps, which are crucial for training models on complex problem-solving tasks.
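The two-step process can be sketched as follows. Everything specific here is an assumption made for illustration: `generate` stands in for any chat-completion call, `TRACE_PROMPT` is a made-up follow-up prompt (the prompts actually used are not given in this card), and placing the trace in a `<think>` block ahead of the answer is our reading of the reasoning format mentioned earlier.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
# Hypothetical interface: maps a chat history to a single completion string.
Generate = Callable[[List[Message]], str]

# Hypothetical follow-up prompt; the wording used to build the dataset is not published here.
TRACE_PROMPT = "Explique passo a passo o raciocínio que levou à resposta acima."


def build_reasoning_example(instruction: str, generate: Generate) -> Dict[str, object]:
    """Run the answer-then-explain procedure and assemble one chat-format record."""
    # Step 1: obtain a response to the instruction.
    history: List[Message] = [{"role": "user", "content": instruction}]
    answer = generate(history)

    # Step 2: ask a follow-up question to elicit the reasoning behind that response.
    history = history + [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": TRACE_PROMPT},
    ]
    trace = generate(history)

    # Assemble the final sample: reasoning trace first, then the answer (assumed layout).
    content = f"<think>\n{trace.strip()}\n</think>\n\n{answer.strip()}"
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": content},
        ],
        "task_type": "reasoning",
    }
```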
#### Chat Template Standardization
All generated data was formatted to fit a chat template we adapted from the Qwen3 series models. The exact templates used by our models are available for the [Instruct](https://huggingface.co/Polygl0t/Tucano2-qwen-3.7B-Instruct/blob/main/chat_template.jinja) and [Think](https://huggingface.co/Polygl0t/Tucano2-qwen-3.7B-Think/blob/main/chat_template.jinja) variants.
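To give a rough idea of how a `messages` list maps to training text, the sketch below renders a conversation with the ChatML-style markers (`<|im_start|>`, `<|im_end|>`) used by Qwen models. This is a simplified approximation: the adapted templates linked above are the authoritative reference and may differ in details such as default system prompts or the handling of reasoning traces. `render_chatml` is a hypothetical helper.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages as ChatML-style text (an approximation of the real template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so that a model can continue from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([
        {"role": "user", "content": "Qual a capital do Brasil?"},
        {"role": "assistant", "content": "Brasília."},
    ]))
```

In practice, prefer `tokenizer.apply_chat_template(messages, tokenize=False)` from `transformers` with a tokenizer that ships the template you intend to train with.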
### Annotations
#### Quality Assessment Framework
Given that GigaVerbo-v2 SFT is entirely composed of LLM-generated data, we implemented a filtering mechanism to ensure quality and relevance. Following an LLM-as-a-judge approach, we designed a custom evaluation prompt that instructed the judge model to assess how well the assistant's response adhered to the user's instruction. The prompt included detailed scoring criteria and required the model to provide a numerical score (1-5) along with a brief justification for its rating.
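Judge outputs of this kind need to be parsed and validated before they can serve as annotations. The sketch below assumes that the judge replies with a JSON object and uses the hypothetical keys `score` and `reason`; the actual output schema is defined by the evaluation prompt, which is not reproduced in this card.

```python
import json
import re
from typing import Optional, Tuple


def parse_judgment(raw: str) -> Optional[Tuple[int, str]]:
    """Extract (score, justification) from a judge reply, or None if it is malformed."""
    # Tolerate extra text around the JSON object by taking the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    score = data.get("score")    # hypothetical key
    reason = data.get("reason", "")  # hypothetical key
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return score, str(reason).strip()
```

Replies that fail to parse can be discarded or re-queried, so that only well-formed scores enter the annotation set.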
#### Annotation Process
We used Qwen2.5-32B-Instruct to evaluate 500,000 randomly sampled interactions from our dataset, generating an annotation dataset for instruction-following quality assessment. These annotations are available on Hugging Face in the [Polygl0t/portuguese-instruct-quality-qwen-annotations](https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations) repository, released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
#### Quality Assessment Models
Using these annotations, we trained a dedicated quality assessment model that could efficiently filter the entire GigaVerbo-v2 SFT dataset. Given the complexity of this evaluation task, we fine-tuned [Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) into two separate models:
**Classification Model (Quality Scoring):**
- Task: Regression-based continuous score prediction (1-5)
- Training: 2 epochs with batch size 64, context length 6,032 tokens
- Optimizer: AdamW with cosine learning rate scheduler, 100 warmup steps
- Learning rate: 5e-5
- Frozen layers: Embedding layer only
- Performance: F1-macro score of 0.80 on validation set
  - At threshold ≥3 (acceptable quality): F1-score of 0.98
**Conditional Generation Model (Quality Explanation):**
- Task: Generate JSON object with score and reasoning
- Training: 3 epochs with batch size 64, context length 6,032 tokens
- Optimizer: AdamW with cosine learning rate scheduler, 100 warmup steps
- Learning rate: 5e-5
- Frozen layers: Embedding layer only
- Custom chat template for standardized output format
- Performance: F1-macro score of 0.72 on validation set
**Model Selection:**
The classification model outperformed the conditional generation model and was selected as the primary filter for GigaVerbo-v2 SFT. Samples receiving a quality score below 3 were filtered out. The conditional generation model remains available for generating explanations in future applications.
Both fine-tuned quality assessment models are available at Hugging Face:
- [Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier)
- [Polygl0t/portuguese-qwen3-4b-instruct-quality-judge](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-judge)
Both are released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
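To make the two reported metrics concrete, the sketch below computes a macro-averaged F1 over the integer classes and a binary F1 after thresholding scores at 3. It runs on toy data and is only an illustration: how the continuous predictions were mapped to classes for the reported numbers is not specified here, so the rounding step in the usage example is an assumption.

```python
from typing import Dict, List, Sequence


def f1_per_class(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """Compute the F1 score of every label that occurs in y_true or y_pred."""
    scores: Dict[int, float] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def binarize(scores: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Map scores to 1 (acceptable, score >= threshold) or 0 (rejected)."""
    return [int(s >= threshold) for s in scores]


if __name__ == "__main__":
    gold = [1, 2, 3, 4, 5, 3]                  # toy judge annotations
    pred = [1.2, 2.9, 3.1, 4.4, 4.8, 2.6]      # toy regression outputs
    rounded = [round(p) for p in pred]         # assumption: round to the nearest class
    print("macro F1:", macro_f1(gold, rounded))
    print("F1 at threshold 3:", f1_per_class(binarize(gold), binarize(pred))[1])
```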
#### Final Filtering and Decontamination
We applied the same decontamination and language filtering procedures used in GigaVerbo-v2-Synth to remove benchmark-contaminated samples and samples containing characters outside the permitted Latin Unicode range. We also applied several heuristic filters to remove low-quality examples (e.g., samples that end in incomplete sentences or contain ill-formatted code blocks). In addition, we removed all samples with a quality score below 3.5, as assessed by our quality filters. After all filtering steps, the final dataset contains approximately 2.1 billion tokens across 4 million examples.
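A simplified version of the score cutoff and two of the heuristics mentioned above can be sketched as follows. Only the 3.5 cutoff comes from this card; the fence-balance check and the `CLEAN_ENDINGS` list are hypothetical stand-ins for the actual heuristics, and decontamination and language filtering are not covered.

```python
from typing import Any, Dict, Iterable, List

MIN_SCORE = 3.5    # cutoff stated in this card
FENCE = "`" * 3    # Markdown code-fence marker
# Hypothetical heuristic: characters accepted as a complete ending for a reply.
CLEAN_ENDINGS = (".", "!", "?", ":", ")", "]", "}", '"', "`")


def has_balanced_fences(text: str) -> bool:
    """An odd number of fence markers suggests an unclosed code block."""
    return text.count(FENCE) % 2 == 0


def ends_cleanly(text: str) -> bool:
    """True when the text is non-empty and ends with one of CLEAN_ENDINGS."""
    stripped = text.rstrip()
    return bool(stripped) and stripped.endswith(CLEAN_ENDINGS)


def keep_example(example: Dict[str, Any]) -> bool:
    """Apply the score cutoff, then the formatting heuristics to the assistant replies."""
    if example["instruct_score"] < MIN_SCORE:
        return False
    replies = [m["content"] for m in example["messages"] if m["role"] == "assistant"]
    if not replies:
        return False
    return all(has_balanced_fences(r) for r in replies) and ends_cleanly(replies[-1])


def filter_examples(examples: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the examples that pass every check."""
    return [ex for ex in examples if keep_example(ex)]
```

Since the released data has already been filtered, the main practical use is applying a stricter cutoff, for example by raising `MIN_SCORE` and passing `keep_example` to `Dataset.filter`.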
### Personal and Sensitive Information
Since GigaVerbo-v2 SFT consists of synthetically generated instruction-following examples based on curated prompts, the dataset contains minimal personal and sensitive information. However, users should be aware that generated text may contain hypothetical scenarios or examples that reference sensitive topics.
## Considerations for Using the Data
### Social Impact of Dataset
The creation of a large-scale Portuguese instruction-tuning dataset has significant potential to advance the development of instruction-following language models for Portuguese speakers.
## Additional Information
### Dataset Maintainers
- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de)
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com)
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de)
### Licensing Information
GigaVerbo-v2 SFT is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
```bibtex
@misc{correa2026tucano2cool,
title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
year={2026},
eprint={2603.03543},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.03543},
}
```
### Acknowledgments
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We gratefully acknowledge the access granted to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin), hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.
### Contributions
If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
Provided by: Polygl0t