Polygl0t/gigaverbo-v2-preferences

Name: Polygl0t/gigaverbo-v2-preferences
Creator: Polygl0t
Published: 2026-03-05 08:52:43
License: Apache 2.0

Source: Hugging Face. Updated: 2026-03-05. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Polygl0t/gigaverbo-v2-preferences

Description:

---
dataset_info:
- config_name: harmfull-no-reasoning
  features:
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen_token_count
    dtype: int64
  - name: rejected_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 8928692
    num_examples: 4267
  download_size: 8928692
  dataset_size: 8928692
- config_name: harmfull-reasoning
  features:
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen_token_count
    dtype: int64
  - name: rejected_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 12236498
    num_examples: 4008
  download_size: 12236498
  dataset_size: 12236498
- config_name: harmless-no-reasoning
  features:
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen_token_count
    dtype: int64
  - name: rejected_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 21592169
    num_examples: 10521
  download_size: 21592169
  dataset_size: 21592169
- config_name: harmless-reasoning
  features:
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: chosen_token_count
    dtype: int64
  - name: rejected_token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 26074353
    num_examples: 9641
  download_size: 26074353
  dataset_size: 26074353
configs:
- config_name: harmfull-no-reasoning
  data_files:
  - split: train
    path: harmfull-no-reasoning/train-*
- config_name: harmfull-reasoning
  data_files:
  - split: train
    path: harmfull-reasoning/train-*
- config_name: harmless-no-reasoning
  default: true
  data_files:
  - split: train
    path: harmless-no-reasoning/train-*
- config_name: harmless-reasoning
  data_files:
  - split: train
    path: harmless-reasoning/train-*
license: apache-2.0
task_categories:
- text-generation
language:
- pt
tags:
- Portuguese
pretty_name: GigaVerbo-v2 Preferences
size_categories:
- 10K<n<100K
---

# GigaVerbo-v2 Preferences: A Hybrid-Reasoning Portuguese Preference Dataset

<img src="./logo.png" height="200">

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-preferences
- **Repository:** https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-preferences
- **Point of Contact:** [Polygl0t](mailto:polyglot@uni-bonn.de)

### Dataset Summary

GigaVerbo-v2 Preferences is a preference dataset designed for Direct Preference Optimization (DPO) and other direct alignment algorithms.
The dataset comprises approximately 27.8 million tokens across 28,437 preference pairs, organized into 4 distinct subsets covering both quality-focused and safety-focused alignment. It is entirely composed of high-quality, LLM-generated data using Constitutional AI approaches, carefully designed to support the development of language models that are both helpful and safe. As a hybrid reasoning/non-reasoning dataset, GigaVerbo-v2 Preferences enables training models that can generate reasoning traces in Portuguese while also learning to refuse harmful instructions effectively.

### Supported Tasks and Leaderboards

This dataset can be used for tasks involving preference-based training, such as direct alignment algorithms (e.g., DPO and its many variants). Samples can also be used for supervised fine-tuning (SFT) and reward model training.

### Languages

Portuguese.

## Dataset Structure

### Data Instances

Each data instance in GigaVerbo-v2 Preferences represents a preference pair consisting of a prompt and two alternative responses: a preferred (`chosen`) response and a less preferred (`rejected`) response. The dataset is structured to support direct preference optimization, where models learn to distinguish between high-quality and lower-quality outputs.

### Data Fields

The dataset includes the following fields per preference pair:

- **prompt:** the input instruction or question that prompted both responses.
- **chosen:** the preferred response, representing the desired model behavior for a given prompt.
- **rejected:** the less preferred response, representing the undesired or lower-quality model behavior.
- **chosen_token_count:** the number of tokens in the chosen response.
- **rejected_token_count:** the number of tokens in the rejected response.

```json
{
  "prompt": [{ "role": "user", "content": "Qual a capital do Brasil?" }],
  "chosen": [
    { "role": "assistant", "content": "A capital do Brasil é Brasília." }
  ],
  "rejected": [
    { "role": "assistant", "content": "A capital do Brasil é Rio de Janeiro." }
  ],
  "chosen_token_count": 7,
  "rejected_token_count": 9
}
```

### Subsets and Splits

GigaVerbo-v2 Preferences contains 4 preference-based subsets designed to address different alignment objectives. Each subset is provided as a single "train" split for direct preference optimization training. Users can create custom splits for validation and testing as needed.

```python
from datasets import concatenate_datasets, load_dataset

# Load the default subset (harmless-no-reasoning is marked `default: true`)
ds = load_dataset("Polygl0t/gigaverbo-v2-preferences", split="train")

# Load a specific subset (e.g., harmless reasoning preferences)
ds_harmless_reasoning = load_dataset("Polygl0t/gigaverbo-v2-preferences", "harmless-reasoning", split="train")

# Load the full dataset by concatenating all four subsets (note the "harmfull" spelling)
subsets = ["harmless-reasoning", "harmless-no-reasoning", "harmfull-reasoning", "harmfull-no-reasoning"]
ds_full = concatenate_datasets([load_dataset("Polygl0t/gigaverbo-v2-preferences", s, split="train") for s in subsets])

# Streaming mode for limited bandwidth
ds_streaming = load_dataset("Polygl0t/gigaverbo-v2-preferences", split="train", streaming=True)
```

#### Statistics

| Subset                  | Examples   | Chosen Tokens  | Rejected Tokens | Total Tokens   | Avg Chosen | Avg Rejected |
| ----------------------- | ---------- | -------------- | --------------- | -------------- | ---------- | ------------ |
| `harmless-reasoning`    | 9,641      | 6,105,673      | 4,571,353       | 10,677,026     | 633        | 474          |
| `harmless-no-reasoning` | 10,521     | 4,711,020      | 4,057,440       | 8,768,460      | 448        | 386          |
| `harmfull-reasoning`    | 4,008      | 1,462,284      | 3,242,908       | 4,705,192      | 365        | 809          |
| `harmfull-no-reasoning` | 4,267      | 809,240        | 2,795,035       | 3,604,275      | 190        | 655          |
| **Total**               | **28,437** | **13,088,217** | **14,666,736**  | **27,754,953** | -          | -            |

<details>
<summary><b>Tokens per Subset</b></summary>

**Tokens per Subset**

![Tokens per Subset](./.plots/gigaverbo_v2_preferences_tokens_per_subset.png)

</details>

<details>
<summary><b>Average Length Distribution</b></summary>

**Average Length Distribution**

![Average Length Distribution](./.plots/gigaverbo_v2_preferences_chosen_vs_rejected.png)

- More detailed histograms are available in the
[.plots](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-preferences/tree/main/.plots) folder.

</details>

## Dataset Creation

### Curation Rationale

Aligning large language models with human preferences and safety guidelines is a critical step in their development. During the creation of the first Tucano series of models, our DPO stage relied on the preference dataset released by Corrêa ([2024](https://arxiv.org/abs/2406.11039)), which contains 35,000 [preference pairs in Portuguese](https://huggingface.co/datasets/nicholasKluge/reward-aira-dataset). While this dataset provided a starting point, we identified the need for better curation and greater diversity in the preference pairs, as well as for introducing a reasoning/non-reasoning balance to support the training of models capable of generating reasoning traces in Portuguese.

Our design philosophy emphasizes both **quality-focused alignment** (helpfulness and reasoning) and **safety alignment** (refusal and risk mitigation) in a hybrid, reasoning/non-reasoning format.

### Source Data

#### Prompt Collection

The construction of GigaVerbo-v2 Preferences began by sourcing a diverse collection of prompts from open datasets, including:

- **[UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)**: diverse instruction-following examples
- **[HarmfulQA](https://huggingface.co/datasets/declare-lab/HarmfulQA)**: adversarial and safety-critical prompts

These prompts were selected to span the core alignment challenges faced by LLMs:

1. **Harmless (benign) requests**, which require high-quality, helpful, and well-reasoned answers.
2. **Harmful or unsafe requests**, which require consistent refusal behaviors aligned with safety principles.

#### Response Generation

To generate the corresponding preferred (`chosen`) and less preferred (`rejected`) responses, we employed a _Constitutional AI_ approach.
Distinct constitutions and steering strategies were used for each subset:

**Harmless Subset:**

- Preferred (`chosen`) responses were generated using [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) with chain-of-thought reasoning.
- Less preferred (`rejected`) responses were produced using [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct), also with step-by-step reasoning.
- This approach yields contrastive pairs that capture meaningful quality differences while preserving instruction-following consistency.
- The constitution used for both models is available [here](./REASONING_CONSTITUTION.md).

**Harmful Subset:**

- Preferred responses were generated using safety-oriented constitutions applied to [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct).
- Rejected samples consist of compliant (and therefore undesired) harmful completions generated by an [abliterated version of Qwen2.5-32B-Instruct](https://huggingface.co/huihui-ai/Qwen2.5-32B-Instruct-abliterated). An abliterated model is a variant known to comply with harmful instructions, providing realistic negative examples.
- Two distinct constitutions were used to steer the preferred and rejected responses, both available in the hyperlinks below:
  - Preferred responses constitution: [here](./HARMLESS_CONSTITUTION.md)
  - Rejected responses constitution: [here](./HARMFULL_CONSTITUTION.md)

### Annotations

#### Quality Assessment

To assess the quality and alignment of the generated preference pairs, we performed an automatic evaluation using an LLM-as-a-judge framework. We designed a custom evaluation prompt instructing Qwen2.5-32B-Instruct to assess each `chosen`–`rejected` pair with respect to their corresponding constitutions. Specifically, the model compared responses based on (i) adherence to constitutional principles, (ii) helpfulness, and (iii) reasoning quality.
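
The evaluation prompt itself is not reproduced in this card. As a rough illustration of how one preference pair could be turned into a judge request covering the three criteria above, here is a minimal sketch. The prompt wording, the A/B labelling, and the helper names `build_judge_messages` and `chosen_preferred` are all assumptions made for this example, not the prompt or code used to build the dataset.

```python
from typing import Dict, List

# The three comparison criteria named in the Quality Assessment text.
CRITERIA = (
    "adherence to the constitutional principles",
    "helpfulness",
    "reasoning quality",
)


def last_content(messages: List[Dict[str, str]]) -> str:
    """Return the text of the final message in a chat-style message list."""
    return messages[-1]["content"]


def build_judge_messages(record: dict, constitution: str) -> List[Dict[str, str]]:
    """Format one preference pair as a chat request for a judge model.

    Hypothetical wording: the chosen response is shown as A, the rejected as B.
    """
    criteria = "\n".join(f"({i}) {c}" for i, c in enumerate(CRITERIA, start=1))
    user_text = (
        f"Constitution:\n{constitution}\n\n"
        f"Prompt:\n{last_content(record['prompt'])}\n\n"
        f"Response A:\n{last_content(record['chosen'])}\n\n"
        f"Response B:\n{last_content(record['rejected'])}\n\n"
        f"Compare the two responses on:\n{criteria}\n"
        "Answer with a single letter, A or B."
    )
    return [
        {"role": "system", "content": "You are an impartial evaluator."},
        {"role": "user", "content": user_text},
    ]


def chosen_preferred(verdict: str) -> bool:
    """Interpret the judge's reply; True when it picked response A (the chosen one)."""
    return verdict.strip().upper().startswith("A")


# Example record in the same shape as the Data Fields example.
record = {
    "prompt": [{"role": "user", "content": "Qual a capital do Brasil?"}],
    "chosen": [{"role": "assistant", "content": "A capital do Brasil é Brasília."}],
    "rejected": [{"role": "assistant", "content": "A capital do Brasil é Rio de Janeiro."}],
}
messages = build_judge_messages(record, constitution="(constitution text goes here)")
```

The resulting `messages` list can be sent to any chat-completion endpoint. Always presenting the chosen response first can bias a judge toward position A, so a more careful setup would probably randomize the order and map the verdict back.
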
To control computational cost, we evaluated a random subset of 500 preference pairs from the filtered dataset. Across all evaluated samples, the `chosen` responses were consistently rated as preferable to their corresponding `rejected` counterparts, providing quantitative evidence of the dataset’s overall quality and alignment. In addition, we conducted a manual inspection of 100 randomly selected filtered interactions to qualitatively examine the effectiveness of our generation and filtering pipeline. Together, these automatic and human evaluations increase our confidence in the reliability and alignment of the resulting preference pairs.

**Note.** Human judgments are inherently subjective and may vary across annotators and use cases. While we have taken multiple steps to ensure dataset quality, we encourage users to perform independent evaluations and consider the specific requirements of their downstream applications when using GigaVerbo-v2 Preferences.

### Personal and Sensitive Information

Since GigaVerbo-v2 Preferences consists of synthetically generated preference pairs based on curated prompts, the dataset contains minimal personal and sensitive information. However, users should be aware that:

- The harmful subset intentionally includes examples of unsafe or malicious requests to enable safety training.
- Generated responses may contain hypothetical scenarios or examples that reference sensitive topics.
- The dataset is provided with the explicit purpose of training models to refuse such requests.
- Users should handle the dataset responsibly and ensure it is used only for legitimate safety research.

## Considerations for Using the Data

### Social Impact of Dataset

The creation of a Portuguese preference dataset for direct preference optimization has significant potential to advance the development of aligned, helpful, and safe language models for Portuguese speakers.
By providing both quality-focused and safety-focused preference pairs, this dataset enables developers to create models that balance helpfulness with robust safety guarantees.

**Given the inclusion of potentially dangerous prompts and non-aligned outputs, the dataset must be handled with care. The harmful subset is provided solely for safety research, enabling developers to train models that refuse malicious instructions without degrading their helpfulness on benign queries.**

## Additional Information

### Dataset Maintainers

- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de)
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com)
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de)

### Licensing Information

GigaVerbo-v2 Preferences is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```bibtex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
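
As a closing usage note on the Subsets and Splits section, which ships only a "train" split and leaves validation and test splits to the user: one possible way to carve out a held-out set is to assign each pair to a split from a hash of its prompt, so the assignment stays stable across runs and subsets. This is a sketch under assumptions, not part of the dataset tooling. The function name `assign_split`, the 5% default fraction, and the choice of SHA-256 are all illustrative.

```python
import hashlib
import json


def assign_split(record: dict, validation_fraction: float = 0.05) -> str:
    """Map a preference pair to 'train' or 'validation' from a hash of its prompt.

    The same prompt always lands in the same split, regardless of row order.
    """
    key = json.dumps(record["prompt"], sort_keys=True, ensure_ascii=False).encode("utf-8")
    # Interpret the first 8 bytes of the digest as a number in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


# Example record in the same shape as the Data Fields example.
example = {
    "prompt": [{"role": "user", "content": "Qual a capital do Brasil?"}],
    "chosen": [{"role": "assistant", "content": "A capital do Brasil é Brasília."}],
    "rejected": [{"role": "assistant", "content": "A capital do Brasil é Rio de Janeiro."}],
}
split_name = assign_split(example)
```

With a loaded `datasets.Dataset`, the same function could be passed to `filter`, for instance `ds.filter(lambda r: assign_split(r) == "validation")`. The library's built-in `train_test_split` method is a simpler alternative when prompt-level stability is not needed.
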
