Polygl0t/portuguese-instruct-quality-qwen-annotations

Name: Polygl0t/portuguese-instruct-quality-qwen-annotations
Creator: Polygl0t
Published: 2026-03-05 08:56:11
License: no description provided

Source: Hugging Face. Updated: 2026-03-05. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations

Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: task_type
    dtype: string
  - name: token_count
    dtype: int64
  - name: score
    dtype: int64
  - name: reason
    dtype: string
  splits:
  - name: train
    num_bytes: 1598351703.0
    num_examples: 500000
  download_size: 862151302
  dataset_size: 1598351703.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-classification
language:
- pt
tags:
- instruct
- quality
- portuguese
pretty_name: Portuguese Instruct Quality Annotations
size_categories:
- 100K<n<1M
---

# Annotations for the Portuguese-instruct-quality classifier 📚

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
    - [Annotation Process](#annotation-process)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations
- **Repository:** https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations
- **Point of Contact:** [Polygl0t](mailto:kluge@uni-bonn.de)

### Dataset Summary

This dataset contains the annotations used to train a quality filter for instruction-type data ([Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier) and [Polygl0t/portuguese-qwen3-4b-instruct-quality-judge](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-judge)). These annotations were generated by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

### Supported Tasks and Leaderboards

This dataset can be used for text classification or for supervised fine-tuning.

### Languages

Portuguese.

## Dataset Structure

### Data Instances

```json
{
  "id": "a1b2c3d4e5f67890123456789abcdef",
  "text": "Amostra de texto em português para avaliação.",
  "task_type": "retrieval",
  "token_count": 500,
  "score": 2,
  "reason": "A justificação para a pontuação atribuída pelo modelo Qwen."
}
```

### Data Fields

- **id:** a unique identifier for each sample (md5 hash).
- **text:** a conversation between a user and an assistant, where the assistant's response is to be evaluated.
- **task_type:** the type of task the assistant is performing.
- **token_count:** the number of tokens in the document.
- **score:** the score assigned by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).
- **reason:** the reason for the score assigned by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

### Subsets and Splits

This dataset contains a single split, `train`, which includes all 500,000 samples.

```python
from datasets import load_dataset

# Load the main dataset
ds = load_dataset("Polygl0t/portuguese-instruct-quality-qwen-annotations", split="train")

# If you don't want to download the entire dataset, set streaming to `True`
ds = load_dataset("Polygl0t/portuguese-instruct-quality-qwen-annotations", split="train", streaming=True)
```

## Dataset Creation

### Source Data

All data was sourced from [Polygl0t/gigaverbo-v2-sft](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft).
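Since each record carries a `score` and a `task_type`, one plausible use is to keep only the higher-scored conversations. The sketch below is illustrative rather than part of the original card: `keep_high_quality` is a hypothetical helper, the threshold of 4 is an arbitrary choice, and the toy rows (including the `"summarization"` task type) are invented placeholders. It works on any iterable of dict-like rows, which is also what a streaming split yields.

```python
from typing import Any, Iterable, Iterator, Mapping, Optional, Set


def keep_high_quality(
    rows: Iterable[Mapping[str, Any]],
    min_score: int = 4,
    task_types: Optional[Set[str]] = None,
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose `score` is at least `min_score`.

    If `task_types` is given, rows with any other `task_type` are skipped.
    """
    for row in rows:
        if row["score"] < min_score:
            continue
        if task_types is not None and row["task_type"] not in task_types:
            continue
        yield row


# Toy rows that mimic the dataset schema; the values are placeholders,
# and "summarization" is a hypothetical task type used only for illustration.
sample_rows = [
    {"id": "a", "text": "...", "task_type": "retrieval", "token_count": 500, "score": 2, "reason": "..."},
    {"id": "b", "text": "...", "task_type": "retrieval", "token_count": 120, "score": 5, "reason": "..."},
    {"id": "c", "text": "...", "task_type": "summarization", "token_count": 80, "score": 4, "reason": "..."},
]

kept = list(keep_high_quality(sample_rows, min_score=4))
kept_retrieval = list(keep_high_quality(sample_rows, min_score=4, task_types={"retrieval"}))
```

Because the helper is a generator, it can be wrapped around a streaming split without materializing the full dataset; for a fully downloaded split, the same condition could instead be passed as a predicate to `ds.filter`.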
### Annotations

The dataset was created to provide high-quality annotations for Portuguese instruction-following data. The annotations were generated using the [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) model. The score distribution of this dataset is the following:

| Quality Score | N samples |
| ------------- | --------- |
| 1             | 9404      |
| 2             | 16372     |
| 3             | 39181     |
| 4             | 127465    |
| 5             | 307578    |

#### Annotation Process

The initial annotation was performed on a random sample of 500,000 documents from [Polygl0t/gigaverbo-v2-sft](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft), where we tried to sample equally from all the subsets that form this corpus. In terms of infrastructure, we used 4 x NVIDIA A100-SXM4-80GB GPUs and the vLLM library as our inference engine with 4-fold tensor parallelism. Below is the prompt used for annotation:

```text
Você é um avaliador de respostas de assistente. Avalie, de 1 a 5, o quão bem o assistente seguiu a instrução do usuário em uma interação. Você DEVE responder em JSON.

Cada entrada contém a consulta do usuário (com ou sem prompt de sistema) e a resposta do assistente, geralmente em português (exceto traduções). Pode haver chamadas de ferramentas. Avalie o quão fiel e eficazmente a resposta atende à solicitação.

### Critérios:
1 — Resposta irrelevante, incoerente, imprópria (NSFW), prejudicial, muda de idioma sem motivo, ignora a instrução ou a solicitação é impossível/malformada.
2 — Tentou seguir, mas entendeu mal a intenção principal ou respondeu de forma incompleta/pouco útil.
3 — Cumpriu parcialmente, com erros, omissões ou falta de clareza/profundidade.
4 — Seguiu bem, com pequenas imprecisões ou omissões.
5 — Seguiu total e precisamente; resposta completa, correta e bem formatada.

### Regras adicionais:
- Ignore tags vazias (<think></think>).
- Se houver conteúdo dentro de <think>, avalie apenas o que vem depois.
- Avalie se chamadas de ferramentas foram apropriadas e se a resposta final é coerente.
- Considere se o assistente foi transparente sobre limitações.
- Se houver mudança de idioma, avalie se foi apropriada.

### Formato de saída:
Responda com um JSON válido contendo:
- "score": inteiro de 1 a 5
- "reason": breve justificativa da nota

Exemplo:
{'score': 2, 'reason': 'Resposta parcial que não atende completamente à solicitação.'}

---
O TEXTO A SER AVALIADO SERÁ INSERIDO AQUI.
---

Responda SOMENTE em JSON.
```

## Considerations for Using the Data

### Social Impact of Dataset

The dataset aims to provide a high-quality resource for training models to evaluate instruction-following capabilities in Portuguese. By leveraging a large language model for annotation, we aim to reduce human bias and ensure consistent evaluations. However, users should be aware of potential limitations, such as the inherent biases present in the model used for annotation and the challenges of evaluating nuanced human language.

## Additional Information

### Dataset Maintainers

- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com).
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de).

### Licensing Information

The dataset is licensed under the [Apache-2.0 License](LICENSE).
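A note on the annotation prompt shown under Annotation Process: it asks for valid JSON, yet its own example uses single quotes, which strict JSON parsers reject. The card does not say how the model outputs were parsed, so the following is only a sketch of one tolerant approach. `parse_annotation` is a hypothetical helper, and the fallback to `ast.literal_eval` is an assumption about what a reply imitating the example might need.

```python
import ast
import json
from typing import Optional, Tuple


def parse_annotation(raw: str) -> Optional[Tuple[int, str]]:
    """Parse an annotator reply into (score, reason), or return None if invalid."""
    text = raw.strip()
    try:
        # Preferred path: the reply is strict JSON.
        data = json.loads(text)
    except json.JSONDecodeError:
        try:
            # Fallback: a Python-style dict literal with single quotes,
            # as in the prompt's own example.
            data = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    reason = data.get("reason")
    # Reject booleans (a subclass of int), non-integers, and out-of-range scores.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    if not isinstance(reason, str):
        return None
    return score, reason


strict = parse_annotation('{"score": 5, "reason": "Resposta completa."}')
quoted = parse_annotation("{'score': 2, 'reason': 'Resposta parcial.'}")
broken = parse_annotation("Nota: 4")
out_of_range = parse_annotation('{"score": 7, "reason": "x"}')
```

Replies that fail both parsers or carry a score outside 1 to 5 come back as `None`, so a caller can decide whether to drop or re-annotate them.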
### Citation Information

```latex
@misc{correa2026tucano2cool,
      title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
      author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
      year={2026},
      eprint={2603.03543},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
