Polygl0t/portuguese-instruct-quality-qwen-annotations
Source: Hugging Face · Updated: 2026-03-05 · Indexed: 2026-04-05
Download link: https://hf-mirror.com/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: task_type
    dtype: string
  - name: token_count
    dtype: int64
  - name: score
    dtype: int64
  - name: reason
    dtype: string
  splits:
  - name: train
    num_bytes: 1598351703.0
    num_examples: 500000
  download_size: 862151302
  dataset_size: 1598351703.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-classification
language:
- pt
tags:
- instruct
- quality
- portuguese
pretty_name: Portuguese Instruct Quality Annotations
size_categories:
- 100K<n<1M
---
# Annotations for the Portuguese-instruct-quality classifier 📚
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
    - [Annotation Process](#annotation-process)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations
- **Repository:** https://huggingface.co/datasets/Polygl0t/portuguese-instruct-quality-qwen-annotations
- **Point of Contact:** [Polygl0t](mailto:kluge@uni-bonn.de)
### Dataset Summary
This dataset contains the annotations used to train two quality filters for instruction-type data: [Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier) and [Polygl0t/portuguese-qwen3-4b-instruct-quality-judge](https://huggingface.co/Polygl0t/portuguese-qwen3-4b-instruct-quality-judge).
These annotations were generated by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).
### Supported Tasks and Leaderboards
This dataset can be used for the task of text classification, or for supervised fine-tuning.
### Languages
Portuguese.
## Dataset Structure
### Data Instances
```json
{
  "id": "a1b2c3d4e5f67890123456789abcdef",
  "text": "Amostra de texto em português para avaliação.",
  "task_type": "retrieval",
  "token_count": 500,
  "score": 2,
  "reason": "A justificação para a pontuação atribuída pelo modelo Qwen."
}
```
### Data Fields
- **id:** a unique identifier for each sample (md5 hash).
- **text:** a conversation between a user and an assistant, where the assistant's response is to be evaluated.
- **task_type:** the type of task the assistant is performing.
- **token_count:** the number of tokens in the document.
- **score:** the score (an integer from 1 to 5) assigned by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).
- **reason:** the reason for the score assigned by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).
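The field types declared in the card's metadata (`string` and `int64`) can be checked against a record with a small validator. The sketch below is illustrative only: `EXPECTED_TYPES` and `validate_record` are hypothetical names, not part of the dataset or of any library, and the 1-5 range check is taken from the annotation rubric.

```python
# Expected Python types per field, mapped from the card's metadata
# (string -> str, int64 -> int). Hypothetical helper, not an official schema.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "task_type": str,
    "token_count": int,
    "score": int,
    "reason": str,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{field}: expected {expected.__name__}")
    # Scores follow the 1-5 rubric used in the annotation prompt.
    score = record.get("score")
    if isinstance(score, int) and not isinstance(score, bool) and not 1 <= score <= 5:
        problems.append("score: outside the 1-5 range")
    return problems


example = {
    "id": "a1b2c3d4e5f67890123456789abcdef",
    "text": "Amostra de texto em português para avaliação.",
    "task_type": "retrieval",
    "token_count": 500,
    "score": 2,
    "reason": "A justificação para a pontuação atribuída pelo modelo Qwen.",
}
print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising makes it easy to count or log malformed rows while iterating over a large split.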
### Subsets and Splits
This dataset contains a single split: `train`, which includes all 500,000 samples.
```python
from datasets import load_dataset
# Load the main dataset
ds = load_dataset("Polygl0t/portuguese-instruct-quality-qwen-annotations", split="train")
# If you don't want to download the entire dataset, set streaming to `True`
ds = load_dataset("Polygl0t/portuguese-instruct-quality-qwen-annotations", split="train", streaming=True)
```
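A common next step is to keep only samples above a score threshold. The following is a minimal sketch, assuming a threshold of 4 (an arbitrary choice, not a recommendation from the authors); `keep_high_quality` and `preview_from_hub` are hypothetical helper names. The filter works on any iterable of records, so it applies equally to the streamed dataset and to in-memory lists.

```python
from itertools import islice
from typing import Iterable, Iterator


def keep_high_quality(records: Iterable[dict], min_score: int = 4) -> Iterator[dict]:
    """Yield only the records whose `score` is at least `min_score`."""
    for record in records:
        if record["score"] >= min_score:
            yield record


def preview_from_hub(n: int = 5) -> list[dict]:
    """Stream the dataset and return the first `n` high-scoring records."""
    # Imported lazily: needs the `datasets` package and network access.
    from datasets import load_dataset

    stream = load_dataset(
        "Polygl0t/portuguese-instruct-quality-qwen-annotations",
        split="train",
        streaming=True,
    )
    return list(islice(keep_high_quality(stream), n))


# Local usage on in-memory records (no download required).
sample = [{"score": 5}, {"score": 3}, {"score": 4}, {"score": 1}]
print([r["score"] for r in keep_high_quality(sample)])  # prints [5, 4]
```

Calling `preview_from_hub()` fetches only as many records as needed, because both the stream and the filter are lazy.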
## Dataset Creation
### Source Data
All data was sourced from [Polygl0t/gigaverbo-v2-sft](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft).
### Annotations
The dataset was created to provide high-quality annotations for Portuguese instruction-following data. The annotations were generated using the [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) model.
The score distribution of this dataset is as follows:
| Quality Score | N samples |
| -------------- | --------- |
| 1 | 9404 |
| 2 | 16372 |
| 3 | 39181 |
| 4 | 127465 |
| 5 | 307578 |
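The distribution is heavily skewed toward scores 4 and 5, which matters when training a classifier on these labels. As a quick check, the sketch below recomputes the total, the per-score shares, and the mean score from the counts in the table; the variable names are illustrative.

```python
# Counts copied from the score distribution table above.
score_counts = {1: 9404, 2: 16372, 3: 39181, 4: 127465, 5: 307578}

total = sum(score_counts.values())
shares = {score: count / total for score, count in score_counts.items()}
mean_score = sum(score * count for score, count in score_counts.items()) / total

print(total)  # prints 500000
for score, share in shares.items():
    print(f"score {score}: {share:.2%}")
print(f"mean score: {mean_score:.3f}")
```

Scores 4 and 5 together account for roughly 87% of the samples, so users training on these labels may want to consider class weighting or resampling.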
#### Annotation Process
The initial annotation was performed on a random sample of 500,000 documents from [Polygl0t/gigaverbo-v2-sft](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2-sft), sampled as evenly as possible across the subsets that make up this corpus. For inference, we used 4 x NVIDIA A100-SXM4-80GB GPUs and the vLLM library, running with 4-way tensor parallelism.
Below is the prompt used for annotation:
```text
Você é um avaliador de respostas de assistente. Avalie, de 1 a 5, o quão bem o assistente seguiu a instrução do usuário em uma interação. Você DEVE responder em JSON.
Cada entrada contém a consulta do usuário (com ou sem prompt de sistema) e a resposta do assistente, geralmente em português (exceto traduções). Pode haver chamadas de ferramentas. Avalie o quão fiel e eficazmente a resposta atende à solicitação.
### Critérios:
1 — Resposta irrelevante, incoerente, imprópria (NSFW), prejudicial, muda de idioma sem motivo, ignora a instrução ou a solicitação é impossível/malformada.
2 — Tentou seguir, mas entendeu mal a intenção principal ou respondeu de forma incompleta/pouco útil.
3 — Cumpriu parcialmente, com erros, omissões ou falta de clareza/profundidade.
4 — Seguiu bem, com pequenas imprecisões ou omissões.
5 — Seguiu total e precisamente; resposta completa, correta e bem formatada.
### Regras adicionais:
- Ignore tags vazias (<think></think>).
- Se houver conteúdo dentro de <think>, avalie apenas o que vem depois.
- Avalie se chamadas de ferramentas foram apropriadas e se a resposta final é coerente.
- Considere se o assistente foi transparente sobre limitações.
- Se houver mudança de idioma, avalie se foi apropriada.
### Formato de saída:
Responda com um JSON válido contendo:
- "score": inteiro de 1 a 5
- "reason": breve justificativa da nota
Exemplo:
{'score': 2, 'reason': 'Resposta parcial que não atende completamente à solicitação.'}
---
O TEXTO A SER AVALIADO SERÁ INSERIDO AQUI.
---
Responda SOMENTE em JSON.
```
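The prompt asks for a JSON object with `score` and `reason`, but its own example uses single quotes, which is a Python literal rather than strict JSON. The card does not describe how replies were parsed, so the following is only a sketch of one tolerant approach: try strict JSON first, then fall back to `ast.literal_eval`. The function name `parse_annotation` is hypothetical.

```python
import ast
import json


def parse_annotation(raw: str) -> dict:
    """Parse one model reply into {"score": int, "reason": str}."""
    raw = raw.strip()
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Single-quoted replies are not valid JSON; literal_eval parses
        # Python literals safely, without executing arbitrary code.
        try:
            data = ast.literal_eval(raw)
        except (ValueError, SyntaxError) as exc:
            raise ValueError(f"unparseable reply: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not an object")
    score = data.get("score")
    # Reject booleans explicitly, since bool is a subclass of int.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"invalid score: {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}


print(parse_annotation('{"score": 4, "reason": "ok"}'))
# prints {'score': 4, 'reason': 'ok'}
```

Raising `ValueError` on malformed replies lets a calling loop decide whether to retry the generation or drop the sample.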
## Considerations for Using the Data
### Social Impact of Dataset
The dataset aims to provide a high-quality resource for training models to evaluate instruction-following capabilities in Portuguese. By leveraging a large language model for annotation, we aim to reduce human bias and ensure consistent evaluations. However, users should be aware of potential limitations, such as the inherent biases present in the model used for annotation and the challenges of evaluating nuanced human language.
## Additional Information
### Dataset Maintainers
- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com).
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de).
### Licensing Information
The dataset is licensed under the [Apache-2.0 License](LICENSE).
### Citation Information
```bibtex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```
### Acknowledgments
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.
### Contributions
If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
Provider: Polygl0t



