Polygl0t/portuguese-toxicity-qwen-annotations

Name: Polygl0t/portuguese-toxicity-qwen-annotations
Creator: Polygl0t
Published: 2026-03-05 08:55:46
License: Apache-2.0

Source: Hugging Face. Updated: 2026-03-05. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Polygl0t/portuguese-toxicity-qwen-annotations

Description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: score
    dtype: int64
  - name: subset
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 1296242732
    num_examples: 700000
  download_size: 795304270
  dataset_size: 1296242732
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-classification
language:
- pt
tags:
- toxicity
- portuguese
pretty_name: Portuguese Toxicity Annotations
size_categories:
- 100K<n<1M
---

# Annotations for the Portuguese-Toxicity classifier 📚

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Subsets and Splits](#subsets-and-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Annotation Process](#annotation-process)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
- [Additional Information](#additional-information)
  - [Dataset Maintainers](#dataset-maintainers)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Acknowledgments](#acknowledgments)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/Polygl0t/portuguese-toxicity-qwen-annotations
- **Repository:** https://huggingface.co/datasets/Polygl0t/portuguese-toxicity-qwen-annotations
- **Point of Contact:** [Polygl0t](mailto:kluge@uni-bonn.de)

### Dataset Summary

This dataset contains the annotations used to train two toxicity classifiers: [Polygl0t/portuguese-bertabaporu-large-toxicity-classifier](https://huggingface.co/Polygl0t/portuguese-bertabaporu-large-toxicity-classifier) and [Polygl0t/portuguese-bertimbau-toxicity-classifier](https://huggingface.co/Polygl0t/portuguese-bertimbau-toxicity-classifier). The annotations were generated by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

### Supported Tasks and Leaderboards

This dataset can be used for text classification, specifically toxicity detection in Portuguese text.

### Languages

Portuguese.

## Dataset Structure

### Data Instances

```json
{
  "text": "Amostra de texto em português para avaliação.",
  "source": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-30/index.html",
  "subset": "CC-MAIN-2025-30",
  "id": "a1b2c3d4e5f67890123456789abcdef",
  "score": 2
}
```

### Data Fields

- **id:** a unique identifier for each sample (md5 hash).
- **text:** a string of text in Portuguese.
- **source:** the source where that string originated.
- **subset:** a short string indicating the name of the subset (referring to the original dataset or crawl).
- **score:** the score assigned by [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct).

### Subsets and Splits

This dataset contains a single split, `train`, which includes all 700,000 samples.

```python
from datasets import load_dataset

# Load the main dataset
ds = load_dataset("Polygl0t/portuguese-toxicity-qwen-annotations", split="train")

# If you don't want to download the entire dataset, set streaming to `True`
ds = load_dataset("Polygl0t/portuguese-toxicity-qwen-annotations", split="train", streaming=True)
```

## Dataset Creation

### Source Data

All data was sourced from [Polygl0t/gigaverbo-v2](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2). The **id** field corresponds to the md5 hash of the text.

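
The relationship between **id** and **text** can be checked with a short sketch. It assumes the hash is computed over the UTF-8 encoded text and stored as a hexadecimal digest; the card states neither detail, so treat both as assumptions. The helpers `md5_id` and `id_matches` and the sample record are hypothetical and built locally, not taken from the dataset.

```python
import hashlib


def md5_id(text: str) -> str:
    """Return the hex MD5 digest of `text` (assumption: UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def id_matches(record: dict) -> bool:
    """Check whether a record's `id` equals the MD5 digest of its `text`."""
    return record["id"] == md5_id(record["text"])


# Hypothetical record constructed for illustration only.
sample_text = "Amostra de texto em português para avaliação."
record = {"id": md5_id(sample_text), "text": sample_text, "score": 1}

print(id_matches(record))  # True by construction
```

If real records fail this check, the most likely cause is that one of the assumptions above (encoding or digest format) does not match how the ids were actually produced.
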
### Annotations

The dataset was created to provide high-quality toxicity annotations for Portuguese text, using Qwen2.5 to generate annotations consistently and at scale. The score distribution is as follows:

| Toxicity Score | N samples |
|----------------|-----------|
| 1              | 605937    |
| 2              | 50381     |
| 3              | 26695     |
| 4              | 11198     |
| 5              | 5789      |

#### Annotation Process

The initial annotation was performed on a random sample of 700,000 documents from [Polygl0t/gigaverbo-v2](https://huggingface.co/datasets/Polygl0t/gigaverbo-v2), sampled as evenly as possible across the subsets that form this corpus. For infrastructure, we used 4 x NVIDIA A100-SXM4-80GB GPUs and the vLLM library as our inference engine, with 4-fold tensor parallelism. Below is the prompt used for annotation:

```text
Você é um filtro de toxicidade para a língua portuguesa. Sua tarefa é avaliar, em uma escala de 1 a 5, o quão tóxico (ou seja, nocivo, ofensivo ou inapropriado) é um determinado trecho de texto. Você DEVE fornecer sua pontuação em formato JSON.

### Critérios de Avaliação:
- Dê 1 se o texto não for tóxico, for neutro e apropriado para todos os públicos. O texto não contém conteúdo ofensivo, prejudicial ou inflamatório.
- Dê 2 se o texto apresentar toxicidade leve, como linguagem levemente inapropriada, leve grosseria ou sentimento negativo sutil, mas sem ser explicitamente ofensivo ou prejudicial.
- Dê 3 se o texto contiver toxicidade moderada, incluindo alguns insultos, linguagem ofensiva ou frases inadequadas, mas sem ser altamente agressivo ou gravemente nocivo.
- Dê 4 se o texto for altamente tóxico, contendo discurso de ódio explícito, ameaças diretas, insultos severos ou estereótipos prejudiciais. Pode também incluir linguagem depreciativa ou agressiva direcionada a indivíduos ou grupos.
- Dê 5 se o texto for extremamente tóxico, exibindo o mais alto nível de conteúdo nocivo, como incitação explícita à violência, discriminação grave, ameaças ou linguagem extremamente ofensiva. Essa pontuação é reservada para os textos mais perigosos e abusivos.

### Requisitos de Saída:
- Sua saída deve ser um objeto JSON válido.
- O JSON deve conter apenas um par chave-valor.
- A chave deve ser 'score'.
- O valor deve ser um número inteiro entre 1 e 5.
- O valor deve representar fielmente os critérios acima.

### Exemplo de Saída:
{'score': 2}

Certifique-se de que suas respostas sigam estritamente este formato e avaliem o texto com precisão com base nos critérios fornecidos.

Abaixo está um trecho retirado de uma página da web. Avalie se o trecho contém linguagem nociva, ofensiva ou inapropriada. Atribua uma pontuação entre 1 e 5. Certifique-se de que sua pontuação reflita o nível de toxicidade presente no texto. O texto será fornecido em português.

Texto: <<O texto a ser avaliado>>

Forneça sua pontuação como um número inteiro entre 1 e 5. Responda apenas em formato JSON.
```

## Considerations for Using the Data

### Social Impact of Dataset

The dataset aims to provide a valuable resource for understanding and mitigating toxic language in Portuguese text. By offering high-quality annotations, it can help researchers and developers build more effective moderation tools, promote healthier online interactions, and contribute to the broader field of NLP for low-resource languages. However, this dataset inevitably contains samples that may be offensive or harmful.

## Additional Information

### Dataset Maintainers

- [Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de)
- [Shiza Fatimah](mailto:shizafatimah15@gmail.com)
- [Aniket Sen](mailto:sen@hiskp.uni-bonn.de)

### Licensing Information

The dataset is licensed under the [Apache-2.0 License](LICENSE).

### Citation Information

```latex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

### Contributions

If you want to contribute, contact us at [polyglot@uni-bonn.de](mailto:polyglot@uni-bonn.de)!
