safety-aya/fineweb-portuguese-100k
Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/safety-aya/fineweb-portuguese-100k
---
language:
- pt
license: apache-2.0
task_categories:
- text-classification
tags:
- safety
- content-moderation
- portuguese
- fineweb2
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: language_script
    dtype: string
  - name: minhash_cluster_size
    dtype: int64
  - name: top_langs
    dtype: string
  - name: classification
    struct:
    - name: safety_rating
      dtype: string
    - name: category
      sequence: string
    - name: reason
      dtype: string
  splits:
  - name: train
    num_examples: 100000
configs:
- config_name: default
  data_files:
  - split: train
    path: fineweb2_por_100k_classified.jsonl
---
# FineWeb2 Portuguese 100k - Safety Classified
A 100,000-sample subset of [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb2) Portuguese web text, classified for content safety using [Cohere Command A](https://docs.cohere.com/docs/command-a).
## Dataset Description
Each record contains the original FineWeb2 text and metadata, plus a `classification` field with:
| Field | Description |
|-------|-------------|
| `safety_rating` | `"safe"` or `"unsafe"` |
| `category` | List of applicable harm categories (null if safe) |
| `reason` | Brief explanation of the classification |
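As a rough illustration of the schema above, the sketch below builds one record by hand and reads its `classification` field. Every value in `example_record` is an invented placeholder rather than real data, and the helper name `categories_of` is hypothetical.

```python
# Hypothetical record mirroring the documented schema.
# All values are made-up placeholders, not taken from the dataset.
example_record = {
    "text": "Exemplo de texto em português.",
    "id": "example-id",
    "classification": {
        "safety_rating": "safe",
        "category": None,  # documented as null when the record is safe
        "reason": "Placeholder explanation.",
    },
}


def categories_of(record):
    """Return the record's categories as a list, mapping null to an empty list."""
    return list(record["classification"]["category"] or [])


rating = example_record["classification"]["safety_rating"]
cats = categories_of(example_record)
```

Normalizing the null `category` to an empty list keeps downstream loops free of `None` checks.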
## Safety Taxonomy
The following 22-category taxonomy was used for classification:
| Code | Category |
|------|----------|
| S1 | Violence |
| S2 | Sexual |
| S3 | Criminal Planning/Confessions |
| S4 | Guns and Illegal Weapons |
| S5 | Controlled/Regulated Substances |
| S6 | Suicide and Self Harm |
| S7 | Sexual (minor) |
| S8 | Hate/Identity Hate |
| S9 | PII/Privacy |
| S10 | Harassment |
| S11 | Threat |
| S12 | Profanity |
| S13 | Needs Caution |
| S14 | Manipulation |
| S15 | Fraud/Deception |
| S16 | Malware |
| S17 | High Risk Gov Decision Making |
| S18 | Political/Misinformation/Conspiracy |
| S19 | Copyright/Trademark/Plagiarism |
| S20 | Unauthorized Advice |
| S21 | Illegal Activity |
| S22 | Immoral/Unethical |
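For programmatic use, the table above can be transcribed into a lookup dictionary. This is only a convenience sketch: the card does not state whether the `category` field stores codes such as `S1` or the full names, so the names `TAXONOMY` and `code_for` are assumptions and the mapping may need adapting to the actual values.

```python
# Code-to-name mapping transcribed from the taxonomy table in this card.
TAXONOMY = {
    "S1": "Violence",
    "S2": "Sexual",
    "S3": "Criminal Planning/Confessions",
    "S4": "Guns and Illegal Weapons",
    "S5": "Controlled/Regulated Substances",
    "S6": "Suicide and Self Harm",
    "S7": "Sexual (minor)",
    "S8": "Hate/Identity Hate",
    "S9": "PII/Privacy",
    "S10": "Harassment",
    "S11": "Threat",
    "S12": "Profanity",
    "S13": "Needs Caution",
    "S14": "Manipulation",
    "S15": "Fraud/Deception",
    "S16": "Malware",
    "S17": "High Risk Gov Decision Making",
    "S18": "Political/Misinformation/Conspiracy",
    "S19": "Copyright/Trademark/Plagiarism",
    "S20": "Unauthorized Advice",
    "S21": "Illegal Activity",
    "S22": "Immoral/Unethical",
}

# Reverse lookup from category name to code.
NAME_TO_CODE = {name: code for code, name in TAXONOMY.items()}


def code_for(name):
    """Return the taxonomy code for a category name, or None if unknown."""
    return NAME_TO_CODE.get(name)
```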
## Statistics
- **Total records**: 100,000
- **Safe**: 88,060 (88.06%)
- **Unsafe**: 11,940 (11.94%)
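The split above could be recomputed from the records with a few lines of standard-library Python. This is a minimal sketch, assuming each record is a dict shaped like the schema in this card; `rating_shares` is a hypothetical helper name.

```python
from collections import Counter


def rating_shares(records):
    """Map each safety_rating to (count, percentage of all records)."""
    counts = Counter(r["classification"]["safety_rating"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        rating: (n, round(100 * n / total, 2))
        for rating, n in counts.items()
    }
```

Iterating over the loaded `train` split yields such dicts, so the same function should work on it directly.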
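### Top Unsafe Categories
A record can carry more than one category, so the counts below sum to more than the 11,940 unsafe records.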
| Category | Count |
|----------|-------|
| Criminal Planning/Confessions (S3) | 3,744 |
| Violence (S1) | 3,621 |
| Illegal Activity (S21) | 2,845 |
| Sexual (S2) | 2,293 |
| Profanity (S12) | 1,732 |
| Harassment (S10) | 1,680 |
| Threat (S11) | 1,612 |
| Political/Misinformation/Conspiracy (S18) | 1,329 |
| Guns and Illegal Weapons (S4) | 1,144 |
| Controlled/Regulated Substances (S5) | 1,135 |
| Needs Caution (S13) | 1,037 |
| Fraud/Deception (S15) | 993 |
| Hate/Identity Hate (S8) | 793 |
| Immoral/Unethical (S22) | 572 |
| Sexual (minor) (S7) | 492 |
| Suicide and Self Harm (S6) | 462 |
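A per-category tally like the one above could be produced as follows. Because `category` is a list, a single unsafe record may add to several rows. This is a sketch under the assumption that records follow the schema in this card; `category_counts` is a hypothetical name, and the label strings in the usage note are placeholders, since the card does not show the exact values stored in `category`.

```python
from collections import Counter


def category_counts(records):
    """Count how many unsafe records mention each category label."""
    counts = Counter()
    for record in records:
        cls = record["classification"]
        if cls["safety_rating"] != "unsafe":
            continue
        # category may be null, so fall back to an empty list
        counts.update(cls["category"] or [])
    return counts
```

Calling `category_counts(records).most_common()` returns the labels sorted by descending count, which matches the ordering of the table.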
## Classification Model
- **Model**: Cohere Command A (`command-a-03-2025`)
- **Method**: Structured JSON output with `response_format: json_object`
- **Text limit**: 800,000 characters per document
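The pre- and post-processing around such a model call might look like the sketch below. It deliberately leaves out the Cohere request and the prompt, neither of which is given in this card. Two assumptions are made and should be checked against the real pipeline: over-long documents are truncated rather than skipped, and the model returns a JSON object with the keys `safety_rating`, `category`, and `reason`. The function names are hypothetical.

```python
import json

MAX_CHARS = 800_000  # text limit stated in this card
ALLOWED_RATINGS = {"safe", "unsafe"}


def truncate_text(text, limit=MAX_CHARS):
    """Cut the document to the character limit (assumed behaviour)."""
    return text[:limit]


def parse_classification(raw):
    """Validate a JSON string from the model and return a classification dict."""
    data = json.loads(raw)
    rating = data.get("safety_rating")
    if rating not in ALLOWED_RATINGS:
        raise ValueError(f"unexpected safety_rating: {rating!r}")
    category = data.get("category")
    if category is not None and not isinstance(category, list):
        raise ValueError("category must be a list or null")
    return {
        "safety_rating": rating,
        "category": category,
        "reason": str(data.get("reason", "")),
    }
```

Validating the rating and the type of `category` up front guards against free-form model output that happens to be valid JSON but does not follow the schema.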
## Source
- **Base dataset**: [HuggingFaceFW/fineweb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb2)
- **Language**: Portuguese (`por`)
## Usage
```python
from datasets import load_dataset

# Repository id as listed at the top of this page
ds = load_dataset("safety-aya/fineweb-portuguese-100k", split="train")

# Keep only records rated safe
safe = ds.filter(lambda x: x["classification"]["safety_rating"] == "safe")

# Keep only records rated unsafe
unsafe = ds.filter(lambda x: x["classification"]["safety_rating"] == "unsafe")
```
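Filtering by a specific harm category follows the same pattern. The sketch below defines a predicate factory that works on plain dicts and can also be passed to `Dataset.filter`. The name `has_category` is hypothetical, and the label `"Violence"` in the usage note is an assumption about how values are stored in `category`.

```python
def has_category(label):
    """Build a predicate that is True for records tagged with the given label."""
    def predicate(record):
        # category is null for safe records, so treat it as an empty list
        categories = record["classification"]["category"] or []
        return label in categories
    return predicate
```

With the dataset loaded as `ds`, a call such as `ds.filter(has_category("Violence"))` would keep only the matching records, provided the stored labels use that spelling.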
## License
Apache 2.0
Provider:
safety-aya



