
safety-aya/fineweb-portuguese-100k

Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/safety-aya/fineweb-portuguese-100k
Resource overview:
---
language:
- pt
license: apache-2.0
task_categories:
- text-classification
tags:
- safety
- content-moderation
- portuguese
- fineweb2
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: language_script
    dtype: string
  - name: minhash_cluster_size
    dtype: int64
  - name: top_langs
    dtype: string
  - name: classification
    struct:
    - name: safety_rating
      dtype: string
    - name: category
      sequence: string
    - name: reason
      dtype: string
  splits:
  - name: train
    num_examples: 100000
configs:
- config_name: default
  data_files:
  - split: train
    path: fineweb2_por_100k_classified.jsonl
---

# FineWeb2 Portuguese 100k - Safety Classified

A 100,000-sample subset of [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb2) Portuguese web text, classified for content safety using [Cohere Command A](https://docs.cohere.com/docs/command-a).

## Dataset Description

Each record contains the original FineWeb2 text and metadata, plus a `classification` field with:

| Field | Description |
|-------|-------------|
| `safety_rating` | `"safe"` or `"unsafe"` |
| `category` | List of applicable harm categories (null if safe) |
| `reason` | Brief explanation of the classification |

## Safety Taxonomy

The following 22-category taxonomy was used for classification:

| Code | Category |
|------|----------|
| S1 | Violence |
| S2 | Sexual |
| S3 | Criminal Planning/Confessions |
| S4 | Guns and Illegal Weapons |
| S5 | Controlled/Regulated Substances |
| S6 | Suicide and Self Harm |
| S7 | Sexual (minor) |
| S8 | Hate/Identity Hate |
| S9 | PII/Privacy |
| S10 | Harassment |
| S11 | Threat |
| S12 | Profanity |
| S13 | Needs Caution |
| S14 | Manipulation |
| S15 | Fraud/Deception |
| S16 | Malware |
| S17 | High Risk Gov Decision Making |
| S18 | Political/Misinformation/Conspiracy |
| S19 | Copyright/Trademark/Plagiarism |
| S20 | Unauthorized Advice |
| S21 | Illegal Activity |
| S22 | Immoral/Unethical |

## Statistics

- **Total records**: 100,000
- **Safe**: 88,060 (88.06%)
- **Unsafe**: 11,940 (11.94%)

### Top Unsafe Categories

| Category | Count |
|----------|-------|
| Criminal Planning/Confessions (S3) | 3,744 |
| Violence (S1) | 3,621 |
| Illegal Activity (S21) | 2,845 |
| Sexual (S2) | 2,293 |
| Profanity (S12) | 1,732 |
| Harassment (S10) | 1,680 |
| Threat (S11) | 1,612 |
| Political/Misinformation/Conspiracy (S18) | 1,329 |
| Guns and Illegal Weapons (S4) | 1,144 |
| Controlled/Regulated Substances (S5) | 1,135 |
| Needs Caution (S13) | 1,037 |
| Fraud/Deception (S15) | 993 |
| Hate/Identity Hate (S8) | 793 |
| Immoral/Unethical (S22) | 572 |
| Sexual - minor (S7) | 492 |
| Suicide and Self Harm (S6) | 462 |

## Classification Model

- **Model**: Cohere Command A (`command-a-03-2025`)
- **Method**: Structured JSON output with `response_format: json_object`
- **Text limit**: 800,000 characters per document

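As a rough cross-check of the Statistics section, the rating and category counts could be recomputed from the `classification` field along the following lines. This is a minimal sketch, not the authors' tooling: the `summarize` helper and the toy records are hypothetical, and it assumes `category` holds a list of labels (shown here as codes such as `"S1"`, though the stored values may be full category names) or null for safe records.

```python
from collections import Counter


def summarize(records):
    """Count safety ratings and per-category hits over an iterable of records."""
    ratings = Counter()
    categories = Counter()
    for record in records:
        cls = record["classification"]
        ratings[cls["safety_rating"]] += 1
        # `category` is null (None) for safe records, so fall back to an empty list.
        for cat in cls.get("category") or []:
            categories[cat] += 1
    return ratings, categories


# Hypothetical toy records for illustration; not taken from the dataset.
sample = [
    {"classification": {"safety_rating": "safe", "category": None, "reason": "benign"}},
    {"classification": {"safety_rating": "unsafe", "category": ["S1", "S11"], "reason": "threat of violence"}},
    {"classification": {"safety_rating": "unsafe", "category": ["S1"], "reason": "violent content"}},
]

ratings, categories = summarize(sample)
total = sum(ratings.values())
for rating, count in ratings.most_common():
    print(f"{rating}: {count} ({count / total:.2%})")
print(categories.most_common())
```

Because one record can carry several categories, per-category counts need not add up to the number of unsafe records. This matches the table above, whose counts sum to well over 11,940.
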
## Source

- **Base dataset**: [HuggingFaceFW/fineweb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb2)
- **Language**: Portuguese (`por`)

## Usage

```python
from datasets import load_dataset

# "YOUR_USERNAME/..." is the card's placeholder; replace it with the actual
# repository ID (this listing's ID is safety-aya/fineweb-portuguese-100k).
ds = load_dataset("YOUR_USERNAME/fineweb2-por-100k-safety", split="train")

# Filter safe texts only
safe = ds.filter(lambda x: x["classification"]["safety_rating"] == "safe")

# Filter unsafe texts
unsafe = ds.filter(lambda x: x["classification"]["safety_rating"] == "unsafe")
```

## License

Apache 2.0
Provider:
safety-aya