duarteocarmo/fineweb2-bagaco
收藏Hugging Face2026-02-22 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/duarteocarmo/fineweb2-bagaco
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- por
- pt
license: odc-by
task_categories:
- text-generation
- text-classification
source_datasets:
- HuggingFaceFW/fineweb-2
tags:
- portuguese
- web-corpus
configs:
- config_name: sample
data_files: shard_00000.parquet
default: true
- config_name: all
data_files: shard_*.parquet
dataset_info:
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: language_script
dtype: string
- name: minhash_cluster_size
dtype: int64
- name: top_langs
dtype: string
- name: category
dtype: string
pretty_name: Bagaço
size_categories:
- 10M<n<100M
---
# Bagaço 🍷🇵🇹
Bagaço is a pretraining dataset for European Portuguese. It filters the [Fineweb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) dataset to URLs from Portuguese domains (e.g., `.pt/`). Each document is classified into one of 9 categories and scored for educational quality.
## Filtering
- **Source**: `HuggingFaceFW/fineweb-2`, subset `por_Latn`, split `train`
- **Filter**: URLs containing `.pt/` (Portuguese top-level domain)
## Document classification
Each document is classified into one of 9 categories: Society, Arts, Business, Science, Sports, Lifestyle, Health, Games, News.
**Labeling**:
- **Model**: Gemini 2.5 Flash Lite
- **Labeled samples**: 3,500
- **Prompt**:
```
System: Classify the Portuguese web text into one category.
User: <first 800 chars of document>
Response format: { category: Society | Arts | Business | Science | Sports | Lifestyle | Health | Games | News }
```
**Classifier**:
- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (700 samples)
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Arts | 0.60 | 0.83 | 0.70 | 59 |
| Business | 0.81 | 0.78 | 0.79 | 131 |
| Games | 0.75 | 0.91 | 0.82 | 23 |
| Health | 0.77 | 0.87 | 0.81 | 53 |
| Lifestyle | 0.81 | 0.75 | 0.78 | 111 |
| News | 0.80 | 0.71 | 0.75 | 131 |
| Science | 0.42 | 0.87 | 0.57 | 15 |
| Society | 0.72 | 0.57 | 0.64 | 101 |
| Sports | 0.89 | 0.87 | 0.88 | 76 |
| **Accuracy** | | | **0.76** | **700** |
| **Macro Avg** | **0.73** | **0.80** | **0.75** | **700** |
| **Weighted Avg** | **0.77** | **0.76** | **0.76** | **700** |
## Educational score
Each document is assigned an educational quality score (0-5).
**Labeling**:
- **Model**: Qwen3 235B A22B
- **Labeled samples**: 30,000
- **Prompt** (adapted from [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)):
```
Below is an extract from a web page. Evaluate whether the page has a high educational value
and could be useful in an educational setting for teaching from primary school to grade school
levels using the additive 5-point scoring system described below. The text will be in Portuguese.
Evaluate its educational value based on content quality, not language.
- 1 point: basic information relevant to educational topics, even with ads/promotional material.
- 2 points: addresses elements pertinent to education but doesn't align closely with standards.
- 3 points: appropriate for educational use, introduces key concepts relevant to school curricula.
- 4 points: highly relevant for grade school education, clear writing, substantial content.
- 5 points: outstanding educational value, perfectly suited for primary/grade school teaching.
The extract: <first 1500 chars of document>
After examining the extract, briefly justify your total score (up to 100 words)
and provide the educational score (0-5).
Response format: { justification: str, educational_score: int }
```
**Classifier**:
- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (train=24,000, test=6,000)
| Score | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.66 | 0.82 | 0.73 | 2,056 |
| 1 | 0.83 | 0.53 | 0.65 | 3,438 |
| 2 | 0.23 | 0.51 | 0.31 | 369 |
| 3 | 0.21 | 0.60 | 0.31 | 131 |
| 4 | 0.00 | 0.00 | 0.00 | 6 |
| **Accuracy** | | | **0.63** | **6,000** |
| **Macro Avg** | **0.39** | **0.49** | **0.40** | **6,000** |
| **Weighted Avg** | **0.72** | **0.63** | **0.65** | **6,000** |
```
Confusion Matrix:
[[1679 319 40 15 3]
[ 847 1839 564 172 16]
[ 7 65 189 101 7]
[ 1 3 40 78 9]
[ 0 0 3 3 0]]
```
## Notes
- `references` contains the labeled datasets used to train the classifiers
- `scripts` contains the scripts used to train classifiers, classify documents, and process the dataset
提供机构:
duarteocarmo



