duarteocarmo/fineweb2-bagaco2
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/duarteocarmo/fineweb2-bagaco2
Description:
---
pretty_name: Bagaço2 🍷🇵🇹
language:
- pt
- por
license: mit
task_categories:
- text-classification
task_ids:
- language-modeling
- topic-classification
- text-scoring
source_datasets:
- uonlp/CulturaX
tags:
- portuguese
- pt-pt
- european-portuguese
- web-corpus
- culturax
- classification
- education
- category-classification
configs:
- config_name: sample
  data_files: fineweb2-ptpt-prototype/000_00000.parquet
  default: true
- config_name: all
  data_files: fineweb2-ptpt-prototype/*.parquet
dataset_info:
  features:
  - name: text
    dtype: large_string
  - name: id
    dtype: large_string
  - name: dump
    dtype: large_string
  - name: url
    dtype: large_string
  - name: date
    dtype: large_string
  - name: file_path
    dtype: large_string
  - name: language
    dtype: large_string
  - name: language_score
    dtype: float64
  - name: language_script
    dtype: large_string
  - name: minhash_cluster_size
    dtype: int64
  - name: top_langs
    dtype: large_string
  - name: ptpt_score
    dtype: float64
  - name: educational_score
    dtype: int8
  - name: category
    dtype: large_string
size_categories:
- 10M<n<100M
---
# Bagaço2 🍷🇵🇹
A pretraining dataset for European Portuguese.
Methodology:
- Takes the Portuguese split of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
- Uses the [duarteocarmo/fasttext-euptvid](https://huggingface.co/duarteocarmo/fasttext-euptvid) classifier to keep only PT-PT (European Portuguese) docs
- Adds an educational score and a content category to each row by running two additional classifiers
**33M** documents in **460 parquet shards** (~ **37 GB**)
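The front matter defines two configs: `sample` (a single shard, the default) and `all`. A minimal sketch of streaming the sample and keeping rows by `educational_score` and `category` follows. The helper names, the default thresholds, and the `train` split name are assumptions for illustration, not something the card specifies.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "duarteocarmo/fineweb2-bagaco2"


def is_wanted(row: Dict[str, Any], min_edu: int = 2, category: str = "Science") -> bool:
    """Keep rows at or above an educational score and in one category."""
    return row["educational_score"] >= min_edu and row["category"] == category


def select(rows: Iterable[Dict[str, Any]], **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Lazily yield the rows that pass is_wanted."""
    return (row for row in rows if is_wanted(row, **kwargs))


def stream_sample() -> Iterable[Dict[str, Any]]:
    """Stream the `sample` config from the Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # lazy import so the helpers above work offline

    # split="train" is assumed: it is the usual default when no split is declared.
    return load_dataset(REPO_ID, "sample", split="train", streaming=True)
```

With network access, `for row in select(stream_sample(), min_edu=3, category="Health"): ...` would iterate over matching documents without downloading every shard.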
## Classifiers
### Statistics & Counts
- CulturaX filter using datatrove:
```
"stats": {
  "total": 199737979,
  "dropped": 166679183,
  "forwarded": 33058796,
  "doc_len": {
    "total": 90594584215,
    "n": 33058796,
    "mean": 2740.4078543876713,
    "variance": 36957836.43497764,
    "std_dev": 6079.295718664921,
    "min": 209,
    "max": 668957
  }
}
```
I.e., about 33M of the ~200M input docs (roughly 16.6%) passed the PT-PT test.
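The figures in the stats block are internally consistent, which can be checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are mine, and the values are copied from the datatrove output above.

```python
import math

# Values copied from the datatrove stats above.
total, dropped, forwarded = 199_737_979, 166_679_183, 33_058_796
doc_len_total, doc_len_variance = 90_594_584_215, 36957836.43497764

# Every input document is either dropped or forwarded.
assert dropped + forwarded == total

keep_rate = forwarded / total             # share of documents that passed the filter
mean_len = doc_len_total / forwarded      # should match doc_len.mean
std_dev = math.sqrt(doc_len_variance)     # should match doc_len.std_dev

print(f"kept {keep_rate:.2%} of documents")
print(f"mean length {mean_len:.1f}, std dev {std_dev:.1f}")
```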
(The counts below are approximate; HF was rate-limiting me.)
| Educational score | Count | Percentage |
|---|---:|---:|
| 0 | ~14.5M | 45.20% |
| 1 | ~11.0M | 34.34% |
| 2 | ~4.4M | 13.70% |
| 3 | ~1.9M | 6.04% |
| 4 | ~0.2M | 0.72% |

| Category | Count | Percentage |
|---|---:|---:|
| Lifestyle | ~5.9M | 18.46% |
| News | ~5.0M | 15.47% |
| Business | ~4.7M | 14.77% |
| Society | ~4.0M | 12.51% |
| Arts | ~3.8M | 11.84% |
| Sports | ~3.0M | 9.33% |
| Health | ~2.4M | 7.52% |
| Games | ~2.0M | 6.27% |
| Science | ~1.2M | 3.83% |
### Portuguese variety filter
Filtering was done with [duarteocarmo/fasttext-euptvid](https://huggingface.co/duarteocarmo/fasttext-euptvid), a fastText classifier for Portuguese variety identification.
- **Task**: PT-PT vs PT-BR classification
- **Kept label**: `__label__PT_PT`
- **Threshold**: `0.7`
- **Model file**: `model_quantized.ftz` (quantized version)
[See model card](https://huggingface.co/duarteocarmo/fasttext-euptvid).
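The keep rule described above (top label `__label__PT_PT` with probability at the `0.7` threshold) can be sketched as follows. Whether the original pipeline compared with `>=` or `>` is not stated, so `>=` is an assumption, as are the function names and the idea that only the top prediction is inspected.

```python
from typing import Sequence, Tuple

KEEP_LABEL = "__label__PT_PT"
THRESHOLD = 0.7


def keep_document(labels: Sequence[str], probs: Sequence[float],
                  threshold: float = THRESHOLD) -> bool:
    """Return True when the top prediction is PT-PT with enough confidence.

    Assumption: the comparison is `>=`; the card only gives the threshold value.
    """
    return bool(labels) and labels[0] == KEEP_LABEL and probs[0] >= threshold


def predict_variety(model, text: str) -> Tuple[Sequence[str], Sequence[float]]:
    """Run a loaded fastText model on one document and return (labels, probs)."""
    # fastText predicts one line at a time, so newlines are replaced first.
    return model.predict(text.replace("\n", " "), k=1)


def load_model(model_path: str):
    """Load `model_quantized.ftz` from a local path (requires the `fasttext` package)."""
    import fasttext  # lazy import: only needed when a model file is available

    return fasttext.load_model(model_path)
```

After downloading the model file, `keep_document(*predict_variety(load_model(path), text))` would give the keep/drop decision for one document.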
### Educational score classifier
Each document is assigned an integer educational quality score, stored in the `educational_score` column. Scores 0 to 4 appear in the statistics above.
**Reference data**
- **Labeled samples**: 30,000
- **Labeling model**: Qwen3 235B A22B
- **Reference file**: `classification/bagaco_reference_educational_score_0to5_qwen3_235b_30000.parquet`
**Classifier**
- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (`train=24,000`, `test=6,000`)
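The setup listed above (E5 embeddings fed to a balanced logistic regression, evaluated on a 20% held-out split) might look roughly like this. It is a sketch, not the author's training script: the `query: ` input prefix, the stratified split, the random seed, and `max_iter` are all assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence

E5_MODEL = "intfloat/multilingual-e5-small"


def with_e5_prefix(texts: Sequence[str], prefix: str = "query: ") -> List[str]:
    """Prepend an E5 input prefix; whether the author used one is an assumption."""
    return [prefix + text for text in texts]


def train_score_classifier(texts: Sequence[str], scores: Sequence[int]):
    """Embed texts, fit a balanced logistic regression, return (model, report)."""
    # Lazy imports: sentence-transformers and scikit-learn are only needed here.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    encoder = SentenceTransformer(E5_MODEL)
    features = encoder.encode(with_e5_prefix(texts), normalize_embeddings=True)

    # 20% held-out split as in the card; stratification and the seed are assumptions.
    labels = list(scores)
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=0, stratify=labels
    )
    # C and class_weight come from the card; max_iter is an assumption.
    model = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
    model.fit(x_train, y_train)
    report = classification_report(y_test, model.predict(x_test), zero_division=0)
    return model, report
```

`class_weight='balanced'` reweights classes inversely to their frequency, which tends to trade precision for recall on rare scores.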
#### Validation report
| Score | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.66 | 0.82 | 0.73 | 2,056 |
| 1 | 0.83 | 0.54 | 0.65 | 3,438 |
| 2 | 0.23 | 0.51 | 0.31 | 369 |
| 3 | 0.21 | 0.60 | 0.31 | 131 |
| 4 | 0.00 | 0.00 | 0.00 | 6 |
| **Accuracy** | | | **0.63** | **6,000** |
| **Macro Avg** | **0.39** | **0.49** | **0.40** | **6,000** |
| **Weighted Avg** | **0.72** | **0.63** | **0.65** | **6,000** |
Confusion matrix (rows are true scores 0 to 4, columns are predicted scores):
```text
[[1679 319 40 15 3]
[ 846 1840 564 172 16]
[ 7 65 189 101 7]
[ 1 3 40 78 9]
[ 0 0 3 3 0]]
```
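Reading rows as true scores and columns as predictions (the row sums match the Support column), the validation report can be recomputed from the matrix. A small sketch, with function names of my own choosing:

```python
from typing import Dict, List

# Confusion matrix copied from above: rows are true scores, columns are predictions.
CONFUSION = [
    [1679, 319, 40, 15, 3],
    [846, 1840, 564, 172, 16],
    [7, 65, 189, 101, 7],
    [1, 3, 40, 78, 9],
    [0, 0, 3, 3, 0],
]


def per_class_metrics(matrix: List[List[int]]) -> List[Dict[str, float]]:
    """Compute precision, recall, F1 and support for each class."""
    metrics = []
    for k, row in enumerate(matrix):
        true_pos = row[k]
        support = sum(row)                     # documents whose true score is k
        predicted = sum(r[k] for r in matrix)  # documents predicted as score k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": support}
        )
    return metrics


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of documents on the diagonal."""
    return sum(matrix[k][k] for k in range(len(matrix))) / sum(map(sum, matrix))


for score, m in enumerate(per_class_metrics(CONFUSION)):
    print(score, f"{m['precision']:.2f} {m['recall']:.2f} {m['f1']:.2f}", m["support"])
print(f"accuracy {accuracy(CONFUSION):.2f}")
```

Score 4 has only 6 validation examples and none is predicted correctly, so its metrics say little about how well that class is handled.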
### Category classifier
Each document is classified into one of 9 categories: `Society`, `Arts`, `Business`, `Science`, `Sports`, `Lifestyle`, `Health`, `Games`, and `News`.
**Reference data**
- **Labeled samples**: 3,500
- **Labeling model**: Gemini 2.5 Flash Lite
- **Reference file**: `classification/bagaco_reference_category_9class_gemini25flashlite_3500.parquet`
**Classifier**
- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (`train=2,800`, `test=700`)
#### Validation report
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Arts | 0.60 | 0.83 | 0.70 | 59 |
| Business | 0.81 | 0.78 | 0.79 | 131 |
| Games | 0.75 | 0.91 | 0.82 | 23 |
| Health | 0.77 | 0.87 | 0.81 | 53 |
| Lifestyle | 0.81 | 0.75 | 0.78 | 111 |
| News | 0.80 | 0.71 | 0.75 | 131 |
| Science | 0.42 | 0.87 | 0.57 | 15 |
| Society | 0.72 | 0.57 | 0.64 | 101 |
| Sports | 0.89 | 0.87 | 0.88 | 76 |
| **Accuracy** | | | **0.76** | **700** |
| **Macro Avg** | **0.73** | **0.80** | **0.75** | **700** |
| **Weighted Avg** | **0.77** | **0.76** | **0.76** | **700** |
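The two average rows differ only in how classes are weighted: the macro average treats every category equally, while the weighted average scales each category by its support. A short sketch recomputing both from the rounded per-category values in the table (so results match the report only to two decimals):

```python
from typing import Dict, Tuple

# (precision, recall, f1, support) per category, copied from the table above.
REPORT: Dict[str, Tuple[float, float, float, int]] = {
    "Arts": (0.60, 0.83, 0.70, 59),
    "Business": (0.81, 0.78, 0.79, 131),
    "Games": (0.75, 0.91, 0.82, 23),
    "Health": (0.77, 0.87, 0.81, 53),
    "Lifestyle": (0.81, 0.75, 0.78, 111),
    "News": (0.80, 0.71, 0.75, 131),
    "Science": (0.42, 0.87, 0.57, 15),
    "Society": (0.72, 0.57, 0.64, 101),
    "Sports": (0.89, 0.87, 0.88, 76),
}


def macro_avg(report: Dict[str, Tuple[float, float, float, int]], idx: int) -> float:
    """Unweighted mean of one metric column across categories."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted_avg(report: Dict[str, Tuple[float, float, float, int]], idx: int) -> float:
    """Mean of one metric column, weighted by each category's support."""
    total = sum(row[3] for row in report.values())
    return sum(row[idx] * row[3] for row in report.values()) / total


for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(name, f"macro={macro_avg(REPORT, idx):.2f}", f"weighted={weighted_avg(REPORT, idx):.2f}")
```

Macro recall (0.80) is higher than accuracy (0.76) because small categories such as Science and Games have high recall, while the larger Society and News categories are recalled less often.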
Provided by:
duarteocarmo



