
duarteocarmo/fineweb2-bagaco2

Hugging Face, updated 2026-04-13, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/duarteocarmo/fineweb2-bagaco2
Description:
---
pretty_name: Bagaço2 🍷🇵🇹
language:
- pt
- por
license: mit
task_categories:
- text-classification
task_ids:
- language-modeling
- topic-classification
- text-scoring
source_datasets:
- uonlp/CulturaX
tags:
- portuguese
- pt-pt
- european-portuguese
- web-corpus
- culturax
- classification
- education
- category-classification
configs:
- config_name: sample
  data_files: fineweb2-ptpt-prototype/000_00000.parquet
  default: true
- config_name: all
  data_files: fineweb2-ptpt-prototype/*.parquet
dataset_info:
  features:
  - name: text
    dtype: large_string
  - name: id
    dtype: large_string
  - name: dump
    dtype: large_string
  - name: url
    dtype: large_string
  - name: date
    dtype: large_string
  - name: file_path
    dtype: large_string
  - name: language
    dtype: large_string
  - name: language_score
    dtype: float64
  - name: language_script
    dtype: large_string
  - name: minhash_cluster_size
    dtype: int64
  - name: top_langs
    dtype: large_string
  - name: ptpt_score
    dtype: float64
  - name: educational_score
    dtype: int8
  - name: category
    dtype: large_string
size_categories:
- 10M<n<100M
---

# Bagaço2 🍷🇵🇹

A pretraining dataset for European Portuguese.

Methodology:

- Takes the Portuguese split of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
- Uses [this](https://huggingface.co/duarteocarmo/fasttext-euptvid) classifier to keep only PT-PT docs
- Adds an educational score and a content category to each row by running two additional classifiers

**33M** documents in **460 parquet shards** (~**37 GB**)

## Classifiers

### Statistics & Counts

- CulturaX filter using datatrove:

```
"stats": {
  "total": 199737979,
  "dropped": 166679183,
  "forwarded": 33058796,
  "doc_len": {
    "total": 90594584215,
    "n": 33058796,
    "mean": 2740.4078543876713,
    "variance": 36957836.43497764,
    "std_dev": 6079.295718664921,
    "min": 209,
    "max": 668957
  }
}
```

That is, 33M docs passed the PT-PT test.

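The datatrove stats above are internally consistent, which a few lines of arithmetic can confirm. The sketch below copies the reported numbers into a plain dict (the variable names are illustrative and not part of datatrove) and re-derives the retention rate, the mean document length, and the standard deviation from them.

```python
import math

# Numbers copied verbatim from the datatrove stats block above.
stats = {
    "total": 199_737_979,
    "dropped": 166_679_183,
    "forwarded": 33_058_796,
    "doc_len": {
        "total": 90_594_584_215,
        "n": 33_058_796,
        "mean": 2740.4078543876713,
        "variance": 36957836.43497764,
        "std_dev": 6079.295718664921,
    },
}

# Every input document should be either dropped or forwarded.
accounted_for = stats["dropped"] + stats["forwarded"]

# Share of the CulturaX Portuguese split kept by the PT-PT filter.
retention = stats["forwarded"] / stats["total"]

# Mean length and standard deviation re-derived from the raw totals.
doc_len = stats["doc_len"]
derived_mean = doc_len["total"] / doc_len["n"]
derived_std = math.sqrt(doc_len["variance"])

print(f"accounted for: {accounted_for:,} of {stats['total']:,}")
print(f"retention:     {retention:.2%}")
print(f"mean length:   {derived_mean:.1f} (reported {doc_len['mean']:.1f})")
print(f"std dev:       {derived_std:.1f} (reported {doc_len['std_dev']:.1f})")
```

Roughly one document in six (about 16.6%) survives the filter, and the standard deviation is more than twice the mean, which points to a long tail of very long documents.
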
(The counts below are approximate; HF was rate-limiting me.)

| Educational score | Count | Percentage |
|---|---:|---:|
| 0 | ~14.5M | 45.20% |
| 1 | ~11.0M | 34.34% |
| 2 | ~4.4M | 13.70% |
| 3 | ~1.9M | 6.04% |
| 4 | ~0.2M | 0.72% |

| Category | Count | Percentage |
|---|---:|---:|
| Lifestyle | ~5.9M | 18.46% |
| News | ~5.0M | 15.47% |
| Business | ~4.7M | 14.77% |
| Society | ~4.0M | 12.51% |
| Arts | ~3.8M | 11.84% |
| Sports | ~3.0M | 9.33% |
| Health | ~2.4M | 7.52% |
| Games | ~2.0M | 6.27% |
| Science | ~1.2M | 3.83% |

### Portuguese variety filter

Filtering was done with [duarteocarmo/fasttext-euptvid](https://huggingface.co/duarteocarmo/fasttext-euptvid), a fastText classifier for Portuguese variety identification.

- **Task**: PT-PT vs PT-BR classification
- **Kept label**: `__label__PT_PT`
- **Threshold**: `0.7`
- **Model file**: `model_quantized.ftz` (quantized version)

[See model card](https://huggingface.co/duarteocarmo/fasttext-euptvid).

### Educational score classifier

Each document is assigned an educational quality score.

**Reference data**

- **Labeled samples**: 30,000
- **Labeling model**: Qwen3 235B A22B
- **Reference file**: `classification/bagaco_reference_educational_score_0to5_qwen3_235b_30000.parquet`

**Classifier**

- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (`train=24,000`, `test=6,000`)

#### Validation report

| Score | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.66 | 0.82 | 0.73 | 2,056 |
| 1 | 0.83 | 0.54 | 0.65 | 3,438 |
| 2 | 0.23 | 0.51 | 0.31 | 369 |
| 3 | 0.21 | 0.60 | 0.31 | 131 |
| 4 | 0.00 | 0.00 | 0.00 | 6 |
| **Accuracy** | | | **0.63** | **6,000** |
| **Macro Avg** | **0.39** | **0.49** | **0.40** | **6,000** |
| **Weighted Avg** | **0.72** | **0.63** | **0.65** | **6,000** |

Confusion matrix:

```text
[[1679  319   40   15    3]
 [ 846 1840  564  172   16]
 [   7   65  189  101    7]
 [   1    3   40   78    9]
 [   0    0    3    3    0]]
```

### Category classifier

Each document is classified into one of 9 categories: `Society`, `Arts`, `Business`, `Science`, `Sports`, `Lifestyle`, `Health`, `Games`, and `News`.

**Reference data**

- **Labeled samples**: 3,500
- **Labeling model**: Gemini 2.5 Flash Lite
- **Reference file**: `classification/bagaco_reference_category_9class_gemini25flashlite_3500.parquet`

**Classifier**

- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (`train=2,800`, `test=700`)

#### Validation report

| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Arts | 0.60 | 0.83 | 0.70 | 59 |
| Business | 0.81 | 0.78 | 0.79 | 131 |
| Games | 0.75 | 0.91 | 0.82 | 23 |
| Health | 0.77 | 0.87 | 0.81 | 53 |
| Lifestyle | 0.81 | 0.75 | 0.78 | 111 |
| News | 0.80 | 0.71 | 0.75 | 131 |
| Science | 0.42 | 0.87 | 0.57 | 15 |
| Society | 0.72 | 0.57 | 0.64 | 101 |
| Sports | 0.89 | 0.87 | 0.88 | 76 |
| **Accuracy** | | | **0.76** | **700** |
| **Macro Avg** | **0.73** | **0.80** | **0.75** | **700** |
| **Weighted Avg** | **0.77** | **0.76** | **0.76** | **700** |
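
As a cross-check on the educational score validation report earlier in this card, its per-class figures can be re-derived from the confusion matrix. The sketch below assumes the scikit-learn convention (rows are true labels, columns are predictions); the card does not state this explicitly, but the row sums match the reported support column. The function name `per_class_metrics` is illustrative.

```python
# Confusion matrix copied from the educational score validation report.
# Assumption: rows are true labels, columns are predicted labels.
CONFUSION = [
    [1679, 319, 40, 15, 3],
    [846, 1840, 564, 172, 16],
    [7, 65, 189, 101, 7],
    [1, 3, 40, 78, 9],
    [0, 0, 3, 3, 0],
]


def per_class_metrics(matrix):
    """Return one (precision, recall, f1, support) tuple per class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted = sum(matrix[row][k] for row in range(n))  # column sum
        support = sum(matrix[k])                             # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1, support))
    return metrics


metrics = per_class_metrics(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

for score, (p, r, f1, support) in enumerate(metrics):
    print(f"score {score}: precision={p:.2f} recall={r:.2f} "
          f"f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f} over {total} samples")
```

Under that convention the derived values round to the figures in the report. Score 4 has only 6 test samples, so its zero precision and recall say little about how the classifier behaves on the full corpus.
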
Provided by:
duarteocarmo