
duarteocarmo/fineweb2-bagaco

Source: Hugging Face. Updated 2026-02-22. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/duarteocarmo/fineweb2-bagaco
Description:
---
language:
- por
- pt
license: odc-by
task_categories:
- text-generation
- text-classification
source_datasets:
- HuggingFaceFW/fineweb-2
tags:
- portuguese
- web-corpus
configs:
- config_name: sample
  data_files: shard_00000.parquet
  default: true
- config_name: all
  data_files: shard_*.parquet
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: language_script
    dtype: string
  - name: minhash_cluster_size
    dtype: int64
  - name: top_langs
    dtype: string
  - name: category
    dtype: string
pretty_name: Bagaço
size_categories:
- 10M<n<100M
---

# Bagaço 🍷🇵🇹

Bagaço is a pretraining dataset for European Portuguese. It filters the [Fineweb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) dataset to URLs from Portuguese domains (e.g., `.pt/`). Each document is classified into one of 9 categories and scored for educational quality.

## Filtering

- **Source**: `HuggingFaceFW/fineweb-2`, subset `por_Latn`, split `train`
- **Filter**: URLs containing `.pt/` (Portuguese top-level domain)

## Document classification

Each document is classified into one of 9 categories: Society, Arts, Business, Science, Sports, Lifestyle, Health, Games, News.

**Labeling**:

- **Model**: Gemini 2.5 Flash Lite
- **Labeled samples**: 3,500
- **Prompt**:

```
System: Classify the Portuguese web text into one category.
User: <first 800 chars of document>
Response format: { category: Society | Arts | Business | Science | Sports | Lifestyle | Health | Games | News }
```

**Classifier**:

- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (700 samples)

| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Arts | 0.60 | 0.83 | 0.70 | 59 |
| Business | 0.81 | 0.78 | 0.79 | 131 |
| Games | 0.75 | 0.91 | 0.82 | 23 |
| Health | 0.77 | 0.87 | 0.81 | 53 |
| Lifestyle | 0.81 | 0.75 | 0.78 | 111 |
| News | 0.80 | 0.71 | 0.75 | 131 |
| Science | 0.42 | 0.87 | 0.57 | 15 |
| Society | 0.72 | 0.57 | 0.64 | 101 |
| Sports | 0.89 | 0.87 | 0.88 | 76 |
| **Accuracy** | | | **0.76** | **700** |
| **Macro Avg** | **0.73** | **0.80** | **0.75** | **700** |
| **Weighted Avg** | **0.77** | **0.76** | **0.76** | **700** |

## Educational score

Each document is assigned an educational quality score (0-5).

**Labeling**:

- **Model**: Qwen3 235B A22B
- **Labeled samples**: 30,000
- **Prompt** (adapted from [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)):

```
Below is an extract from a web page. Evaluate whether the page has a high educational value and could be useful in an educational setting for teaching from primary school to grade school levels using the additive 5-point scoring system described below.

The text will be in Portuguese. Evaluate its educational value based on content quality, not language.

- 1 point: basic information relevant to educational topics, even with ads/promotional material.
- 2 points: addresses elements pertinent to education but doesn't align closely with standards.
- 3 points: appropriate for educational use, introduces key concepts relevant to school curricula.
- 4 points: highly relevant for grade school education, clear writing, substantial content.
- 5 points: outstanding educational value, perfectly suited for primary/grade school teaching.

The extract:
<first 1500 chars of document>

After examining the extract, briefly justify your total score (up to 100 words) and provide the educational score (0-5).

Response format: { justification: str, educational_score: int }
```

**Classifier**:

- **Embeddings**: `intfloat/multilingual-e5-small`
- **Model**: Logistic Regression (`C=1.0`, `class_weight='balanced'`)
- **Validation**: 20% held-out split (train=24,000, test=6,000)

| Score | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.66 | 0.82 | 0.73 | 2,056 |
| 1 | 0.83 | 0.53 | 0.65 | 3,438 |
| 2 | 0.23 | 0.51 | 0.31 | 369 |
| 3 | 0.21 | 0.60 | 0.31 | 131 |
| 4 | 0.00 | 0.00 | 0.00 | 6 |
| **Accuracy** | | | **0.63** | **6,000** |
| **Macro Avg** | **0.39** | **0.49** | **0.40** | **6,000** |
| **Weighted Avg** | **0.72** | **0.63** | **0.65** | **6,000** |

```
Confusion Matrix:
[[1679  319   40   15    3]
 [ 847 1839  564  172   16]
 [   7   65  189  101    7]
 [   1    3   40   78    9]
 [   0    0    3    3    0]]
```

## Notes

- `references` contains the labeled datasets used to train the classifiers
- `scripts` contains the scripts used to train classifiers, classify documents, and process the dataset
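
As a rough illustration of the rule described under Filtering, a substring check on the `url` field might look like the sketch below. It is not the project's own code (that lives in `scripts`); the function name `is_portuguese_url` and the sample URLs are hypothetical and used only for illustration.

```python
def is_portuguese_url(url: str) -> bool:
    """Return True when the URL contains the substring '.pt/'.

    Assumption: the dataset's filter is a plain substring test,
    as the Filtering section suggests.
    """
    return ".pt/" in url


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "https://www.example.pt/noticias/artigo",
    "https://www.example.com.br/artigo",
    "https://example.pt",  # no trailing slash, so no '.pt/' substring
]

kept = [u for u in sample_urls if is_portuguese_url(u)]
print(kept)
```

A plain substring test of this kind keeps any URL that has `.pt/` anywhere in it, including in the path, and skips a bare domain URL without a trailing slash. Whether the real pipeline handles those cases differently is not stated in the card.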
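
The per-score figures in the educational-score table can be recomputed from the confusion matrix reported above. The sketch below assumes rows are true scores and columns are predicted scores (the scikit-learn convention); the card does not state the orientation, but the row sums match the Support column. The helper names `per_class_metrics` and `accuracy` are illustrative.

```python
# Confusion matrix copied from the educational-score validation results.
# Assumption: rows = true score (0-4), columns = predicted score (0-4).
CONFUSION = [
    [1679, 319, 40, 15, 3],
    [847, 1839, 564, 172, 16],
    [7, 65, 189, 101, 7],
    [1, 3, 40, 78, 9],
    [0, 0, 3, 3, 0],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall, support), one tuple per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted = sum(matrix[row][k] for row in range(n))  # column sum
        support = sum(matrix[k])  # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        results.append((precision, recall, support))
    return results


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


metrics = per_class_metrics(CONFUSION)
for score, (p, r, s) in enumerate(metrics):
    print(f"score {score}: precision={p:.2f} recall={r:.2f} support={s}")
print(f"accuracy={accuracy(CONFUSION):.2f}")
```

Under that orientation the recomputed supports are 2,056, 3,438, 369, 131, and 6, and the accuracy rounds to 0.63, in line with the table.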
Provider:
duarteocarmo