
sapinsapin/halohalo

Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sapinsapin/halohalo
Description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: dump
    dtype: string
  - name: file_path
    dtype: string
  - name: detected_lang
    dtype: string
  - name: word_count
    dtype: int64
  - name: title
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: token_count
    dtype: int64
  - name: content_hash
    dtype: string
  - name: crawled_at
    dtype: string
  splits:
  - name: train
    num_bytes: 167232715.0
    num_examples: 41767
  - name: test
    num_bytes: 10732830.0
    num_examples: 3769
  download_size: 73712640
  dataset_size: 177965545.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# halohalo

## Dataset Summary

`halohalo` is a pretraining text corpus for Philippine languages, assembled from web-scraped data. It is compatible with [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) for LLM pretraining.

## Source Data

Derived from the following cleaned datasets:

| Source | Documents |
|---|---|
| `halo-hil` | 8,874 |
| `halo-tgl` | 6,589 |
| `halo-bcl` | 1,264 |

Each source dataset was cleaned using `clean_halo.py` to remove web boilerplate, navigation menus, markdown noise, HTML artifacts, and low-quality documents before being included here.

## Processing

1. **Cleaning** (`clean_halo.py`): strips boilerplate, HTML, and markdown noise; filters out documents with fewer than 30 words or less than 40% Latin characters
2. **FineWeb formatting** (`prep_halohalo.py`): adds `source`, `language`, `token_count`, and `content_hash`; deduplicates against existing documents using MD5 content hashing

Processing code is available at [github.com/sapinsapin/halohalo](https://github.com/sapinsapin/halohalo).
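
The two processing steps above can be sketched roughly as follows. This is a minimal illustration and not the contents of `clean_halo.py` or `prep_halohalo.py`: all function names are hypothetical, and the way the Latin-character ratio is measured (Latin letters over non-whitespace characters) is an assumption, since the card states only the 30-word and 40% thresholds.

```python
import hashlib
import unicodedata

MIN_WORDS = 30          # threshold stated in the dataset card
MIN_LATIN_RATIO = 0.40  # threshold stated in the dataset card


def latin_ratio(text):
    """Share of non-whitespace characters that are Latin letters.

    Assumed definition: the card does not say how the ratio is computed.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    latin = sum(1 for c in chars if unicodedata.name(c, "").startswith("LATIN"))
    return latin / len(chars)


def keep_document(text):
    """Apply the two quality filters named in the Processing section."""
    return len(text.split()) >= MIN_WORDS and latin_ratio(text) >= MIN_LATIN_RATIO


def content_hash(text):
    """MD5 hex digest of the UTF-8 encoded text (encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedupe(docs, seen=None):
    """Drop documents whose hash is already in `seen`; attach `content_hash`."""
    seen = set() if seen is None else seen
    kept = []
    for doc in docs:
        digest = content_hash(doc["text"])
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({**doc, "content_hash": digest})
    return kept
```

Passing a pre-populated `seen` set to `dedupe` is one way to model "deduplicates against existing documents": hashes from earlier batches stay in the set, so later batches are checked against them.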

## Statistics

| Metric | Value |
|---|---|
| Total documents | 16,727 |
| Total tokens | 19,178,582 |
| Avg tokens per document | 1,146.6 |
| Min tokens | 30 |
| Max tokens | 10,552 |

### Languages

| Language | Documents | Word Count |
|---|---|---|
| `hil` | 8,874 | 9,332,784 |
| `tgl` | 6,589 | 8,208,749 |
| `bcl` | 1,264 | 1,637,049 |
| **Total** | **16,727** | **19,178,582** |

## Schema

| Field | Type | Description |
|---|---|---|
| `text` | `str` | Cleaned document text |
| `id` | `str` | Unique document identifier |
| `source` | `str` | Source dataset name |
| `language` | `str` | ISO 639-3 language code |
| `token_count` | `int` | Whitespace-tokenized word count |
| `content_hash` | `str` | MD5 hash of text for deduplication |
| `url` | `str` | Source URL |
| `date` | `str` | Crawl date |
| `dump` | `str` | CommonCrawl dump identifier |
| `title` | `str` | Page title |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("sapinsapin/halohalo")
print(ds["train"][0])
```
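
The per-language figures in the Statistics section can be recomputed from the `text` and `language` fields. The sketch below assumes that `token_count` is a plain `str.split()` count, which is one reading of "whitespace-tokenized"; the helper names are hypothetical and work on any iterable of row dictionaries.

```python
from collections import defaultdict


def whitespace_token_count(text):
    """Count tokens by splitting on runs of whitespace (assumed tokenization)."""
    return len(text.split())


def summarize(rows):
    """Aggregate document and token counts per `language` code."""
    per_lang = defaultdict(lambda: {"documents": 0, "tokens": 0})
    for row in rows:
        stats = per_lang[row["language"]]
        stats["documents"] += 1
        stats["tokens"] += whitespace_token_count(row["text"])
    return dict(per_lang)


def average_tokens(summary):
    """Mean tokens per document across all languages in a summary."""
    documents = sum(s["documents"] for s in summary.values())
    tokens = sum(s["tokens"] for s in summary.values())
    return tokens / documents if documents else 0.0
```

Because iterating over a `datasets` split yields one dictionary per row, calling `summarize(ds["train"])` on the dataset loaded above should give counts that can be compared with the tables in this card.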
Provider:
sapinsapin