sapinsapin/halohalo
Hugging Face | Updated: 2026-03-28 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/sapinsapin/halohalo
Resource summary:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: dump
    dtype: string
  - name: file_path
    dtype: string
  - name: detected_lang
    dtype: string
  - name: word_count
    dtype: int64
  - name: title
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: token_count
    dtype: int64
  - name: content_hash
    dtype: string
  - name: crawled_at
    dtype: string
  splits:
  - name: train
    num_bytes: 167232715.0
    num_examples: 41767
  - name: test
    num_bytes: 10732830.0
    num_examples: 3769
  download_size: 73712640
  dataset_size: 177965545.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
# halohalo
## Dataset Summary
`halohalo` is a pretraining text corpus for Philippine languages,
assembled from web-scraped data. It is formatted to be compatible with [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) for LLM pretraining.
## Source Data
The corpus is derived from the following cleaned datasets:
| Source | Documents |
|---|---|
| `halo-hil` | 8,874 |
| `halo-tgl` | 6,589 |
| `halo-bcl` | 1,264 |
Each source dataset was cleaned using `clean_halo.py` to remove web boilerplate, navigation menus,
markdown noise, HTML artifacts, and low-quality documents before being included here.
## Processing
1. **Cleaning** (`clean_halo.py`): strips boilerplate, HTML, and markdown noise, then drops documents
   with fewer than 30 words or less than 40% Latin characters
2. **FineWeb formatting** (`prep_halohalo.py`): adds the `source`, `language`, `token_count`, and
   `content_hash` fields, then deduplicates against existing documents by MD5 content hash
Processing code is available at [github.com/sapinsapin/halohalo](https://github.com/sapinsapin/halohalo).
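The two steps above can be sketched as follows. This is a minimal illustration and not the code in `clean_halo.py` or `prep_halohalo.py`: the function names are hypothetical, and it assumes that "Latin characters" is measured as a share of alphabetic characters and that the MD5 hash is taken over the UTF-8 bytes of `text`.

```python
import hashlib
import unicodedata


def latin_ratio(text: str) -> float:
    """Share of alphabetic characters whose Unicode name marks them as Latin.

    Assumption: the denominator is letters only, not all characters.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def keep_document(text: str, min_words: int = 30, min_latin: float = 0.40) -> bool:
    """Apply the two documented thresholds: at least 30 words, at least 40% Latin."""
    return len(text.split()) >= min_words and latin_ratio(text) >= min_latin


def content_hash(text: str) -> str:
    """MD5 hex digest of the text (assumed UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(docs, seen_hashes=None):
    """Keep the first document per hash, skipping hashes that are already known."""
    seen = set(seen_hashes or ())
    kept = []
    for doc in docs:
        digest = content_hash(doc["text"])
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({**doc, "content_hash": digest})
    return kept
```

Passing the hashes of already-published documents as `seen_hashes` mirrors the "deduplicates against existing documents" behaviour described above; exact-match hashing only removes byte-identical texts, not near-duplicates.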
## Statistics
| Metric | Value |
|---|---|
| Total documents | 16,727 |
| Total tokens (whitespace-delimited words) | 19,178,582 |
| Avg tokens per document | 1,146.6 |
| Min tokens | 30 |
| Max tokens | 10,552 |
### Languages
| Language | Documents | Word Count |
|---|---|---|
| `hil` | 8,874 | 9,332,784 |
| `tgl` | 6,589 | 8,208,749 |
| `bcl` | 1,264 | 1,637,049 |
| **Total** | **16,727** | **19,178,582** |
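The figures in these tables can be recomputed from the records themselves. The sketch below assumes each record is a dict with the `language` and `token_count` fields; `summarize` is a hypothetical helper, not part of the published processing code. The last lines check the reported totals using only the numbers shown above.

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate document and token counts per language, plus overall totals."""
    per_language = defaultdict(lambda: {"documents": 0, "tokens": 0})
    for row in rows:
        stats = per_language[row["language"]]
        stats["documents"] += 1
        stats["tokens"] += row["token_count"]
    total_docs = sum(s["documents"] for s in per_language.values())
    total_tokens = sum(s["tokens"] for s in per_language.values())
    return {
        "per_language": dict(per_language),
        "total_documents": total_docs,
        "total_tokens": total_tokens,
        "avg_tokens": total_tokens / total_docs if total_docs else 0.0,
    }


# Consistency check using only the figures reported in the tables above.
reported_docs = 8_874 + 6_589 + 1_264
reported_tokens = 9_332_784 + 8_208_749 + 1_637_049
print(reported_docs, reported_tokens, round(reported_tokens / reported_docs, 1))
# prints: 16727 19178582 1146.6
```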
## Schema
| Field | Type | Description |
|---|---|---|
| `text` | `str` | Cleaned document text |
| `id` | `str` | Unique document identifier |
| `source` | `str` | Source dataset name |
| `language` | `str` | ISO 639-3 language code |
| `token_count` | `int` | Whitespace-tokenized word count |
| `content_hash` | `str` | MD5 hash of text for deduplication |
| `url` | `str` | Source URL |
| `date` | `str` | Crawl date |
| `dump` | `str` | CommonCrawl dump identifier |
| `title` | `str` | Page title |
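A quick way to confirm that a record matches the table above is to check field presence and Python type. This is an illustrative sketch: `EXPECTED_FIELDS` and `schema_problems` are hypothetical names, and the check assumes no field is null (a record with a missing title, for example, would be reported as a type mismatch).

```python
# Field names and types copied from the schema table above.
EXPECTED_FIELDS = {
    "text": str,
    "id": str,
    "source": str,
    "language": str,
    "token_count": int,
    "content_hash": str,
    "url": str,
    "date": str,
    "dump": str,
    "title": str,
}


def schema_problems(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Extra fields in a record are ignored, so the check still passes for records that carry columns not listed in the table.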
## Usage
```python
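from datasets import load_dataset

# Load every split (train and test) from the Hugging Face Hub.
ds = load_dataset("sapinsapin/halohalo")

# Inspect the first training record.
print(ds["train"][0])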
```
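To work with a single language without downloading everything first, the records can be streamed and filtered on the `language` field. The sketch below is one possible approach: `iter_language` and `stream_tagalog` are hypothetical helpers, and `stream_tagalog` needs network access and the `datasets` package, so it is defined here but not called.

```python
from typing import Dict, Iterable, Iterator


def iter_language(rows: Iterable[Dict], code: str) -> Iterator[Dict]:
    """Yield only the records whose `language` field equals `code`."""
    for row in rows:
        if row.get("language") == code:
            yield row


def stream_tagalog(limit: int = 3) -> None:
    """Print the titles of the first few Tagalog records (requires network access)."""
    # Imported lazily so iter_language stays usable without the datasets package.
    from datasets import load_dataset

    stream = load_dataset("sapinsapin/halohalo", split="train", streaming=True)
    for i, row in enumerate(iter_language(stream, "tgl")):
        if i >= limit:
            break
        print(row["title"])
```

The language codes in the card are `hil`, `tgl`, and `bcl`; any of them can be passed as `code`.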
Provider: sapinsapin