img-gemina/indonesian-corpus-2b-deepclean-indo4b
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/img-gemina/indonesian-corpus-2b-deepclean-indo4b
Resource description:
---
license: cc-by-4.0
language:
- id
tags:
- indonesian
- corpus
- pretraining
size_categories:
- 10M<n<100M
---
# Indonesian Corpus (Deep-Clean Indo4B)
This dataset is a deep-cleaned Indonesian corpus built from Indo4B local files.
## Current published artifact
- `indonesian_corpus_1p8b_deepclean_indo4b.jsonl.gz`
- Total tokens: `1801814049`
- Total docs: `25553595`
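> Note: The initial target was 2B tokens, but under strict filtering the run reached the end of the input (EOF) at ~1.8B tokens. This is why the published file is named `1p8b` while the repository name keeps `2b`.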
## JSONL schema
```json
{
  "id": "indo4b_local_XXXXXXXXXX",
  "text": "...",
  "source": "...",
  "token_count": 123,
  "quality_score": 0.98
}
```
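As a minimal sketch of how a consumer might stream the file and sanity-check it against the schema above, the following reads a gzipped JSONL file line by line, reports schema problems, and tallies documents and tokens so the result can be compared with the published totals. The function names (`iter_records`, `schema_problems`, `tally`) are hypothetical, and the field types are inferred from the example record rather than from a formal specification.

```python
import gzip
import json

# Field names come from the schema example; the types are inferred from it.
EXPECTED_FIELDS = {
    "id": str,
    "text": str,
    "source": str,
    "token_count": int,
    "quality_score": float,
}


def iter_records(path):
    """Yield one parsed record per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def schema_problems(record):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if expected is float:
            # Accept ints too, since JSON may encode a score of 1.0 as 1.
            ok = isinstance(value, (int, float))
        else:
            ok = isinstance(value, expected)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not ok:
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


def tally(path):
    """Return (document count, sum of token_count) for the whole file."""
    docs = 0
    tokens = 0
    for record in iter_records(path):
        docs += 1
        tokens += record["token_count"]
    return docs, tokens
```

Assuming the artifact has been downloaded to the working directory, `tally("indonesian_corpus_1p8b_deepclean_indo4b.jsonl.gz")` can be compared against the totals listed under "Current published artifact". Streaming keeps memory use flat, which matters for a file with tens of millions of records.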
## Cleaning summary
- Unicode fix: `ftfy`
- Hard filters: length, digit ratio, whitespace ratio, URL count, duplicate-line ratio
- Exact dedup: `xxhash` + SQLite global set
- Blocklists: porn / gambling / high-risk illegal terms
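The hard filters and exact-dedup steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: every threshold, the URL pattern, and the names `passes_hard_filters`, `SeenSet`, and `text_digest` are hypothetical, since the card does not publish them. The `ftfy` and blocklist steps are left out. If the third-party `xxhash` package is not installed, the sketch falls back to `hashlib.blake2b` purely so that it still runs.

```python
import re
import sqlite3

try:
    import xxhash  # third-party hash library named in the cleaning summary

    def text_digest(text):
        """64-bit xxHash digest of the UTF-8 encoded text."""
        return xxhash.xxh64(text.encode("utf-8")).digest()
except ImportError:
    import hashlib

    def text_digest(text):
        """Stand-in 64-bit digest so the sketch runs without xxhash."""
        return hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()

URL_RE = re.compile(r"https?://\S+")  # assumed URL pattern

# Hypothetical thresholds; the real values are not stated in the card.
MIN_CHARS = 50
MAX_DIGIT_RATIO = 0.3
MAX_WHITESPACE_RATIO = 0.4
MAX_URLS = 3
MAX_DUP_LINE_RATIO = 0.3


def passes_hard_filters(text):
    """Apply length, digit, whitespace, URL, and duplicate-line checks."""
    n = len(text)
    if n < MIN_CHARS:
        return False
    if sum(c.isdigit() for c in text) / n > MAX_DIGIT_RATIO:
        return False
    if sum(c.isspace() for c in text) / n > MAX_WHITESPACE_RATIO:
        return False
    if len(URL_RE.findall(text)) > MAX_URLS:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        # Fraction of non-empty lines that repeat an earlier line.
        dup_ratio = 1 - len(set(lines)) / len(lines)
        if dup_ratio > MAX_DUP_LINE_RATIO:
            return False
    return True


class SeenSet:
    """Global set of document hashes backed by SQLite."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (h BLOB PRIMARY KEY)"
        )

    def add_if_new(self, text):
        """Insert the text's hash; return True only if it was not seen."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (h) VALUES (?)",
            (text_digest(text),),
        )
        return cur.rowcount == 1

    def commit(self):
        """Persist pending inserts (needed for an on-disk database)."""
        self.conn.commit()
```

A document would be kept only when `passes_hard_filters(text) and seen.add_if_new(text)` is true. Storing hashes in SQLite rather than in an in-memory Python set lets the seen-set grow past available RAM and survive restarts, at the cost of slower lookups.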
Provider:
img-gemina



