Carlos1411/Zelaihandi-R
Source: Hugging Face · Updated 2026-02-25 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Carlos1411/Zelaihandi-R
Resource description:
---
language:
- eu
pretty_name: ZelaiHandiClean 🤠
task_categories:
- text-generation
size_categories:
- 100M<n<1B
---
## Dataset Summary
ZelaiHandi-R (R = Refined) is the definitive, heavily denoised version of the original ZelaiHandi corpus, designed to maximize training efficiency and signal density under limited computational resources. It is also augmented with books from Booktegui and Wikipedia articles to support high-efficiency language-modeling experiments.
This version prioritizes quality over raw size. A substantial amount of noise has been aggressively removed, including duplicated fragments, malformed extractions, structural scraping artifacts, low-signal text, formatting corruption, and non-linguistic content.
The objective of this release is not incremental cleaning but maximizing training efficiency.
By drastically reducing noise, this dataset:
- Increases information density per token
- Reduces wasted gradient updates
- Improves convergence speed
- Enhances training stability
- Is especially effective for small and mid-sized models
- Is explicitly optimized for compute-constrained environments
When computational resources are limited (low VRAM, small parameter counts, limited training budget), data quality becomes critical.
This definitive version is engineered to extract maximum performance per training step.
For example, these are the statistics for the Ekaia subset:
| Metric | Value |
|-----------------------------------------|------------:|
| Initial characters (Ekaia subset) | 14,480,942 |
| Final characters (after cleaning) | 12,746,071 |
| Overall cleaned | 11.98 % |
| Extra spaces removed | 0.03 % |
| Blank lines removed | 15.47 % |
| Non-linguistic characters removed | 11.96 % |
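The "Overall cleaned" figure can be reproduced from the two character counts in the table, assuming it is the relative reduction in character count. The variable names below are illustrative only.

```python
# Character counts copied from the Ekaia table above.
initial_chars = 14_480_942
final_chars = 12_746_071

# Assumption: "Overall cleaned" = share of characters removed.
removed_chars = initial_chars - final_chars
overall_cleaned_pct = 100 * removed_chars / initial_chars

print(f"Removed {removed_chars:,} characters ({overall_cleaned_pct:.2f} %)")
```

The other percentages in the table are not derived here, because the card does not state which totals they are measured against.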
The cleaning process in this definitive release goes beyond surface normalization and focuses on improving the statistical integrity of the corpus.
## Supported Tasks
- Causal language modeling
- Masked language modeling
- Next-sentence prediction
- Any downstream Basque NLP task (fine-tuning)
## Languages
- Basque (`eu`)
## Dataset Statistics
| Metric | Value |
|-----------------------------------------|------------:|
| Total words | 660 million |
| Disk size | 4.6 GB |
| Additional books scraped from Booktegui | 400 |
| Wikipedia articles added | +2,500 |
## Dataset Structure
Each example in the JSONL files has the following schema:
{
  "id": "unique for each document",
  "periodico": "source",
  "lugar": "geographic focus of the source",
  "dominio": "type of content (articles, news, books…)",
  "texto": "cleaned high-quality text used for training",
  "licencia": "document license"
}
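A minimal sketch of reading records with this schema, assuming one JSON object per line. The function name `iter_records` and the sample record are hypothetical and are not part of any tooling shipped with the dataset.

```python
import io
import json

# Field names taken from the schema above.
EXPECTED_FIELDS = {"id", "periodico", "lugar", "dominio", "texto", "licencia"}


def iter_records(lines):
    """Yield one dict per non-empty JSONL line, checking the expected keys."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record


# Made-up sample record standing in for a real JSONL file.
sample_record = {
    "id": "doc-0001",
    "periodico": "example-source",
    "lugar": "example-place",
    "dominio": "news",
    "texto": "Kaixo, mundua!",
    "licencia": "CC-BY-SA",
}
sample_file = io.StringIO(json.dumps(sample_record, ensure_ascii=False) + "\n")

for record in iter_records(sample_file):
    print(record["id"], len(record["texto"]))
```

For a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`.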
The dataset has no predefined train/validation split. I would simply divide it into 100 chunks of text and hold out one of them for validation, to check that the model is generalizing well.
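One way to get a reproducible hold-out of roughly 1 in 100 documents is sketched below. Note that it swaps the contiguous-chunk idea for hash-based bucketing on the `id` field, so the split stays the same across runs and file orderings. The helper names and the choice of bucket 0 are assumptions made for illustration.

```python
import hashlib

N_BUCKETS = 100          # mirrors the "100 chunks" suggestion
VALIDATION_BUCKET = 0    # arbitrary choice of the held-out bucket


def bucket_of(doc_id: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a document id to a stable bucket in [0, n_buckets)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def is_validation(doc_id: str) -> bool:
    """True if the document belongs to the held-out bucket."""
    return bucket_of(doc_id) == VALIDATION_BUCKET


# Hypothetical ids standing in for the dataset's "id" field.
doc_ids = [f"doc-{i}" for i in range(10_000)]
val_ids = [d for d in doc_ids if is_validation(d)]
train_ids = [d for d in doc_ids if not is_validation(d)]

print(len(train_ids), len(val_ids))
```

Because the bucket depends only on the id, the same documents end up in validation every time, which keeps benchmarks comparable.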
## Data Collection and Cleaning
1. **Original Source**
   - ZelaiHandi dataset (Basque news, books, articles)
2. **Cleaning Steps**
   - Removed extra whitespace and blank lines
   - Normalized Unicode characters
   - Stripped non-linguistic symbols (HTML tags, control characters)
3. **Augmentation**
   - +400 books (_liburuak_) scraped from Booktegui
   - +2,500 articles from the Basque Wikipedia (Wikipediabi)
   - Wikipedia Berria and Legebiltzarra datasets were ingested in **three parts** to avoid interface issues
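The cleaning steps listed above could look roughly like the sketch below. This is not the actual pipeline used to build the dataset. The NFC normalization form, the simple tag regex, and the function name `clean_text` are all assumptions.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # naive HTML tag matcher (assumption)
SPACES_RE = re.compile(r"[ \t]+")      # runs of spaces or tabs


def clean_text(text: str) -> str:
    """Apply Unicode normalization, tag/control stripping and whitespace cleanup."""
    # Normalize Unicode; NFC is assumed, the card does not name the form.
    text = unicodedata.normalize("NFC", text)
    # Strip HTML tags.
    text = TAG_RE.sub("", text)
    # Drop control characters (category Cc), keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse extra spaces on each line and remove blank lines.
    lines = (SPACES_RE.sub(" ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


print(clean_text("<p>Kaixo   mundua</p>\n\n\nEgun\x00 on"))
```

A regex is a crude way to remove markup; a real pipeline would more likely use an HTML parser for heavily nested or malformed pages.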
---
## Considerations for Use
- All text is provided as plain, untokenized text; you may wish to tokenize or further normalize it to suit your model’s requirements. I provide my own Basque tokenizer on GitHub (linked at the end of this card).
- Maintain consistent train/validation splits for reproducible benchmarks.
---
## License
Various Creative Commons licenses (CC-BY, CC-BY-SA).
See each JSONL record’s `"licencia"` field for details.
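Since licensing varies per document, it can help to tally the `licencia` values before training. The sketch below does this over already-parsed records. The sample records and license strings are made up, and the exact strings stored in the real field may differ.

```python
from collections import Counter


def count_licenses(records):
    """Count how many records carry each value of the "licencia" field."""
    return Counter(record.get("licencia", "unknown") for record in records)


# Made-up records; only the "licencia" key matters for this tally.
sample_records = [
    {"id": "a", "licencia": "CC-BY"},
    {"id": "b", "licencia": "CC-BY-SA"},
    {"id": "c", "licencia": "CC-BY-SA"},
    {"id": "d"},  # record without a license field
]

license_counts = count_licenses(sample_records)
for license_name, n_docs in license_counts.most_common():
    print(license_name, n_docs)
```

The same counter can be used to keep only documents whose license fits a given use case.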
---
## Citation
If you use this dataset, please cite:
> Orai NLP Teknologiak (2025). *ZelaiHandi + Booktegui + Wikipediabi Basque Corpus*. CC-BY-SA.
---
## Acknowledgements
Special thanks to:
- Iñaki San Vicente
- Gorka Urbizu
- Ander Corral
- Zuhaitz Beloki
- Xabier Saralegi
…for creating the original ZelaiHandi dataset, which served as the foundation for this cleaned and slightly expanded corpus.
Tokenizer: https://github.com/Carlos141100/txamp-tokenizer-v12
Provider:
Carlos1411



