AdaMLLab/AraMix-HQ
Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AdaMLLab/AraMix-HQ
Description:
---
language:
- ar
license: other
task_categories:
- text-generation
arxiv: 2512.18834
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.parquet"
---
<img src="https://huggingface.co/datasets/AdaMLLab/AraMix-HQ/resolve/main/finetasks_arabic_hq_comparison.png" width="900" alt="Finetasks benchmark scores comparing AraMix-HQ against AraMix-Matched and FineWeb2-HQ.">
<p align="center">
<a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
<img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
</a>
</p>
**AraMix family:** [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) (minhash and matched) | [AraMix-domain-classified](https://huggingface.co/datasets/AdaMLLab/AraMix-domain-classified) (with domain labels) | [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) (model-filtered)
AraMix-HQ is a high-quality subset of [AraMix-MinHash](https://huggingface.co/datasets/AdaMLLab/AraMix) created using model-based quality scoring. We adapt the approach from [FineWeb2-HQ](https://arxiv.org/abs/2502.10361) but replace the XLM-RoBERTa encoder with [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-small), which provides better Arabic language understanding. We release the resulting classifier at [AdaMLLab/mmBERT-Arabic-Quality-Classifier](https://huggingface.co/AdaMLLab/mmBERT-Arabic-Quality-Classifier).
AraMix-HQ outperforms both AraMix-Matched and FineWeb2-HQ on Arabic FineTasks benchmarks.
## Usage
```python
from datasets import load_dataset
ds = load_dataset("AdaMLLab/AraMix-HQ")
```
## Method
1. Start with AraMix-MinHash (178B tokens, 179M documents)
2. Score documents using mmBERT-based classifiers trained to identify structured, knowledge-rich content
3. Filter to retain high-scoring samples
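
The filtering step can be illustrated with a minimal sketch. The card does not state the cutoff or whether it is an absolute score threshold or a retained fraction, so the helper `keep_top_fraction`, the `keep_fraction` value, and the toy scores below are all hypothetical and stand in for whatever the actual pipeline uses.

```python
from typing import List, Sequence, Tuple


def keep_top_fraction(
    scored_docs: Sequence[Tuple[str, float]], keep_fraction: float
) -> List[str]:
    """Return the texts whose quality score ranks in the top `keep_fraction`.

    Hypothetical helper: the real cutoff used for AraMix-HQ is not given
    in this card, so `keep_fraction` is an illustrative assumption.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank documents from highest to lowest classifier score.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    # Keep at least one document when the input is non-empty.
    n_keep = max(1, int(len(ranked) * keep_fraction)) if ranked else 0
    return [text for text, _ in ranked[:n_keep]]


# Toy (text, score) pairs; in practice the scores would come from the classifier.
toy_docs = [("doc a", 0.91), ("doc b", 0.12), ("doc c", 0.77), ("doc d", 0.40)]
kept = keep_top_fraction(toy_docs, keep_fraction=0.5)
print(kept)
```

Ranking by score and keeping a fixed fraction makes the size of the output predictable, whereas an absolute threshold would tie the output size to the score distribution; which of the two was used here is not specified.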
## Citation
```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```
## License
See individual source dataset licenses.
Provider:
AdaMLLab



