
AdaMLLab/AraMix-HQ

Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/AdaMLLab/AraMix-HQ
Description:
---
language:
- ar
license: other
task_categories:
- text-generation
arxiv: 2512.18834
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.parquet"
---

<img src="https://huggingface.co/datasets/AdaMLLab/AraMix-HQ/resolve/main/finetasks_arabic_hq_comparison.png" width="900" alt="Finetasks benchmark scores comparing AraMix-HQ against AraMix-Matched and FineWeb2-HQ.">

<p align="center">
  <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
    <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
  </a>
</p>

**AraMix family:** [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) (minhash and matched) | [AraMix-domain-classified](https://huggingface.co/datasets/AdaMLLab/AraMix-domain-classified) (with domain labels) | [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) (model-filtered)

AraMix-HQ is a high-quality subset of [AraMix-MinHash](https://huggingface.co/datasets/AdaMLLab/AraMix) created using model-based quality scoring. We adapt the approach from [FineWeb2-HQ](https://arxiv.org/abs/2502.10361) but replace the XLM-Roberta encoder with [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-small), which provides better Arabic language understanding. We release the model at [AdaMLLab/mmBERT-Arabic-Quality-Classifier](https://huggingface.co/AdaMLLab/mmBERT-Arabic-Quality-Classifier).

AraMix-HQ outperforms both AraMix-Matched and FineWeb2-HQ on Arabic FineTasks benchmarks.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("AdaMLLab/AraMix-HQ")
```

## Method

1. Start with AraMix-MinHash (178B tokens, 179M documents)
2. Score documents using mmBERT-based classifiers trained to identify structured, knowledge-rich content
3. Filter to retain high-scoring samples

## Citation

```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```

## License

See individual source dataset licenses.
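
The score-and-filter steps listed under Method can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' pipeline: `filter_by_quality`, `toy_score`, the `"text"` field name, the `"quality_score"` key, and the `threshold` cutoff are all hypothetical. The card does not say whether filtering uses a fixed threshold or a top-fraction cut, and `toy_score` is a trivial placeholder standing in for the mmBERT-based classifier.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[dict],
    score_fn: Callable[[str], float],
    threshold: float,
    text_key: str = "text",  # assumed field name, not confirmed by the card
) -> Iterator[dict]:
    """Yield documents whose quality score is at least `threshold`."""
    for doc in docs:
        score = score_fn(doc[text_key])
        if score >= threshold:
            # Attach the score so later stages can inspect it.
            yield {**doc, "quality_score": score}


def toy_score(text: str) -> float:
    """Placeholder scorer: share of whitespace-separated tokens longer than 3 characters.

    Stands in for a learned classifier; it is not a real quality signal.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(t) > 3 for t in tokens) / len(tokens)


docs = [
    {"text": "aa bb cc dd"},
    {"text": "structured knowledge content here"},
    {"text": ""},
]

# Keep only documents scoring at or above the (hypothetical) cutoff.
kept = list(filter_by_quality(docs, toy_score, threshold=0.5))
```

Because `filter_by_quality` is a generator over any iterable of dictionaries, the same shape would also work on a lazily streamed corpus, with a real classifier passed in as `score_fn`.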
Provider:
AdaMLLab