
AdaMLLab/AraMix-domain-classified

Hugging Face | Updated 2026-01-30 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/AdaMLLab/AraMix-domain-classified
Description:
---
dataset_info:
- config_name: minhash_deduped
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: train
    num_bytes: 841165473160
    num_examples: 178883241
  download_size: 403831795491
  dataset_size: 841165473160
- config_name: sentence_deduped
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: train
    num_bytes: 768286568395
    num_examples: 167571677
  download_size: 365355252381
  dataset_size: 768286568395
configs:
- config_name: minhash_deduped
  data_files:
  - split: train
    path: minhash_deduped/train-*
- config_name: sentence_deduped
  data_files:
  - split: train
    path: sentence_deduped/train-*
language:
- ar
license: other
task_categories:
- text-generation
arxiv: 2512.18834
---

# AraMix Domain-Classified

<p align="center">
  <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
    <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
  </a>
</p>

**AraMix family:** [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) (minhash and matched) | [AraMix-domain-classified](https://huggingface.co/datasets/AdaMLLab/AraMix-domain-classified) (with domain labels) | [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) (model-filtered)

This is [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) with per-document domain labels from [`nvidia/multilingual-domain-classifier`](https://huggingface.co/nvidia/multilingual-domain-classifier).

## Usage

```python
from datasets import load_dataset

# Load the `minhash_deduped` configuration.
ds_minhash = load_dataset("AdaMLLab/AraMix-domain-classified", "minhash_deduped")

# Load the `sentence_deduped` configuration.
ds_sentence = load_dataset("AdaMLLab/AraMix-domain-classified", "sentence_deduped")
```

## Schema

| Field | Description |
|-------|-------------|
| `text` | Document text |
| `id` | Document ID |
| `source` | Origin dataset (e.g., CulturaX, ArabicWeb24) |
| `domain` | Predicted domain category |

## Domain Distribution (MinHash)

| Domain | Tokens | Share (%) |
|--------|--------|-----------|
| People and Society | 26.4B | 14.8 |
| News | 25.1B | 14.1 |
| Business and Industrial | 19.1B | 10.7 |
| Sensitive Subjects | 15.1B | 8.5 |
| Health | 9.6B | 5.4 |
| Finance | 8.6B | 4.8 |
| Sports | 8.1B | 4.6 |
| Arts and Entertainment | 6.7B | 3.8 |
| Books and Literature | 6.5B | 3.7 |
| Jobs and Education | 6.5B | 3.7 |
| Other (16 categories) | 46.1B | 25.9 |

The full distribution across all 26 categories is available in the [paper](https://arxiv.org/abs/2512.18834).

## Citation

```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```

## License

See the licenses of the individual source datasets.
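
## Example: Streaming One Domain

Every record carries a `domain` label, and each configuration is a download of roughly 400 GB, so one way to inspect a single domain is to stream the `train` split and filter on that field rather than download everything first. The sketch below is a minimal illustration and is not part of the original card. The helper names `filter_by_domain`, `domain_counts`, and `stream_domain_sample` are hypothetical, the toy records are made up, and the exact label strings (for example `"Health"`) are assumed to match the domain names in the table above; check them against the data before relying on them.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Set


def filter_by_domain(
    records: Iterable[Dict[str, str]], domains: Set[str]
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose `domain` label is in `domains`."""
    for record in records:
        if record.get("domain") in domains:
            yield record


def domain_counts(records: Iterable[Dict[str, str]]) -> Counter:
    """Count documents per `domain` label."""
    return Counter(record["domain"] for record in records)


def stream_domain_sample(
    config: str = "minhash_deduped",
    domains: Iterable[str] = ("Health",),  # assumed label spelling
    limit: int = 5,
) -> List[Dict[str, str]]:
    """Stream the train split and return up to `limit` matching documents.

    Needs network access and the `datasets` package, which is imported
    lazily so the helpers above can be used without it.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "AdaMLLab/AraMix-domain-classified",
        config,
        split="train",
        streaming=True,  # iterate over shards without a full download
    )
    return list(islice(filter_by_domain(stream, set(domains)), limit))


# Offline demonstration on made-up records that follow the card's schema.
toy_records = [
    {"text": "doc a", "id": "a", "source": "CulturaX", "domain": "Health"},
    {"text": "doc b", "id": "b", "source": "ArabicWeb24", "domain": "News"},
    {"text": "doc c", "id": "c", "source": "CulturaX", "domain": "Health"},
    {"text": "doc d", "id": "d", "source": "ArabicWeb24", "domain": "Sports"},
]

health_docs = list(filter_by_domain(toy_records, {"Health"}))
counts = domain_counts(toy_records)
print([doc["id"] for doc in health_docs])
print(counts.most_common())
```

Calling `stream_domain_sample()` applies the same filter to the hosted data. Streaming scans documents in order, so a rare domain may take a long time to yield `limit` matches.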
Provider:
AdaMLLab