AdaMLLab/AraMix-domain-classified
Hugging Face | Updated 2026-01-30 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/AdaMLLab/AraMix-domain-classified
Resource description:
---
dataset_info:
- config_name: minhash_deduped
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: train
    num_bytes: 841165473160
    num_examples: 178883241
  download_size: 403831795491
  dataset_size: 841165473160
- config_name: sentence_deduped
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: source
    dtype: string
  - name: domain
    dtype: string
  splits:
  - name: train
    num_bytes: 768286568395
    num_examples: 167571677
  download_size: 365355252381
  dataset_size: 768286568395
configs:
- config_name: minhash_deduped
  data_files:
  - split: train
    path: minhash_deduped/train-*
- config_name: sentence_deduped
  data_files:
  - split: train
    path: sentence_deduped/train-*
language:
- ar
license: other
task_categories:
- text-generation
arxiv: 2512.18834
---
# AraMix Domain-Classified
<p align="center">
<a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
<img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
</a>
</p>
**AraMix family:** [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) (minhash and matched) | [AraMix-domain-classified](https://huggingface.co/datasets/AdaMLLab/AraMix-domain-classified) (with domain labels) | [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) (model-filtered)
This is [AraMix](https://huggingface.co/datasets/AdaMLLab/AraMix) with per-document domain labels from [`nvidia/multilingual-domain-classifier`](https://huggingface.co/nvidia/multilingual-domain-classifier).
## Usage
```python
from datasets import load_dataset

# MinHash-deduplicated configuration
ds_minhash = load_dataset("AdaMLLab/AraMix-domain-classified", "minhash_deduped")

# Sentence-deduplicated configuration
ds_sentence = load_dataset("AdaMLLab/AraMix-domain-classified", "sentence_deduped")
```
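Each configuration is several hundred gigabytes (the metadata lists download sizes of roughly 404 GB and 365 GB), so streaming may be more practical than a full download. The following is a minimal sketch, assuming the `datasets` library's `streaming=True` mode. `stream_config` and `take_domain` are hypothetical helper names, and the exact label strings stored in `domain` should be checked against the data rather than taken from this example.

```python
from itertools import islice


def stream_config(config_name="minhash_deduped"):
    """Open one configuration as a streaming (iterable) dataset."""
    # Imported here so the filtering helper below also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "AdaMLLab/AraMix-domain-classified",
        config_name,
        split="train",
        streaming=True,  # read shards lazily instead of downloading everything first
    )


def take_domain(examples, domain, limit):
    """Return up to `limit` records whose `domain` field equals `domain`."""
    matches = (ex for ex in examples if ex["domain"] == domain)
    return list(islice(matches, limit))
```

With network access, `take_domain(stream_config(), "Health", limit=5)` would collect the first five records labeled `Health`, assuming that is how the label is spelled in the data.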
## Schema
| Field | Description |
|-------|-------------|
| `text` | Document text |
| `id` | Document ID |
| `source` | Origin dataset (e.g., CulturaX, ArabicWeb24) |
| `domain` | Predicted domain category |
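
To illustrate how the fields might be used together, the sketch below tallies documents per (`source`, `domain`) pair over any iterable of records. `count_by_source_and_domain` is a hypothetical helper, and the sample rows are invented placeholders that only mimic the schema; they are not real dataset content.

```python
from collections import Counter


def count_by_source_and_domain(examples):
    """Count records per (source, domain) pair."""
    return Counter((ex["source"], ex["domain"]) for ex in examples)


# Placeholder rows that follow the schema; the values are invented for illustration.
sample = [
    {"text": "...", "id": "doc-0", "source": "CulturaX", "domain": "News"},
    {"text": "...", "id": "doc-1", "source": "CulturaX", "domain": "News"},
    {"text": "...", "id": "doc-2", "source": "ArabicWeb24", "domain": "Health"},
]

counts = count_by_source_and_domain(sample)
for (source, domain), n in counts.most_common():
    print(f"{source}\t{domain}\t{n}")
```

The same function accepts a streaming dataset or a slice of one, since it only needs an iterable of dictionaries.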
## Domain Distribution (MinHash)
| Domain | Tokens | % |
|--------|--------|---|
| People and Society | 26.4B | 14.8 |
| News | 25.1B | 14.1 |
| Business and Industrial | 19.1B | 10.7 |
| Sensitive Subjects | 15.1B | 8.5 |
| Health | 9.6B | 5.4 |
| Finance | 8.6B | 4.8 |
| Sports | 8.1B | 4.6 |
| Arts and Entertainment | 6.7B | 3.8 |
| Books and Literature | 6.5B | 3.7 |
| Jobs and Education | 6.5B | 3.7 |
| Other (16 categories) | 46.1B | 25.9 |
The full distribution across all 26 categories is available in the [paper](https://arxiv.org/abs/2512.18834).
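
As a quick consistency check, the percentages can be recomputed from the token counts in the table above. This is a small sketch using only the rounded figures from the table, so the implied total is approximate; the variable names are illustrative.

```python
# Token counts (in billions) and reported shares, copied from the table above.
tokens_b = {
    "People and Society": 26.4,
    "News": 25.1,
    "Business and Industrial": 19.1,
    "Sensitive Subjects": 15.1,
    "Health": 9.6,
    "Finance": 8.6,
    "Sports": 8.1,
    "Arts and Entertainment": 6.7,
    "Books and Literature": 6.5,
    "Jobs and Education": 6.5,
    "Other (16 categories)": 46.1,
}
reported_pct = {
    "People and Society": 14.8,
    "News": 14.1,
    "Business and Industrial": 10.7,
    "Sensitive Subjects": 8.5,
    "Health": 5.4,
    "Finance": 4.8,
    "Sports": 4.6,
    "Arts and Entertainment": 3.8,
    "Books and Literature": 3.7,
    "Jobs and Education": 3.7,
    "Other (16 categories)": 25.9,
}

# Total implied by the rounded per-domain counts.
total_b = sum(tokens_b.values())

# Share of each domain as a percentage of that total.
shares = {domain: 100 * t / total_b for domain, t in tokens_b.items()}

for domain, share in shares.items():
    print(f"{domain:<26}{share:5.1f}%  (reported {reported_pct[domain]})")
print(f"Implied total: {total_b:.1f}B tokens")
```

The recomputed shares agree with the reported column to one decimal place, and the rows sum to about 177.8B tokens.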
## Citation
```bib
@misc{alrashed2025mixminmatch,
title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
author={Sultan Alrashed and Francesco Orabona},
year={2025},
eprint={2512.18834v2},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.18834v2},
}
```
## License
See the licenses of the individual source datasets.
Provided by:
AdaMLLab



