almaghrabima/tokenizer-data

Name: almaghrabima/tokenizer-data
Creator: almaghrabima
Published: 2026-04-19 21:41:50
License: No description provided

Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/almaghrabima/tokenizer-data

Resource description:

---
language:
- en
- ar
pretty_name: DeepLatent bilingual tokenizer-data
size_categories:
- 1M<n<10M
task_categories:
- text-generation
tags:
- arabic
- english
- bilingual
- tokenizer-training
---

# DeepLatent bilingual tokenizer-data

Balanced English/Arabic corpus for tokenizer training. The two languages carry essentially the same number of Unicode codepoints, so a BPE/WordPiece tokenizer trained on this corpus sees equal representation of both languages by character content.

## Composition

| Slice | Source | Filter | Rows | Chars |
|-----------|--------|--------|------|-------|
| English | [`almaghrabima/deeplatent-hq-merged-dedup-token-counts`](https://huggingface.co/datasets/almaghrabima/deeplatent-hq-merged-dedup-token-counts) | GlotLID `language == "English"` | 3,980,035 | 12,377,830,142 |
| Arabic | [`AdaMLLab/AraMix-HQ`](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) | `mmbert_score >= 0.2784` ∧ `language != "English"` | 2,380,570 | 12,380,271,596 |
| **Total** | | | **6,360,605** | **24,758,101,738** |

The Arabic threshold `mmbert_score >= 0.2784` was chosen so the Arabic char count matches the English char count (balancing the two languages). This yields the top ~7% highest-scoring Arabic content in AraMix-HQ.

Language labels come from GlotLID (`cis-lmu/glotlid`) run on the first 2000 characters of each document. The HQ corpus is fully labeled. The AraMix-HQ source was only partially labeled (~38.5% of shards); the remaining Arabic rows in this merged release default to `"Arabic"`, since labeled AraMix-HQ shards were 96.4% Arabic.

## Schema

| Column | Type | Description |
|------------|--------|-------------|
| `text` | string | Document text |
| `source` | string | Original sub-source (e.g. `ar`, `en`, `lightonai/ArabicWeb24`) |
| `language` | string | GlotLID label: `"English"`, `"Arabic"`, or raw `lang_Script` |
| `origin` | string | `"hq"` or `"aramix"` |

## File layout

- `en_00000.parquet` … `en_00222.parquet`: 223 English shards
- `ar_00000.parquet` … `ar_00178.parquet`: 179 Arabic shards

Each file is zstd-compressed parquet.
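
The balancing rule described under Composition (pick a score cutoff so the kept Arabic characters match the English total) can be sketched as below. This is a minimal illustration under stated assumptions, not the procedure the dataset author actually ran: the function names `pick_score_threshold` and `chars_kept` are hypothetical, the input is assumed to be a list of `(score, char_count)` pairs, and character counts are assumed to be `len(text)`, which in Python counts Unicode codepoints.

```python
from typing import Sequence, Tuple

# Hypothetical input shape: one (score, char_count) pair per candidate document.
ScoredDoc = Tuple[float, int]


def pick_score_threshold(docs: Sequence[ScoredDoc], target_chars: int) -> float:
    """Return the highest cutoff such that docs with score >= cutoff
    contain at least target_chars characters."""
    # Visit the best-scoring documents first and accumulate their characters.
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    running = 0
    for score, chars in ranked:
        running += chars
        if running >= target_chars:
            # Documents tied at this score are all kept by `score >= cutoff`,
            # so the final total can overshoot the target slightly.
            return score
    raise ValueError("candidate documents do not contain enough characters")


def chars_kept(docs: Sequence[ScoredDoc], cutoff: float) -> int:
    """Total characters in documents that pass the `score >= cutoff` filter."""
    return sum(chars for score, chars in docs if score >= cutoff)


if __name__ == "__main__":
    # Toy data only; these are not values from the dataset.
    toy_docs = [(0.9, 100), (0.5, 300), (0.3, 200), (0.1, 400)]
    cutoff = pick_score_threshold(toy_docs, target_chars=550)
    print(cutoff, chars_kept(toy_docs, cutoff))
```

Because the cutoff can only fall on a score that some document actually has, the match is approximate rather than exact. In the published table the two slices differ by 2,441,454 characters, about 0.02% of either slice.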

Provided by:

almaghrabima
