
almaghrabima/tokenizer-data

Source: Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/almaghrabima/tokenizer-data
Resource description:
---
language:
- en
- ar
pretty_name: DeepLatent bilingual tokenizer-data
size_categories:
- 1M<n<10M
task_categories:
- text-generation
tags:
- arabic
- english
- bilingual
- tokenizer-training
---

# DeepLatent bilingual tokenizer-data

Balanced English/Arabic corpus for tokenizer training. The two languages carry essentially the same number of Unicode codepoints, so a BPE/WordPiece tokenizer trained on this corpus sees equal representation of both languages by character content.

## Composition

| Slice | Source | Filter | Rows | Chars |
|-----------|--------|--------|---------------|--------------------|
| English | [`almaghrabima/deeplatent-hq-merged-dedup-token-counts`](https://huggingface.co/datasets/almaghrabima/deeplatent-hq-merged-dedup-token-counts) | GlotLID `language == "English"` | 3,980,035 | 12,377,830,142 |
| Arabic | [`AdaMLLab/AraMix-HQ`](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ) | `mmbert_score >= 0.2784` ∧ `language != "English"` | 2,380,570 | 12,380,271,596 |
| **Total** | | | **6,360,605** | **24,758,101,738** |

The Arabic threshold `mmbert_score >= 0.2784` was chosen so the Arabic char count matches the English char count (balancing the two languages). This yields the top ~7% highest-scoring Arabic content in AraMix-HQ.

Language labels come from GlotLID (`cis-lmu/glotlid`) run on the first 2000 characters of each document. The HQ corpus is fully labeled. The AraMix-HQ source was only partially labeled (~38.5% of shards); the remaining Arabic rows in this merged release default to `"Arabic"`, since labeled AraMix-HQ shards were 96.4% Arabic.

## Schema

| Column | Type | Description |
|------------|--------|-----------------------------------------------------------------|
| `text` | string | Document text |
| `source` | string | Original sub-source (e.g. `ar`, `en`, `lightonai/ArabicWeb24`) |
| `language` | string | GlotLID label: `"English"`, `"Arabic"`, or raw `lang_Script` |
| `origin` | string | `"hq"` or `"aramix"` |

## File layout

- `en_00000.parquet` … `en_00222.parquet`: 223 English shards
- `ar_00000.parquet` … `ar_00178.parquet`: 179 Arabic shards

Each file is zstd-compressed parquet.
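
The balancing rule described under Composition (pick the Arabic score cutoff so that the Arabic character count matches the English one) can be sketched as below. This is a minimal illustration, not the author's script: `pick_score_threshold` and the `(score, char_count)` pair layout are hypothetical, and "keep the highest-scoring documents until the target is reached" is an assumption about how such a cutoff could be derived from a quality score like `mmbert_score`.

```python
from typing import Iterable, Tuple


def pick_score_threshold(
    docs: Iterable[Tuple[float, int]], target_chars: int
) -> Tuple[float, int]:
    """Return (threshold, kept_chars) for a filter of the form score >= threshold.

    docs holds (score, char_count) pairs. Both fields are hypothetical
    stand-ins for a quality score and len(text) of one document.
    The threshold is the highest score at which the kept documents
    contain at least target_chars characters.
    """
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    if not ranked:
        raise ValueError("no documents to select from")

    kept_chars = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Documents tied at the same score are taken together, because a
        # filter of the form score >= t cannot separate them.
        while i < len(ranked) and ranked[i][0] == score:
            kept_chars += ranked[i][1]
            i += 1
        if kept_chars >= target_chars:
            return score, kept_chars

    # The whole pool is smaller than the target: keep everything.
    return ranked[-1][0], kept_chars


# Toy example: three documents, target of 250 characters.
toy = [(0.9, 100), (0.5, 200), (0.3, 300)]
threshold, kept = pick_score_threshold(toy, target_chars=250)
print(threshold, kept)  # 0.5 300
```

In this sketch the kept total can only meet or exceed the target, never undershoot it, unless the pool itself is too small. On a real corpus the per-document character counts would come from the `text` column, and sorting several million pairs in memory is feasible but not free.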
Provided by:
almaghrabima