riotu-lab/tashkeel-arabic-sentences
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/riotu-lab/tashkeel-arabic-sentences
Description:
---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: ratio
    dtype: float64
  splits:
  - name: train
    num_bytes: 132200916
    num_examples: 272856
  download_size: 58738892
  dataset_size: 132200916
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc
task_categories:
- translation
- text-generation
language:
- ar
tags:
- NLP
- Arabic_Tashkeel
- Tashkeel
- diactarization
pretty_name: Tashkeel new dataset
size_categories:
- 100K<n<1M
---
This dataset contains Arabic sentences extracted from the `ImruQays/Alukah-Arabic` dataset.
Sentences were filtered based on their 'tashkeel' (Arabic diacritics) ratio,
with a minimum ratio of 0.3 (adjustable during extraction).
**Source:**
The original articles were sourced from the `ImruQays/Alukah-Arabic` dataset on Hugging Face.
**Processing:**
1. Articles were loaded from `ImruQays/Alukah-Arabic`.
2. Each article was split into individual sentences using a regex pattern.
3. For each sentence, the ratio of tashkeel characters to total characters was calculated.
4. Sentences with a tashkeel ratio greater than or equal to 0.3 were selected.
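The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' extraction script: the exact regex pattern and the exact set of characters counted as tashkeel are not given in this card, so `SENTENCE_SPLIT`, `TASHKEEL`, and the function names below are assumptions. The sketch counts the eight core marks U+064B through U+0652 and divides by the full sentence length, spaces included.

```python
import re

# Assumption: "tashkeel characters" means the eight core marks
# U+064B..U+0652 (tanween, fatha, damma, kasra, shadda, sukun).
TASHKEEL = frozenset(chr(cp) for cp in range(0x064B, 0x0653))

# Assumption: a sentence ends at '.', '!', '?', the Arabic question
# mark (U+061F), or a newline. The original pattern is not documented.
SENTENCE_SPLIT = re.compile(r"[.!?\u061F\n]+")


def split_sentences(article):
    """Split one article into stripped, non-empty sentences."""
    return [s.strip() for s in SENTENCE_SPLIT.split(article) if s.strip()]


def tashkeel_ratio(sentence):
    """Return tashkeel characters divided by total characters."""
    if not sentence:
        return 0.0
    marks = sum(1 for ch in sentence if ch in TASHKEEL)
    return marks / len(sentence)


def extract(articles, min_ratio=0.3):
    """Keep sentences whose tashkeel ratio is at least min_ratio."""
    rows = []
    for article in articles:
        for sentence in split_sentences(article):
            ratio = tashkeel_ratio(sentence)
            if ratio >= min_ratio:
                rows.append({"sentence": sentence, "ratio": ratio})
    return rows
```

Each returned row mirrors the two columns of this dataset, `sentence` and `ratio`. Raising `min_ratio` keeps only more densely diacritized sentences.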
This dataset is intended for tasks requiring heavily diacritized Arabic text,
such as text-to-speech, diacritization models, or linguistic analysis.
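For the diacritization use case, one common way to use such text is to strip the marks from each sentence and train a model to restore them. The sketch below is only an illustration under stated assumptions: `strip_tashkeel` and `make_pairs` are hypothetical helpers, rows are plain dicts with the `sentence` and `ratio` fields listed in the metadata, and only the marks U+064B through U+0652 are treated as tashkeel.

```python
import re

# Assumption: tashkeel is limited to the core marks U+064B..U+0652.
TASHKEEL_RE = re.compile(r"[\u064B-\u0652]")


def strip_tashkeel(text):
    """Remove diacritic marks, leaving the bare letters."""
    return TASHKEEL_RE.sub("", text)


def make_pairs(rows, min_ratio=0.3):
    """Build (undiacritized input, diacritized target) pairs.

    rows: iterable of dicts with "sentence" and "ratio" keys.
    min_ratio: optional stricter cut-off on the stored ratio.
    """
    return [
        (strip_tashkeel(row["sentence"]), row["sentence"])
        for row in rows
        if row["ratio"] >= min_ratio
    ]
```

Because every row already carries its `ratio`, a stricter threshold can be applied without recomputing anything from the text.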
Provider: riotu-lab



