riotu-lab/tashkeel-arabic-sentences

Name: riotu-lab/tashkeel-arabic-sentences
Creator: riotu-lab
Published: 2026-04-08 12:54:02
License: not described on this page (the dataset card metadata lists `cc`)

Hugging Face | Updated: 2026-04-08 | Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/riotu-lab/tashkeel-arabic-sentences

Description:

---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: ratio
    dtype: float64
  splits:
  - name: train
    num_bytes: 132200916
    num_examples: 272856
  download_size: 58738892
  dataset_size: 132200916
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc
task_categories:
- translation
- text-generation
language:
- ar
tags:
- NLP
- Arabic_Tashkeel
- Tashkeel
- diactarization
pretty_name: Tashkeel new dataset
size_categories:
- 100K<n<1M
---

This dataset contains Arabic sentences extracted from the `ImruQays/Alukah-Arabic` dataset. Sentences were filtered by their tashkeel (Arabic diacritics) ratio, with a minimum ratio of 0.3 (adjustable during extraction).

**Source:** The original articles were sourced from the `ImruQays/Alukah-Arabic` dataset on Hugging Face.

**Processing:**

1. Articles were loaded from `ImruQays/Alukah-Arabic`.
2. Each article was split into individual sentences using a regex pattern.
3. For each sentence, the ratio of tashkeel characters to total characters was calculated.
4. Sentences with a tashkeel ratio greater than or equal to 0.3 were selected.

This dataset is intended for tasks requiring heavily diacritized Arabic text, such as text-to-speech, diacritization models, or linguistic analysis.
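The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' extraction script: the dataset card does not give the regex or the exact set of tashkeel characters, so `SENTENCE_SPLIT`, `TASHKEEL`, `tashkeel_ratio`, and `extract_sentences` are all hypothetical names and choices. The sketch assumes tashkeel means the marks U+064B through U+0652 and that "total characters" counts every character in the sentence, spaces and punctuation included.

```python
import re

# Assumption: tashkeel = Arabic harakat U+064B..U+0652 (fathatan through sukun).
# The dataset card does not list the exact character set it used.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}

# Assumption: a sentence ends at '.', '!', '?', the Arabic question mark
# (U+061F), or a line break. The card only says "a regex pattern".
SENTENCE_SPLIT = re.compile(r"(?<=[.!?\u061F])\s+|\n+")


def tashkeel_ratio(sentence: str) -> float:
    """Return the share of characters in `sentence` that are tashkeel marks."""
    if not sentence:
        return 0.0
    marks = sum(1 for ch in sentence if ch in TASHKEEL)
    return marks / len(sentence)


def extract_sentences(article: str, min_ratio: float = 0.3) -> list[dict]:
    """Split an article into sentences and keep those with ratio >= min_ratio."""
    rows = []
    for piece in SENTENCE_SPLIT.split(article):
        sentence = piece.strip()
        if not sentence:
            continue
        ratio = tashkeel_ratio(sentence)
        if ratio >= min_ratio:
            # Same two columns as the published dataset: sentence and ratio.
            rows.append({"sentence": sentence, "ratio": ratio})
    return rows
```

Because each combining mark counts as its own character, a fully vocalized word such as "كَتَبَ" (three letters, three marks) scores 0.5, so a threshold of 0.3 keeps only sentences in which most letters carry a diacritic.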

Provider:

riotu-lab
