riotu-lab/tashkeel-arabic-sentences

Source: Hugging Face; updated 2026-04-08; indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/riotu-lab/tashkeel-arabic-sentences
Resource description:
---
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: ratio
    dtype: float64
  splits:
  - name: train
    num_bytes: 132200916
    num_examples: 272856
  download_size: 58738892
  dataset_size: 132200916
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc
task_categories:
- translation
- text-generation
language:
- ar
tags:
- NLP
- Arabic_Tashkeel
- Tashkeel
- diactarization
pretty_name: Tashkeel new dataset
size_categories:
- 100K<n<1M
---

This dataset contains Arabic sentences extracted from the `ImruQays/Alukah-Arabic` dataset. Sentences were filtered based on their 'tashkeel' (Arabic diacritics) ratio, with a minimum ratio of 0.3 (adjustable during extraction).

**Source:** The original articles were sourced from the `ImruQays/Alukah-Arabic` dataset on Hugging Face.

**Processing:**

1. Articles were loaded from `ImruQays/Alukah-Arabic`.
2. Each article was split into individual sentences using a regex pattern.
3. For each sentence, the ratio of tashkeel characters to total characters was calculated.
4. Sentences with a tashkeel ratio greater than or equal to 0.3 were selected.

This dataset is intended for tasks requiring heavily diacritized Arabic text, such as text-to-speech, diacritization models, or linguistic analysis.
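The processing steps above might look roughly like the following sketch. The dataset card does not publish the extraction script, so several details here are assumptions rather than the authors' method: the set of characters counted as tashkeel (the eight core marks U+064B to U+0652), the sentence-splitting regex, whether whitespace counts toward the total, and the helper names `tashkeel_ratio` and `extract_sentences`.

```python
import re

# Assumption: "tashkeel characters" means the eight core Arabic diacritics
# (fathatan .. sukun, U+064B..U+0652). The original script may use another set.
TASHKEEL = frozenset(chr(cp) for cp in range(0x064B, 0x0653))

# Assumption: a sentence ends at '.', '!', '?' or the Arabic question mark,
# followed by whitespace. The card only says "a regex pattern".
SENTENCE_SPLIT = re.compile(r"(?<=[.!?\u061F])\s+")


def tashkeel_ratio(sentence: str) -> float:
    """Ratio of tashkeel characters to all characters (whitespace included)."""
    if not sentence:
        return 0.0
    return sum(ch in TASHKEEL for ch in sentence) / len(sentence)


def extract_sentences(article: str, min_ratio: float = 0.3) -> list[dict]:
    """Split an article into sentences and keep those with ratio >= min_ratio."""
    rows = []
    for sentence in SENTENCE_SPLIT.split(article):
        sentence = sentence.strip()
        if not sentence:
            continue
        ratio = tashkeel_ratio(sentence)
        if ratio >= min_ratio:
            # Mirrors the two dataset features: `sentence` and `ratio`.
            rows.append({"sentence": sentence, "ratio": ratio})
    return rows
```

Each returned row has the same shape as a dataset record (`sentence`, `ratio`). Raising `min_ratio` keeps only more densely diacritized sentences, which corresponds to the "adjustable during extraction" note in the description.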

Provider:
riotu-lab