JODA - A Dataset of Jordanian Dialect and Erroneous Modern Arabic Sentences coupled with Proper MSA and Full Diacritics

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://data.mendeley.com/datasets/ffrskd27f4


Description:

The Jordanian Dialect Arabic (JODA) dataset is a carefully constructed corpus designed to support advances in Arabic Natural Language Processing (NLP), particularly dialectal language processing and error correction for Modern Standard Arabic (MSA). It consists of 59,135 text sequences drawn from informal Jordanian Arabic and from formal MSA containing linguistic errors. Each input sequence is aligned with two corrected versions: one in non-diacritized MSA and one fully diacritized.

The dataset was compiled from a diverse range of sources, including public user-generated comments on social media platforms (Facebook, Instagram, YouTube, X/Twitter), transcriptions of Jordanian films, and existing Arabic dialect corpora. All entries were preprocessed to remove non-linguistic content and personally identifiable information. Sentences were then segmented into shorter units to make them easier to use in downstream machine learning applications. Manual annotation was performed by expert linguists: Jordanian dialect sentences were translated into proper MSA, while erroneous MSA inputs were edited to conform to proper spelling and grammar conventions.

Each of the 59,135 entries is a row with the following fields:

- Source – The origin of the text: social media (e.g., Facebook, YouTube, X, Instagram), transcribed Jordanian movies, or existing public Arabic dialect datasets, notably the SDC and DART corpora.
- Text – The original input sentence.
- Type – A binary classification field: 0 denotes a sentence in Jordanian dialect, 1 denotes an erroneous MSA sequence.
- Corrected Text – The corresponding sentence corrected into proper MSA, without diacritics.
- Diacritized Text – The same corrected sentence in MSA, fully annotated with diacritics.

The dataset is divided into three files:

- diacritized_train_set.xlsx (54,135 text sequences)
- diacritized_test_set.xlsx (2,500 text sequences)
- diacritized_valid_set.xlsx (2,500 text sequences)

Credit: This dataset is governed by the CC BY 4.0 license. Please cite the following publication:

G. Abandah, M. Khaleel, I. Jafar, M. Abdel-Majeed, Y. Hamdan, A. Suyyagh, A. Abdel-Karim, S. AlAwawdeh, "Jordanian Arabic to Modern Standard Arabic Translation Using a Large Model Tuned on a Purpose-Built Dataset and Synthetic Error Injection," Jordanian Journal of Computers and Information Technology (JJCIT), Accepted for publication, Jun 2025.

Credit for the diacritized version:

R. Otoum, "A Dual-Function Large Language Model for Correcting Arabic Spelling Mistakes and Adding Diacritics: Bridging Jordanian Dialect and Formal Arabic," MSc Thesis, The University of Jordan, Jun 2025.

Kindly also cite the dataset in this repository.
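The row layout and split sizes described above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the dataset: the `JodaEntry` class, the `strip_diacritics` and `group_by_type` helpers, and the snake_case attribute names are all hypothetical, and the set of code points treated as diacritics (U+064B to U+0652 plus U+0670) is an assumption about what "diacritics" covers here.

```python
import re
from dataclasses import dataclass

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 54_135, "test": 2_500, "valid": 2_500}

# Assumed diacritic set: tanween, short vowels, shadda, sukun
# (U+064B..U+0652) and the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


@dataclass
class JodaEntry:
    """One row; attribute names are hypothetical stand-ins for the columns."""
    source: str            # e.g. a social media platform or an existing corpus
    text: str              # original input sentence
    type: int              # 0 = Jordanian dialect, 1 = erroneous MSA
    corrected_text: str    # proper MSA, no diacritics
    diacritized_text: str  # proper MSA, fully diacritized


def strip_diacritics(s: str) -> str:
    """Remove the assumed set of Arabic diacritic marks from a string."""
    return _DIACRITICS.sub("", s)


def group_by_type(entries):
    """Partition entries into dialect (0) and erroneous-MSA (1) groups."""
    groups = {0: [], 1: []}
    for entry in entries:
        groups[entry.type].append(entry)
    return groups
```

The three split sizes add up to the stated total of 59,135. In practice the rows would likely be read from the .xlsx files first, for instance with `pandas.read_excel` (which needs `openpyxl` for this format); whether the spreadsheet headers match the field names exactly is not confirmed by the description.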


Created:

2025-07-02