Arabic Punctuation Dataset

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/2pkxckwgs3

Description:

This is a curated dataset designed to facilitate the study of punctuation. It has undergone rigorous manual annotation and verification on the basis of sentence structure, with sentence boundaries clearly marked. The dataset is organized in three folders:

1. The ABC component: This folder contains the manually annotated punctuation gold standard. It consists of one chapter extracted from each of 45 non-fiction books by 36 authors from 19 different fields of study, for a total of 45 text files with 149K tokens in 13K sentences.

2. The CBT component: This folder has 1085 text files in 60 sub-folders, containing the full text of complete book translations that were rendered from English into Arabic independently of this project. Their punctuation, we found, mirrors that of the English source texts; i.e., the sentence terminals in these Arabic texts follow the rules of English. The folder holds close to 3M words in more than 170K properly punctuated sentences.

3. The SSAC-UNPC component: This folder has close to 12M disconnected, disordered, complete sentences in 79 text files, totaling more than 309M words. These scrambled sentences were extracted from the predominantly legal Arabic subcorpus of the United Nations Parallel Corpus (UNPC). The punctuation is authentic: it was done by the UN translators as part of their work. We consider this an excellent punctuation corpus because it mirrors the rule-governed punctuation of the English source documents, especially in relation to sentence terminals.
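
After downloading, the file and word counts quoted in the description can be roughly sanity-checked with a short script. The following is a minimal sketch, not part of the dataset itself: the function name `corpus_stats`, the `.txt` extension, and UTF-8 encoding are all assumptions, and splitting on whitespace will not necessarily reproduce the authors' own token counts.

```python
from pathlib import Path


def corpus_stats(root):
    """Return (number of .txt files, whitespace-separated token count) under root.

    Hypothetical helper: the .txt extension and UTF-8 encoding are assumptions
    about the downloaded files, not facts stated by the dataset description.
    """
    n_files = 0
    n_tokens = 0
    # rglob walks sub-folders too, which matters for a nested layout.
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going if a file has stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        n_files += 1
        n_tokens += len(text.split())
    return n_files, n_tokens
```

Calling `corpus_stats` on the local path of one component folder (the actual folder names depend on how the archive unpacks) returns a pair that can be compared against the figures above. For the largest component, reading files one at a time as done here keeps memory use bounded by the size of the largest single file.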

Created:

2024-01-09
