
Arabic Punctuation Dataset

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/2pkxckwgs3
Description:
This is a curated dataset designed to facilitate the study of punctuation. It has undergone rigorous manual annotation and verification on the basis of sentence structure, with sentence boundaries clearly marked. The dataset is organized into three folders:
1. The ABC component: This folder holds the manually annotated punctuation gold standard. It consists of one chapter extracted from each of 45 non-fiction books by 36 authors from 19 different fields of study. It contains 45 text files with a total of 149K tokens in 13K sentences.
2. The CBT component: This folder has 1085 text files in 60 sub-folders, containing the full text of complete book translations that were rendered from English into Arabic independently of this project. We found that their punctuation mirrors that of the English source texts; that is, the sentence terminals in these Arabic texts follow the rules of English. The folder contains close to 3M words in more than 170K properly punctuated sentences.
3. The SSAC-UNPC component: This folder has close to 12M complete sentences, disconnected and in scrambled order, in 79 text files. The sentences were extracted from the predominantly legal Arabic subcorpus of the United Nations Parallel Corpus (UNPC). The punctuation here is authentic: it was done by the UN translators as part of their work. We consider this an excellent punctuation corpus because it mirrors the rule-governed punctuation of the English source documents, especially in relation to sentence terminals. These scrambled sentences total more than 309M words.
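The three-folder layout described above can be sanity-checked after download with a short script. The following is a minimal Python sketch, not part of the dataset itself: the folder names `ABC`, `CBT`, and `SSAC-UNPC`, the `.txt` extension, and the UTF-8 encoding are all assumptions, and whitespace splitting will not necessarily reproduce the token counts quoted in the description.

```python
from pathlib import Path

# Hypothetical folder names; the names in the actual download may differ.
COMPONENTS = ("ABC", "CBT", "SSAC-UNPC")


def summarize_component(folder: Path) -> dict:
    """Count .txt files and whitespace-separated tokens under one folder."""
    # rglob also descends into sub-folders (e.g. the per-book folders in CBT).
    files = sorted(folder.rglob("*.txt"))
    tokens = 0
    for path in files:
        # UTF-8 is an assumption about how the text files are encoded.
        with path.open(encoding="utf-8") as handle:
            # Read line by line so very large files are not loaded at once.
            for line in handle:
                tokens += len(line.split())
    return {"files": len(files), "tokens": tokens}


def summarize_dataset(root: Path) -> dict:
    """Summarize every component folder that exists under the root."""
    return {
        name: summarize_component(root / name)
        for name in COMPONENTS
        if (root / name).is_dir()
    }
```

Calling `summarize_dataset(Path("path/to/download"))` returns one entry per folder found, which can then be compared with the file counts given in the description (45, 1085, and 79 text files).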
Created:
2024-01-09