DASNUS
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/xdj9f55rkm
Description:
DASNUS is the first large annotated dataset of handwritten text lines for recognizing Central Kurdish (Sorani/CKB) script. It contains 11,475 segmented and annotated text-line images collected from 867 writers in the Kurdistan Region of Iraq (Sulaymaniyah, Erbil, Duhok, and Kirkuk). Each text-line image (.jpg) is accompanied by a UTF-8 encoded transcription file (.txt) with the same base filename. We initially prepared 1,250 forms (5,000 pages) for 1,250 writers, but only 867 forms (3,468 pages) were well written and complete; the rest were incompletely filled in and were discarded.
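Because each image and its transcription share a base filename, pairing them comes down to matching file stems. The following is a minimal sketch rather than part of the dataset's own tooling. It assumes the .jpg and .txt files sit side by side in the same directory (the actual archive layout may differ), and `load_pairs` is a hypothetical helper name.

```python
from pathlib import Path


def load_pairs(root):
    """Return (image_path, transcription) pairs matched by base filename.

    Assumption: each .txt file lives next to its .jpg image under `root`.
    """
    pairs = []
    for image_path in sorted(Path(root).rglob("*.jpg")):
        text_path = image_path.with_suffix(".txt")
        if not text_path.exists():
            continue  # skip images that have no transcription file
        # The description states that transcriptions are UTF-8 encoded.
        text = text_path.read_text(encoding="utf-8").strip()
        pairs.append((image_path, text))
    return pairs
```

Reading with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters for Arabic-script text.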
Writers filled in structured four-page forms. Page 1 collected demographic information, page 2 contained a fixed printed Central Kurdish paragraph to copy, page 3 contained two paragraphs chosen at random from a pool of 2,500 unique Kurdish texts, and page 4 was a free-writing section with both lined and unlined areas. We scanned the completed forms at 600 DPI with an Epson WorkForce Pro WF-C5890 scanner. We extracted text lines from pages 2–4 using OpenCV-based morphological closing and contour detection, and corrected skew automatically using projection profiling. Unsuitable or incomplete text lines were discarded at this stage.
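To illustrate the idea behind projection-profile skew correction, here is a minimal sketch in pure Python. It is not the authors' OpenCV pipeline: it works on a list of ink-pixel coordinates instead of an image, and the function names, the ±5° search range, and the 0.5° step are all assumptions chosen for illustration. The principle is that when text lines are horizontal, ink concentrates in few rows, so the row histogram is at its sharpest.

```python
import math


def profile_score(points, angle_deg):
    """Sharpness of the horizontal projection profile after rotation.

    `points` are (x, y) coordinates of ink pixels. Rows densely packed
    with ink give a larger sum of squared row counts.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    counts = {}
    for x, y in points:
        row = round(y * cos_t - x * sin_t)  # y-coordinate after rotation
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest profile."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))


# Synthetic example: three parallel strokes tilted by two degrees.
slope = math.tan(math.radians(2.0))
demo_points = [(x, round(x * slope) + offset)
               for offset in (0, 30, 60) for x in range(200)]
demo_angle = estimate_skew(demo_points)
```

On a real scan, the ink coordinates would come from a binarized image, and the page would then be rotated by the estimated angle before line extraction.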
In each filename, the first number is the writer ID, the second is the source page (2 = fixed paragraph, 3 = random paragraphs, 4 = free writing), and the remaining numbers identify the line and sub-segment within that page. Page 2 contributes the most lines (6,002), followed by page 3 (3,405) and page 4 (2,068).
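The naming scheme can be decoded with a small parser. The sketch below is an illustration under assumptions: the description does not state which separator the filenames use, so the code simply extracts every run of digits, and the example name `0012_3_05_1.jpg` is hypothetical. `parse_line_name` is likewise a made-up helper name.

```python
import re
from pathlib import Path

# Page codes as given in the dataset description.
PAGE_KINDS = {2: "fixed paragraph", 3: "random paragraphs", 4: "free writing"}


def parse_line_name(filename):
    """Split a text-line filename into writer ID, source page, and the rest."""
    # Assumption: fields are digit runs separated by any non-digit characters.
    numbers = [int(n) for n in re.findall(r"\d+", Path(filename).stem)]
    if len(numbers) < 2:
        raise ValueError(f"expected at least two numbers in {filename!r}")
    writer_id, page = numbers[0], numbers[1]
    return {
        "writer_id": writer_id,
        "page": page,
        "page_kind": PAGE_KINDS.get(page, "unknown"),
        # Remaining numbers: line and sub-segment within the page.
        "line_and_subsegment": tuple(numbers[2:]),
    }


info = parse_line_name("0012_3_05_1.jpg")  # hypothetical filename
```

Grouping the parsed records by `writer_id` is a natural starting point for writer identification experiments, and grouping by `page` separates copied text from free writing.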
Native Central Kurdish speakers produced the transcriptions. To verify them, a random sample of 500 sentences was annotated twice; inter-annotator agreement was 99.1% at the character level and 97.6% at the word level. A chief annotator resolved any disagreements.
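The description does not say how agreement was computed. One common way to measure it, shown here purely as an assumption, is one minus the edit distance between the two annotations divided by the length of the longer one, computed over characters or over whitespace-separated words. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def agreement(text_a, text_b, level="char"):
    """Agreement in [0, 1] between two annotations of the same line.

    Assumption: 1 - edit_distance / length of the longer annotation.
    """
    units_a = list(text_a) if level == "char" else text_a.split()
    units_b = list(text_b) if level == "char" else text_b.split()
    longest = max(len(units_a), len(units_b))
    if longest == 0:
        return 1.0  # two empty annotations agree trivially
    return 1.0 - edit_distance(units_a, units_b) / longest
```

Under this definition, word-level agreement is never higher than expected from character-level agreement on typical text, since a single differing character makes the whole word count as a disagreement. That is consistent with the word-level figure being the lower of the two reported values.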
The dataset is divided into three partitions: training (606 writers, 8,042 lines), validation (131 writers, 1,700 lines), and test (130 writers, 1,733 lines). Each partition has its own set of writers. The training vocabulary contains 10,702 unique words, and the cumulative vocabulary across all splits contains 13,817. The out-of-vocabulary (OOV) rate is about 47% in both the validation and test sets, which reflects the rich morphology of Central Kurdish.
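An OOV rate can be measured over unique words (types) or over running words (tokens), and the description does not specify which one the 47% figure refers to. The sketch below computes both, assuming whitespace tokenization; `oov_rates` is a hypothetical helper, and the toy strings stand in for real transcriptions.

```python
def oov_rates(train_lines, eval_lines):
    """Return (type-level, token-level) OOV rates of eval_lines.

    Assumption: words are whitespace-separated, with no normalization.
    """
    train_vocab = {word for line in train_lines for word in line.split()}
    eval_tokens = [word for line in eval_lines for word in line.split()]
    if not eval_tokens:
        return 0.0, 0.0
    eval_types = set(eval_tokens)
    # Share of distinct evaluation words never seen in training.
    type_rate = len(eval_types - train_vocab) / len(eval_types)
    # Share of running evaluation words never seen in training.
    token_rate = sum(w not in train_vocab for w in eval_tokens) / len(eval_tokens)
    return type_rate, token_rate


type_rate, token_rate = oov_rates(["a b c"], ["a d d", "b e"])
```

Because the partitions use separate writers and pages 3 and 4 contain varied text, a high OOV rate is plausible. It also suggests that character-level or subword-level recognition may generalize better than a closed word vocabulary.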
The dataset comes with two metadata files: DASNUS-TEXT-LINES.xlsx, which lists the writer ID, demographic information (age group, address, education level, gender, and handedness), and train/validation/test split assignment for each text-line image; and DASNUS-DEMOGRAPHIC-INFORMATION.xlsx, which gives the demographic profile of all 867 writers.
DASNUS can be used for research on handwritten text recognition (HTR), writer identification, and script analysis for low-resource languages.
Created:
2026-03-10



