Data for Optical Character Recognition Applied to Hieratic: Sign Identification and Broad Analysis

Source: DataONE. Updated 2023-06-03. Indexed 2024-06-08.

Download link:

https://search.dataone.org/view/sha256:c75f0f09ebffb380c4fcf24fdfc1aa5a52037289cf800650a0ed05b7964f19f5

Description:

This data consists of a number of .zip files containing everything needed to run the hieratic optical character recognition program presented at https://github.com/jtabin/PaPYrus. The files included are as follows:

1. "Dataset By Sign": This is all 13,134 data set images, categorized in folders by their Gardiner sign. Each image is a black and white .png image of a hieratic sign. The signs are labeled with unique identifiers, corresponding in order to their placement in a text from the 1st (0001) to the 9999th (9999), facsimile maker (1 for Möller, 2 for Poe, 3 for Tabin), provenance (1: Thebes, 2: Lahun, 3: Hatnub, 4: Unknown), and original text (1: Shipwrecked Sailor, 2: Eloquent Peasant B1, 3: Eloquent Peasant R, 4: Sinuhe B, 5: Sinuhe R, 6: Papyrus Prisse, 7: Hymn to Senwosret III, 8: Lahun Temple Files, 9: Will of Wah, 10: Texte aus Hatnub, 11: Papyrus Ebers, 12: Rhind Papyrus, 13: Papyrus Westcar).
2. "Dataset Categorized": This is every data set image, as above, categorized in folders by their provenance, text, and facsimile maker (i.e. where the tags originate from).
3. "Dataset Whole": This is every data set image in one folder. This is what is used for the analyses done by the OCR program.
4. "Precalculated Data Set Stats": This is a collection of .csv files outputted by the "Data Set Prep.ipynb" code (code found on the aforementioned GitHub page). "pxls_16.csv", "pxls_20.csv", and "pxls_25.csv" are the pixel values for every sign in the data set, after they were resized to 16, 20, and 25 pixels, respectively. "datasetstats.csv" includes the aspect ratios and sign names for every sign in the data set. The two files beginning with "A1cut" are the same stats, but after every A1 sign had its tail manually cut off.
5. "Precalculated OCR Results": This is a collection of .csv files outputted by the "Image Identification.ipynb" code (also found on the GitHub page). The files are mostly the product of all of one sign from the data set being run through the OCR program and they are labeled with the name of the sign. These result in columns of signs and their similarity scores when compared to other signs. Some files, such as "randsamp_fullresults.csv", come from other analyses explained in their file names (that file, for instance, is a random sample from the data set).
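The numeric codes and the per-sign folder layout described above can be handled with a few lines of Python. The following is a minimal sketch, not code from the PaPYrus repository: the code tables are copied from the description, while the function names `decode_codes` and `count_images_by_sign` are hypothetical, as is the assumption that the unzipped archive contains one subfolder per Gardiner sign directly under its root. The exact layout of the identifier inside each file name is not given in the description, so the sketch takes the three codes as separate integers rather than parsing file names.

```python
from collections import Counter
from pathlib import Path

# Code tables copied from the dataset description.
FACSIMILE_MAKERS = {1: "Möller", 2: "Poe", 3: "Tabin"}
PROVENANCES = {1: "Thebes", 2: "Lahun", 3: "Hatnub", 4: "Unknown"}
TEXTS = {
    1: "Shipwrecked Sailor",
    2: "Eloquent Peasant B1",
    3: "Eloquent Peasant R",
    4: "Sinuhe B",
    5: "Sinuhe R",
    6: "Papyrus Prisse",
    7: "Hymn to Senwosret III",
    8: "Lahun Temple Files",
    9: "Will of Wah",
    10: "Texte aus Hatnub",
    11: "Papyrus Ebers",
    12: "Rhind Papyrus",
    13: "Papyrus Westcar",
}


def decode_codes(maker: int, provenance: int, text: int) -> dict:
    """Translate the three numeric codes into readable labels.

    Hypothetical helper: how the codes are packed into a file name
    is not specified in the description, so they are passed in
    as already-separated integers.
    """
    return {
        "facsimile_maker": FACSIMILE_MAKERS[maker],
        "provenance": PROVENANCES[provenance],
        "text": TEXTS[text],
    }


def count_images_by_sign(root) -> Counter:
    """Count .png files in each immediate subfolder of `root`.

    Assumes (not confirmed by the description) that each subfolder
    is named after one Gardiner sign and holds that sign's images.
    """
    counts = Counter()
    for folder in Path(root).iterdir():
        if folder.is_dir():
            counts[folder.name] = sum(1 for _ in folder.glob("*.png"))
    return counts
```

If the "Dataset By Sign" archive unzips as assumed, `sum(count_images_by_sign(path).values())` should match the 13,134 images stated in the description, which makes for a quick integrity check after download.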

Created:

2023-11-08
