UJI Pen Characters (Version 2) Data Set
Payititi, indexed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-26296.html
Resource description:
F. Prat(*), M. J. Castro(+), D. Llorens(*), A. Marzal(*), and J. M. Vilar(*)
(*) Departamento de Lenguajes y Sistemas Informáticos, Universitat Jaume I (UJI), 12071 Castellón, SPAIN
(+) Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia (UPV), 46071 Valencia, SPAIN
fprat '@' lsi.uji.es
December 2008

Data Set Information:

We have created the UJIpenchars2 character database by collecting samples from 60 writers at two different sites in two phases:

Each writer contributed letters, digits, and other characters, and two samples were collected for each writer/character pair. The complete lexicon is as follows:

So the total number of samples in this database is 11640: 60 writers x (66+10+21) characters x 2 repetitions.

UJIpenchars is a subset of UJIpenchars2 with only 1364 samples: the ASCII letters and digits collected at UJI during the 1st acquisition phase.

We have not defined a standard task for UJIpenchars2, but we have divided the writer set into two disjoint subsets in order to ease the definition of writer-independent tasks:

The distribution of our database consists of 2 files:

The handwriting samples were collected on a Toshiba Portégé M400 Tablet PC using its cordless stylus. Each of the 60 writers completed 2 non-consecutive sessions. In each session, the writer was asked to write one exemplar of each character in the lexicon. The acquisition program shows a set of boxes on the screen, one for each required character, and writers are told to write only inside those boxes. Each acquisition box is approximately 13.6 millimetres wide and 20.4 millimetres tall and contains two horizontal guides at approximate distances of 7.5 and 12.7 millimetres from the top, respectively. Writers were instructed to clear the content of the corresponding box by using an on-screen button and to try again whenever they made a mistake or were unhappy with the writing of any character.
Subjects were monitored only when writing their first exemplars, and every sample considered OK by its writer was accepted, even if some of its points lay outside the corresponding acquisition box.

Only X and Y coordinate information was recorded along the strokes by the acquisition program, without, for instance, pressure level values or timing information. Thus, in multi-stroke samples, no information at all was recorded between strokes. Both coordinates were expressed as integer ink units, with the origin lying at the top left corner of the corresponding acquisition box. X values grow left-to-right and Y values grow downwards.

Although we employed the same acquisition program on identical hardware at UJI and UPV, the acquisition files suggest that UPV samples were collected using acquisition boxes larger than the UJI ones. This is due to a different configuration parameter value that, at UPV, makes the acquisition program translate 1 millimetre into 152 ink units, instead of using the standard UJI ratio of 100 ink units per millimetre. If box homogenisation is needed, it can be easily achieved, for instance, by dividing UPV coordinate values by 1.52.

We have also observed that runs of consecutive points with identical coordinates were frequently acquired inside strokes; such runs were preserved in this database, so it is up to its users to decide whether or not to remove them in an appropriate preprocessing step.

Although it is a paper mainly devoted to UJIpenchars, D. Llorens et al.: 'The UJIpenchars Database: A Pen-based Database of Isolated Handwritten Characters', Proc. of the 6th International Conference on Language Resources and Evaluation, 2008, contains useful information about UJIpenchars2. It can be found in [Web link].

Attribute Information:

The file 'ujipenchars2.txt' is a plain-text file with a simple format in which all database samples are represented. Because some non-ASCII characters are needed, UTF-8 encoding is employed.
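The two preprocessing choices mentioned above (homogenising UPV boxes and dealing with runs of repeated points) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a stroke is taken to be a list of (x, y) integer pairs already read from the file, and the function names and the 'UJI'/'UPV' site labels are hypothetical, not part of the database distribution.

```python
from itertools import groupby

# Ratios stated in the data set description.
UJI_UNITS_PER_MM = 100.0
UPV_UNITS_PER_MM = 152.0


def drop_repeated_points(stroke):
    """Collapse each run of consecutive identical points into one point."""
    return [point for point, _run in groupby(stroke)]


def to_uji_units(stroke, site):
    """Rescale a stroke to the UJI ratio of 100 ink units per millimetre.

    `site` is a hypothetical label, 'UJI' or 'UPV', supplied by the caller.
    For UPV strokes this is equivalent to dividing coordinates by 1.52.
    """
    if site == "UPV":
        return [
            (x * UJI_UNITS_PER_MM / UPV_UNITS_PER_MM,
             y * UJI_UNITS_PER_MM / UPV_UNITS_PER_MM)
            for x, y in stroke
        ]
    return [(float(x), float(y)) for x, y in stroke]


# Made-up stroke, not taken from the database.
upv_stroke = [(152, 304), (152, 304), (152, 304), (304, 456)]
cleaned = drop_repeated_points(upv_stroke)
rescaled = to_uji_units(cleaned, "UPV")
```

Removing repeated points before rescaling keeps the equality test on the original integer coordinates, so it is not affected by floating-point rounding. Whether repeated points should be removed at all depends on the task, as the description leaves that decision to the user.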
In order to describe how attributes are represented in 'ujipenchars2.txt', it is worth explaining the general syntax of the file first. From a high-level point of view, this file is composed of comment lines and sample representations.

A comment line is one beginning with two slashes. In 'ujipenchars2.txt', we have employed comment lines for two purposes:

A sample representation is composed of a header line followed by the representation of its *sequence of strokes*, where the header line consists of three blank-separated elements: the word 'WORD', the representation of the *character identity*, and the *session identifier*. For instance, a semicolon sample representation may look like this:

A detailed description of how information about each attribute is represented in 'ujipenchars2.txt' follows:
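As a rough illustration of the general syntax described above, the following Python sketch separates comment lines from sample header lines. It relies only on the two rules given in the text (comments begin with two slashes; a header is 'WORD', a character identity, and a session identifier separated by blanks). The function name, the tolerance for leading whitespace, and the session identifier in the example are assumptions, and stroke lines are passed through unparsed because their layout is not covered here.

```python
def classify_line(line):
    """Classify one line of 'ujipenchars2.txt'.

    Returns ('comment', text), ('header', (character, session_id)),
    or ('other', text) for anything else, such as stroke data.
    Ignoring leading whitespace is an assumption, not a documented rule.
    """
    stripped = line.strip()
    if stripped.startswith("//"):
        return ("comment", stripped[2:].strip())
    parts = stripped.split()
    if len(parts) == 3 and parts[0] == "WORD":
        return ("header", (parts[1], parts[2]))
    return ("other", stripped)


def iter_headers(lines):
    """Yield (character, session_id) for every header line found."""
    for line in lines:
        kind, value = classify_line(line)
        if kind == "header":
            yield value


# Made-up lines; 'example_session' is a placeholder identifier.
sample_lines = [
    "// an illustrative comment line",
    "WORD ; example_session",
]
headers = list(iter_headers(sample_lines))
```

With the real file, the lines would come from `open('ujipenchars2.txt', encoding='utf-8')`, since the file is UTF-8 encoded.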
Provided by:
Payititi



