
Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954

Source: ICPSR; updated 2023-01-01; indexed 2026-04-16
Download link:
https://www.icpsr.umich.edu/web/ICPSR/studies/38906
Description:
Background

This study contains the digitized data originally stored on 3"x5" index cards archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.

Digitizing and Parsing

The project team transformed the card images into digital text through optical character recognition (OCR). After experimenting with multiple OCR engines, the team implemented two parallel workflows, LayoutParser and Python-tesseract, each using Tesseract as its OCR engine. The output of both workflows was parsed into tabular datasets using regular expressions. For more information on the digitization and parsing processes, please refer to the project team's article. The combined output of those processes is presented as Dataset 1. Users should note that, although the project team took steps to find the most accurate OCR processes for this study, OCR is not perfect: these data contain errors when compared against the original index cards.

Cleaning and Geographic Standardization

The project team was most interested in the name, city, and state fields in the OCR output. With this in mind, the team created a working dataset comprising only those fields. The original images included images of the reverse sides of index cards when pencil notations were present; these records were removed from the working dataset. The index also included blue-colored cards that referred to other cards; these reference cards were also removed from the working dataset. Removing these two types of records left 24,589 mortgage records in the dataset. Several steps were then taken to prepare the name, city, and state fields for future analysis.
The name fields were parsed to separate middle names/initials and suffixes (Jr., III, etc.) from first names. The state field was standardized to the two-letter United States Postal Service state codes, which were also translated to their corresponding two-digit Federal Information Processing Standards (FIPS) codes. Using the standardized states, the team attempted to standardize each record's city to the United States Census Bureau's list of places. City names were first matched deterministically against the Census Bureau's list; probabilistic matching was then used for the records that remained unmatched. Because probabilistic matching is inexact, some records may have been assigned the wrong place name or city FIPS code. The result of this cleaning and geographic standardization is presented in Dataset 2.

The project team created a truth deck of 1,000 records, hand-keyed from the original images. Each truth record contains the last name of the mortgagor(s), the name of the first mortgagor (whatever combination of first, middle, and suffix appears on the card), the name of the second mortgagor if applicable, the city, and the state. These hand-keyed records were then parsed and geographically standardized in the same manner described above. This truth dataset is presented in Dataset 3.

Dataset 4 is a combination of Datasets 2 and 3, with the truth records replacing the corresponding 1,000 records of Dataset 2.
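
The cleaning steps described above (splitting names, standardizing states, and matching cities first exactly and then approximately) can be sketched roughly as follows. This is an illustrative sketch only, not the project team's code: the function names, the tiny lookup tables, and the 0.8 similarity cutoff are all assumptions, and the approximate step uses simple string similarity from Python's standard `difflib` module as a stand-in for the probabilistic matching the team actually used.

```python
import difflib
import re

# Hypothetical, heavily truncated lookup tables. A real run would cover every
# state and load the full Census Bureau place list.
STATE_TO_USPS = {"TEXAS": "TX", "TEX": "TX", "OHIO": "OH", "CALIFORNIA": "CA", "CALIF": "CA"}
USPS_TO_FIPS = {"TX": "48", "OH": "39", "CA": "06"}
PLACES = {
    "TX": ["Houston", "Dallas", "San Antonio"],
    "OH": ["Cleveland", "Columbus", "Cincinnati"],
    "CA": ["Los Angeles", "San Diego", "Sacramento"],
}
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def parse_given_name(raw):
    """Split a given-name string into first, middle, and suffix parts."""
    tokens = raw.replace(",", " ").split()
    suffix = ""
    # Treat a trailing token such as "Jr." or "III" as a suffix.
    if tokens and tokens[-1].rstrip(".").upper() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])
    return {"first": first, "middle": middle, "suffix": suffix}


def standardize_state(raw):
    """Return (USPS code, FIPS code), or (None, None) if unrecognized."""
    key = re.sub(r"[^A-Za-z ]", "", raw).strip().upper()
    usps = key if key in USPS_TO_FIPS else STATE_TO_USPS.get(key)
    return (usps, USPS_TO_FIPS[usps]) if usps else (None, None)


def match_city(raw, usps, cutoff=0.8):
    """Try an exact match first, then fall back to string similarity.

    Returns (place name or None, method), where method is "deterministic",
    "fuzzy", or "unmatched".
    """
    by_key = {place.casefold(): place for place in PLACES.get(usps, [])}
    key = " ".join(raw.split()).casefold()
    if key in by_key:
        return by_key[key], "deterministic"
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    if close:
        return by_key[close[0]], "fuzzy"
    return None, "unmatched"
```

Recording the match method alongside the matched place, as this sketch does, makes it easy to flag the approximately matched records, which are the ones most likely to carry a wrong place name or FIPS code.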
Provider:
Inter-university Consortium for Political and Social Research
Created:
2023-01-01