AutoMeta-ETD500
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://doi.org/10.7910/DVN/18D6AZ
Description:
AutoMeta-ETD500 contains 500 scanned Electronic Theses and Dissertations (ETDs). The dataset was used to develop AutoMeta, a framework that automatically extracts seven key metadata fields that are ubiquitous in ETDs (including title, author, advisor, university, department, and year). For this task, the dataset is split into the following seven intermediate datasets:
a) PDF.zip: 500 ETD samples from US and non-US universities.
b) XML_JSON.zip: metadata for 100 ETDs, downloaded from the MIT and Virginia Tech ETD library repositories.
c) HTML.zip: metadata for the remaining 400 ETDs, downloaded from ProQuest.
d) Tiff.zip: TIFF images of the cover pages of the 500 scanned ETDs.
e) noisy.zip: the noisy text for the 500 ETD samples, generated by Tesseract OCR.
f) clean.zip: the clean text for the 500 ETD samples, manually corrected from the noisy data.
g) annotated.zip: the annotated data in XML. Annotation was done with the GATE annotation tool.
Created:
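As a quick sanity check after downloading, the archives listed in the description can be inspected with a short script. The following is a minimal sketch, not part of the dataset's own tooling: it assumes the seven zip files were saved under their listed names into one local directory, and the function name `summarize_archives` is hypothetical. It makes no assumption about the file layout inside each archive; it only counts file entries.

```python
import zipfile
from pathlib import Path

# Archive names exactly as listed in the dataset description.
EXPECTED_ARCHIVES = [
    "PDF.zip", "XML_JSON.zip", "HTML.zip", "Tiff.zip",
    "noisy.zip", "clean.zip", "annotated.zip",
]


def summarize_archives(download_dir):
    """Map each expected archive name to its file count, or None if missing."""
    summary = {}
    for name in EXPECTED_ARCHIVES:
        path = Path(download_dir) / name
        if not path.is_file():
            # Archive was not downloaded (or was saved under another name).
            summary[name] = None
            continue
        with zipfile.ZipFile(path) as zf:
            # Count regular file entries, skipping directory entries.
            summary[name] = sum(
                1 for info in zf.infolist() if not info.is_dir()
            )
    return summary


if __name__ == "__main__":
    # "AutoMeta-ETD500" is a placeholder for the local download directory.
    for name, count in summarize_archives("AutoMeta-ETD500").items():
        status = "missing" if count is None else f"{count} files"
        print(f"{name}: {status}")
```

A count that differs from the figures in the description (for example, 500 for PDF.zip) is not necessarily an error, since an archive may hold several files per ETD or extra bookkeeping files.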
2023-08-09



