AutoMeta-ETD500

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://doi.org/10.7910/DVN/18D6AZ
Description:
AutoMeta-ETD500 contains 500 scanned Electronic Theses and Dissertations (ETDs). The dataset was used to develop AutoMeta, a framework that automatically extracts seven key metadata fields common across ETDs (e.g., title, author, advisor, university, department, and year). For this task, the dataset is split into the following seven intermediate datasets:
a) PDF.zip: 500 ETD samples from different US and non-US universities.
b) XML_JSON.zip: metadata for 100 ETDs, downloaded from the MIT and Virginia Tech ETD library repositories.
c) HTML.zip: metadata for the remaining 400 ETDs, downloaded from ProQuest.
d) Tiff.zip: TIFF images of the cover pages of the 500 scanned ETDs.
e) noisy.zip: the noisy text for the 500 ETD samples, generated by Tesseract OCR.
f) clean.zip: the clean text for the 500 ETD samples, manually corrected from the noisy data.
g) annotated.zip: all annotated data in XML. Annotation was done with the GATE annotation tool.
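After downloading, it can be useful to confirm that all seven archives are present and roughly match the sample counts given in the description. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the archives sit together in one local directory (the name `AutoMeta-ETD500` is hypothetical) and that each sample corresponds to one file inside its archive, which the description does not state. No count is given for annotated.zip, so it is left as `None`.

```python
import zipfile
from pathlib import Path

# Archive names and sample counts as stated in the dataset description.
# None means the description gives no count for that archive.
# Assumption: one file per sample inside each archive.
EXPECTED = {
    "PDF.zip": 500,
    "XML_JSON.zip": 100,
    "HTML.zip": 400,
    "Tiff.zip": 500,
    "noisy.zip": 500,
    "clean.zip": 500,
    "annotated.zip": None,
}


def count_files(archive_path):
    """Count regular file entries (directories excluded) in a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())


def check_archives(data_dir):
    """Map each archive name to (files found, or None if missing; expected count)."""
    report = {}
    for name, expected in EXPECTED.items():
        path = Path(data_dir) / name
        found = count_files(path) if path.is_file() else None
        report[name] = (found, expected)
    return report


if __name__ == "__main__":
    # Hypothetical local download directory.
    for name, (found, expected) in check_archives("AutoMeta-ETD500").items():
        print(f"{name}: found={found}, expected={expected}")
```

A mismatch between found and expected counts does not necessarily indicate a broken download, since an archive may hold extra files or nest several files per sample.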
Created:
2023-08-09