A biodiversity dataset graph: Biodiversity Heritage Library (BHL)
Zenodo | Updated 2020-07-29 | Indexed 2026-05-28
Download link:
https://zenodo.org/record/3251133
Description:
Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Biodiversity Heritage Library (BHL, https://biodiversitylibrary.org). "The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community." - https://biodiversitylibrary.org , June 2019. However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of all OCR item texts (e.g., individual items) available through BHL as tracked by Preston (https://github.com/bio-guoda/preston , https://doi.org/10.5281/zenodo.1410543 ) over the period May - June 2019.

This snapshot contains about 120GB of uncompressed OCR texts across 227k OCR BHL items. Also, a snapshot of the BHL item catalog at https://www.biodiversitylibrary.org/data/item.txt is included.

The archive consists of 256 individual parts (e.g., preston-00.tar.gz, preston-01.tar.gz, ...) to allow for parallel file downloads. The archive contains three types of files: index files, provenance files and data files. Only two index files and two provenance files are included, and each has also been individually included in this dataset publication. Index files provide a way to link provenance files in time to establish a versioning mechanism. Provenance files describe how, when and where the BHL OCR text items were retrieved. For more information, please visit https://preston.guoda.bio or https://doi.org/10.5281/zenodo.1410543.

To retrieve and verify the downloaded BHL biodiversity dataset graph, first concatenate all the downloaded preston-*.tar.gz files (e.g., cat preston-*.tar.gz > preston.tar.gz). Then, extract the archives into a "data" folder.
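The concatenate-then-extract step can also be scripted. The following is a minimal Python sketch, not part of the original publication. The function names concatenate_parts and extract_combined are hypothetical. It assumes that each preston-*.tar.gz part is an independent gzip-compressed tar archive, so the concatenated file is a sequence of tar archives and the reader has to skip the end-of-archive markers between them (ignore_zeros=True, comparable to GNU tar's --ignore-zeros option). Whether the archive members already carry a "data/" path prefix is not stated in the description, so the destination directory is left to the caller.

```python
import shutil
import tarfile
from pathlib import Path


def concatenate_parts(parts_dir, combined_path, pattern="preston-*.tar.gz"):
    """Byte-wise concatenate all parts in sorted name order,
    like `cat preston-*.tar.gz > preston.tar.gz`."""
    combined = Path(combined_path)
    # Exclude the output file itself in case its name matches the pattern.
    parts = sorted(
        p for p in Path(parts_dir).glob(pattern)
        if p.resolve() != combined.resolve()
    )
    with open(combined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


def extract_combined(combined_path, dest_dir):
    """Extract a concatenation of tar.gz archives into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # ASSUMPTION: the combined file holds several tar archives back to back.
    # ignore_zeros=True makes tarfile read past each end-of-archive marker.
    with tarfile.open(combined_path, mode="r:gz", ignore_zeros=True) as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction filter, available in recent Python versions.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
```

A call such as concatenate_parts(".", "preston.tar.gz") followed by extract_combined("preston.tar.gz", ".") would mirror the manual steps; with about 120GB of uncompressed text, both steps need sufficient disk space and time.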
After that, verify the index of the archive by reproducing the following result:

$ java -jar preston.jar history
<0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/pav/hasVersion> <hash://sha256/89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a> .
<hash://sha256/41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4> <http://purl.org/pav/previousVersion> <hash://sha256/89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a> .

To check the integrity of the extracted archive, confirm that each line produced by the command "preston verify" looks like the lines shown below, with each line including "CONTENT_PRESENT_VALID_HASH". Depending on hardware capacity, this may take a while.

$ java -jar preston.jar verify
hash://sha256/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca file:/home/preston/preston-bhl/data/e0/c1/e0c131ebf6ad2dce71ab9a10aa116dcedb219ae4539f9e5bf0e57b84f51f22ca OK CONTENT_PRESENT_VALID_HASH 49458087
hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 file:/home/preston/preston-bhl/data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99 OK CONTENT_PRESENT_VALID_HASH 25745
hash://sha256/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c file:/home/preston/preston-bhl/data/85/ef/85efeb84c1b9f5f45c7a106dd1b5de43a31b3248a211675441ff584a7154b61c OK CONTENT_PRESENT_VALID_HASH 519892

Note that a copy of the Java program "preston", preston.jar, is included in this publication. The program runs on a Java 8+ virtual machine using "java -jar preston.jar", or in short "preston".

Files in this data publication:

README - this file
preston-[00-ff].tar.gz - preston archives containing BHL OCR item texts, their provenance and a provenance index.
9e8c86243df39dd4fe82a3f814710eccf73aa9291d050415408e346fa2b09e70 - preston index file
2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a - preston index file
89926f33157c0ef057b6de73f6c8be0060353887b47db251bfd28222f2fd801a - preston provenance file
41b19aa9456fc709de1d09d7a59c87253bc1f86b68289024b7320cef78b3e3a4 - preston provenance file

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
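As a complement to the "preston verify" step described above, the checks can be spot-tested independently. The following Python sketch is not part of the original publication, and the helper names expected_path, failing_lines and sha256_matches are hypothetical. It relies on two things read off the sample output: a file for hash://sha256/<hex> sits at data/<first two hex characters>/<next two>/<hex>, and a healthy line carries the token "CONTENT_PRESENT_VALID_HASH". It further assumes that a data file's name is the SHA-256 digest of its content; the description does not say whether this holds for every file type in the archive (for example, index files), so "preston verify" remains the authoritative check.

```python
import hashlib
from pathlib import Path

VALID = "CONTENT_PRESENT_VALID_HASH"
PREFIX = "hash://sha256/"


def expected_path(hash_uri, data_dir="data"):
    """Map hash://sha256/<hex> to <data_dir>/<hex[0:2]>/<hex[2:4]>/<hex>,
    the layout visible in the sample verify output."""
    if not hash_uri.startswith(PREFIX):
        raise ValueError("not a sha256 hash URI: " + hash_uri)
    hex_digest = hash_uri[len(PREFIX):]
    return Path(data_dir) / hex_digest[0:2] / hex_digest[2:4] / hex_digest


def failing_lines(verify_output):
    """Return the non-empty lines that lack the expected status token."""
    return [
        line for line in verify_output.splitlines()
        if line.strip() and VALID not in line.split()
    ]


def sha256_matches(path, hash_uri):
    """Recompute the SHA-256 of a file and compare it with the hash URI.
    ASSUMPTION: the file is stored under the hash of its own content."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large OCR files do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return PREFIX + digest.hexdigest() == hash_uri
```

For instance, expected_path("hash://sha256/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99") gives data/1a/57/1a57e55a780b86cff38697cf1b857751ab7b389973d35113564fe5a9a58d6a99, which matches the second line of the sample output. If the output of "java -jar preston.jar verify" is saved to a text file, passing its content to failing_lines should return an empty list for an intact archive.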
Provider:
Zenodo
Created:
2019-06-21



