Codebase for Wikidata as Gazetteer: An Open Geocoding Pipeline for Textual Corpora in the Humanities
收藏NIAID Data Ecosystem2026-05-10 收录
下载链接:
https://doi.org/10.7910/DVN/NNGFJC
下载链接
链接失效反馈官方服务:
资源简介:
This repository contains the scripts required to implement the Wikidata-based geocoding pipeline described in the accompanying paper. geocode.sh : Shell script for setting up and executing Stanford CoreNLP with the required language models and entitylink annotator. Automates preprocessing, named entity recognition (NER), and wikification across a directory of plain-text (.txt) files. Configured for both local execution and high-performance computing (HPC) environments. geocode.py : Python script that processes the list of extracted location entities (entities.txt) and retrieves latitude/longitude coordinates from Wikidata using Pywikibot. Handles redirects, missing pages, and missing coordinate values, returning standardized placeholder codes where necessary. Outputs results as a CSV file with columns for place name, latitude, longitude, and source file. geocode.sbatch : Optional SLURM submission script for running run_corenlp.sh on HPC clusters. Includes configurable resource requests for scalable processing of large corpora. README.md : Detailed README file including a line-by-line explanation of the geocode.sh file. Together, these files provide a reproducible workflow for geocoding textual corpora via wikification, suitable for projects ranging from small-scale literary analysis to large-scale archival datasets.
创建时间:
2025-10-07



