Codebase for Wikidata as Gazetteer: An Open Geocoding Pipeline for Textual Corpora in the Humanities

NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://doi.org/10.7910/DVN/NNGFJC

Resource description:

This repository contains the scripts required to implement the Wikidata-based geocoding pipeline described in the accompanying paper.

geocode.sh: Shell script for setting up and executing Stanford CoreNLP with the required language models and the entitylink annotator. Automates preprocessing, named entity recognition (NER), and wikification across a directory of plain-text (.txt) files. Configured for both local execution and high-performance computing (HPC) environments.

geocode.py: Python script that processes the list of extracted location entities (entities.txt) and retrieves latitude/longitude coordinates from Wikidata using Pywikibot. Handles redirects, missing pages, and missing coordinate values, returning standardized placeholder codes where necessary. Outputs results as a CSV file with columns for place name, latitude, longitude, and source file.

geocode.sbatch: Optional SLURM submission script for running run_corenlp.sh on HPC clusters. Includes configurable resource requests for scalable processing of large corpora.

README.md: Detailed README file including a line-by-line explanation of the geocode.sh file.

Together, these files provide a reproducible workflow for geocoding textual corpora via wikification, suitable for projects ranging from small-scale literary analysis to large-scale archival datasets.
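The coordinate-lookup and CSV-output step described for geocode.py can be illustrated with a minimal sketch. It is not the repository's code: it works on an already-fetched Wikidata entity in the standard Wikibase JSON layout rather than calling Pywikibot, it assumes redirects have been resolved before the entity is passed in, and the placeholder codes (NO_PAGE, NO_COORDS), function names, column headers, and sample values are all hypothetical.

```python
import csv
import io

# Hypothetical placeholder codes; the codes used by geocode.py are not stated here.
NO_PAGE = "NO_PAGE"
NO_COORDS = "NO_COORDS"


def extract_coordinates(entity):
    """Return (latitude, longitude) from a Wikidata entity dict, or placeholders."""
    # The Wikidata API marks unknown entity IDs with a "missing" key.
    if entity is None or "missing" in entity:
        return NO_PAGE, NO_PAGE
    # P625 is the Wikidata property "coordinate location".
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        # "novalue" and "somevalue" snaks carry no datavalue, so skip them.
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    return NO_COORDS, NO_COORDS


def write_rows(rows, handle):
    """Write (place name, entity, source file) triples as CSV rows."""
    writer = csv.writer(handle)
    # Assumed header names for the four columns named in the description.
    writer.writerow(["place_name", "latitude", "longitude", "source_file"])
    for place, entity, source in rows:
        lat, lon = extract_coordinates(entity)
        writer.writerow([place, lat, lon, source])


# Illustrative sample entity with rounded, made-up coordinates.
sample_entity = {
    "claims": {
        "P625": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "datavalue": {"value": {"latitude": 52.5, "longitude": 13.4}},
                }
            }
        ]
    }
}

buffer = io.StringIO()
write_rows(
    [
        ("Berlin", sample_entity, "letters_01.txt"),
        ("Atlantis", {"id": "Q0", "missing": ""}, "letters_02.txt"),
    ],
    buffer,
)
print(buffer.getvalue())
```

Writing placeholder codes into the coordinate columns, instead of skipping the row, keeps one output row per extracted entity, so unresolved places can be reviewed later alongside their source file.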

Created:

2025-10-07
