
Dataset & code for "Using large language models to address the bottleneck of georeferencing natural history collections"

Source: Figshare. Updated: 2025-11-17. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Dataset_code_for_Using_large_language_models_to_address_the_bottleneck_of_georeferencing_natural_history_collections_/28904936/1
Resource description:
Datasets and code used in the paper "Using large language models to address the bottleneck of georeferencing natural history collections".

1. System requirements: Windows 10; R language: v4.2.2; Python: v3.8.12

2. Instructions for use: The "data" folder contains the key sampling and intermediate data from the analysis in this study. The initial specimen dataset, a total of 13,064,051 records from the Global Biodiversity Information Facility (GBIF), can be downloaded via the GBIF DOI: https://doi.org/10.15468/dl.fj3sqk.

Data file names and their meaning or purpose:

occurrence_filter_clean.csv: the initial specimen data after cleaning, before the 5,000 records were sampled by continent.

main data frame 5000_only country state county locality.csv: the 5,000 sample records used for georeferencing, containing only basic information: country, state/province, county, locality, and the true latitude and longitude from GBIF.

main data frame 100_only country state county locality.csv: the 100 sub-sample records used for human and reasoning-LLM georeferencing, containing only basic information: country, state/province, county, locality, and the true latitude and longitude from GBIF.

main data frame 5000.csv: all output data and required records from the analysis of the 5,000 sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics.

main data frame 100.csv: all output data and required records from the analysis of the 100 sub-sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics.

georef_errorDis.csv: used for Figure 1b.

summary_error_time_cost.csv: time and cost records for the various georeferencing methods, used for Figure 4.

for_human_completed.csv: results of manual georeferencing by the participants.

hf_v2geo.tif: Global Human Footprint Dataset (Geographic) (Version 2.00), from https://gis.earthdata.nasa.gov/portal/home/item.html?id=048c92f5ce50462a86b0837254924151, used for Figure 5a.

country file folder: global country and county polygon vector data, used to extract the centroid coordinates of counties in ArcGIS v10.8.
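
The main data frames record an error distance for each georeferencing method, that is, how far a method's coordinates fall from the true GBIF latitude and longitude. The description does not say how those distances were computed, so the following is only a minimal Python sketch of one common choice, the haversine great-circle distance. The Earth radius and the function names are assumptions made for illustration, not taken from the archive.

```python
import math

# Mean Earth radius in km; an assumption of this sketch, not a value from the archive.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def add_error_distance(rows, true_cols, pred_cols, out_col="error_km"):
    """Return copies of rows (dicts) with an error-distance column added.

    true_cols and pred_cols are (latitude_column, longitude_column) pairs.
    The column names are supplied by the caller because the real headers
    in the CSV files are not listed in the description.
    Rows with a missing or non-numeric coordinate get None.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        try:
            coords = [float(row[c]) for c in (*true_cols, *pred_cols)]
            new_row[out_col] = haversine_km(*coords)
        except (KeyError, TypeError, ValueError):
            new_row[out_col] = None
        result.append(new_row)
    return result
```

To apply this to one of the CSV files, the rows could be read with csv.DictReader from the standard library and passed to add_error_distance together with the actual latitude and longitude column names, which should be checked in the file header first.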
Provided by:
Feng, Xiao; Xie, Yuyang
Created:
2025-11-17