
Geolocation Matching Dataset

Indexed by 海数据 on 2026-03-14
Download link:
https://haidatas.com/dataset/diliweizhipipeishujuji_afab8e17
Resource description:
Geolocation Matching Dataset

Data source: publicly available data from the Internet

Tags: geolocation, entity matching, text similarity, coordinate distance, data fusion, machine learning, geographic information system (GIS), matching prediction

Data overview:
This dataset contains geolocation-related entity records collected from several sources and is intended for entity matching tasks. Each record pairs two entities (Entity 1 and Entity 2) and carries their attributes, precomputed comparison features, and a match label.

Time span: the data carries no explicit time information, so it can be treated as a static dataset.
Geographic scope: coverage is not stated; place names and the country code BE suggest that it may cover Belgium.
Data format: CSV, file name test_dataset.csv.

Columns:
  id_1, name_1, latitude_1, longitude_1: ID, name, latitude, and longitude of Entity 1.
  id_2, name_2, latitude_2, longitude_2: ID, name, latitude, and longitude of Entity 2.
  feat_incl: flag indicating whether features are included.
  latdiff, londiff: latitude difference and longitude difference.
  manhattan, euclidean, haversine: Manhattan, Euclidean, and Haversine distance between the two coordinates.
  name_geshs, name_levens, name_jaros, name_len_1, name_len_2, name_nlevens: "Gensh" similarity (as spelled in the source; the geshs suffix possibly refers to Gestalt pattern matching), Levenshtein similarity, and Jaro similarity of the two names, the length of each name, and the normalized Levenshtein distance.
  cat_match: category match indicator.
  address_geshs, address_levens, address_jaros, address_len_1, address_len_2, address_nlevens: the same similarity indicators for addresses.
  city_geshs, city_levens, city_jaros, city_len_1, city_len_2, city_nlevens: the same similarity indicators for cities.
  state_geshs, state_levens, state_jaros, state_len_1, state_len_2, state_nlevens: the same similarity indicators for states/provinces.
  zip_geshs, zip_levens, zip_jaros, country_geshs, country_levens, country_jaros: similarity indicators for zip codes and countries.
  url_geshs, url_levens, url_jaros, url_len_1, url_len_2, url_nlevens, phone_geshs, phone_levens, phone_jaros: similarity indicators for URLs and phone numbers.
  categories_geshs, categories_levens, categories_jaros, categories_len_1, categories_len_2, categories_nlevens: similarity indicators for categories.
  text: text descriptions of the two entities, separated by the [SEP] delimiter.
  target: match label (0 = the two entities do not match, 1 = they match).

The geolocation, text, and other attributes have already been processed, so the file can be used directly for analysis and for training entity matching models.

Data applications:
  Research and analysis: academic work in geographic information systems (GIS), entity alignment, and information retrieval, for example research on entity matching algorithms that combine multiple information sources.
  Industrial applications: data support for map services, address databases, and business intelligence, in particular address cleaning, POI (Point of Interest) matching, and location recommendation.
  Decision support: customer data integration, market analysis, and supply chain management.
  Education and training: practical material for courses on geographic information systems, data mining, and machine learning, helping students and researchers understand how entity matching works.

The dataset is particularly suited to studying how geolocation, text, and attribute features each contribute to entity matching, and to building machine learning models that automate the matching.
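
To make the coordinate and name feature columns listed above more concrete, here is a minimal Python sketch that recomputes a few of them for a single pair of entities. It is not the pipeline that produced test_dataset.csv; the listing does not document units or formulas, so everything below is an assumption: latdiff and londiff are taken as absolute differences in degrees, manhattan and euclidean are computed on those degree differences, haversine is in kilometres with a mean Earth radius of 6371 km, name_geshs is taken to be the ratio from Python's difflib.SequenceMatcher (Gestalt pattern matching), name_levens is a raw edit distance, and name_nlevens divides it by the longer name's length. The helper names and the sample pair are hypothetical.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_KM = 6371.0  # assumed unit and radius; the dataset does not document either


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pair_features(name_1, lat_1, lon_1, name_2, lat_2, lon_2):
    """Recompute a subset of the listed feature columns for one entity pair (assumed formulas)."""
    latdiff = abs(lat_1 - lat_2)
    londiff = abs(lon_1 - lon_2)
    lev = levenshtein(name_1, name_2)
    longest = max(len(name_1), len(name_2))
    return {
        "latdiff": latdiff,
        "londiff": londiff,
        "manhattan": latdiff + londiff,
        "euclidean": math.hypot(latdiff, londiff),
        "haversine": haversine_km(lat_1, lon_1, lat_2, lon_2),
        "name_geshs": SequenceMatcher(None, name_1, name_2).ratio(),
        "name_levens": lev,
        "name_len_1": len(name_1),
        "name_len_2": len(name_2),
        "name_nlevens": lev / longest if longest else 0.0,
    }


# Invented sample pair, not taken from the dataset.
features = pair_features("Cafe Central", 50.8467, 4.3525, "Café Central Brussel", 50.8470, 4.3530)
for column, value in features.items():
    print(f"{column}: {value}")
```

Comparing such recomputed values against the corresponding columns of a few real rows is one way to check which units and normalizations the file actually uses before relying on the features in a model.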
Provider:
Publicly available Internet data
Created:
2026-02-20