company_names_data
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://figshare.com/articles/dataset/company_names_data/21385260
Description:
The data contain a sample of 1,597,336 pairs of firms matched from two data sources: the HeadHunter job board platform and the Ruslana firm-level data aggregator. The columns are:
hh_name - the set of lowercased names (original and transliterated) of a firm from the HeadHunter platform;
rus_name - the set of lowercased names (original and transliterated) of a firm from the Ruslana platform;
J_M - the Jaccard similarity between the two preceding sets, approximated with MinHash (100 hash functions) and converted to an integer scale {0, 1, ..., 100};
d_H - the geographic distance between the two firms, in km;
match - a Boolean indicator of whether the pair is a match (labeled and manually validated);
ind - a Boolean indicator of whether the two firms are in the same industry, based on the similarity of the company and industry descriptions;
entity - a Boolean indicator of whether the two firms have the same legal form, based on company keywords;
subs - a Boolean indicator that at least one name variant of the firm in one data source is a substring of a name variant in the other data source;
sample - "train" or "test", indicating whether the pair belongs to the training or the test sample.
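
To make the J_M and subs columns more concrete, here is a minimal Python sketch of how such values could be computed. It is an illustration under assumptions, not the dataset authors' code: the function names, the seeded BLAKE2b hashing scheme, and the choice to treat each name variant as one set element (rather than tokens or character shingles) are all hypothetical. Only the 100 hash functions and the 0 to 100 integer scale come from the description above.

```python
import hashlib

NUM_HASHES = 100  # the description states 100 hash functions


def _seeded_hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of a string under a given seed (assumed scheme)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int = NUM_HASHES) -> list:
    """One minimum per seeded hash function over all items in the set."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def jaccard_minhash_score(a: set, b: set, num_hashes: int = NUM_HASHES) -> int:
    """Estimate Jaccard similarity as the share of matching signature slots,
    rescaled to an integer in {0, 1, ..., 100}."""
    sig_a = minhash_signature(a, num_hashes)
    sig_b = minhash_signature(b, num_hashes)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return round(100 * matches / num_hashes)


def substring_flag(names_a: set, names_b: set) -> bool:
    """True if any name on one side is a substring of a name on the other side."""
    return any(x in y or y in x for x in names_a for y in names_b)


# Hypothetical name sets, not taken from the dataset.
hh = {"ооо ромашка", "ooo romashka"}
rus = {"ромашка", "romashka"}
score = jaccard_minhash_score(hh, rus)  # integer between 0 and 100
flag = substring_flag(hh, rus)          # "romashka" is inside "ooo romashka"
```

With 100 hash functions the number of matching signature slots is already the score on the 0 to 100 scale, which may be why that scale was chosen.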
Created:
2022-10-23



