company_names_data
Source: DataCite Commons. Updated 2022-10-23. Indexed 2024-07-29.
Download link:
https://figshare.com/articles/dataset/company_names_data/21385260
Description:
The data contain a sample of 1,597,336 pairs of firms matched from two data sources: the HeadHunter job board platform and the Ruslana firm-level data aggregator. The columns are as follows:
hh_name - the set of lowercased names (original and transliterated) of a firm from the HeadHunter platform;
rus_name - the set of lowercased names (original and transliterated) of a firm from the Ruslana platform;
J_M - the Jaccard similarity between the two preceding sets, obtained with a MinHash approximation (100 hash functions) and converted to the integer scale {0, 1, ..., 100};
d_H - the geographic distance between the two firms, in km;
match - a Boolean indicator of whether the pair is a match (labeled and manually validated);
ind - a Boolean indicator of whether the two companies are in the same industry, based on the similarity of the company and industry descriptions;
entity - a Boolean indicator of whether the two companies have the same legal form, based on company keywords;
subs - a Boolean indicator that at least one company name variant from one data source is a substring of a company name from the other data source;
sample - "train" or "test", indicating whether the pair belongs to the training or the test sample.
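The J_M and subs columns can be illustrated with a minimal sketch. This is not the dataset authors' code: the salted BLAKE2b hash family, the function names, and the choice to treat each name variant as one set element are all assumptions made for illustration. Only the 100 hash functions and the 0 to 100 integer scale come from the description.

```python
import hashlib
from typing import Iterable, List

NUM_HASHES = 100  # the description mentions 100 hash functions


def _salted_hash(item: str, seed: int) -> int:
    # Assumption: a seed-salted BLAKE2b digest stands in for one hash function.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: Iterable[str], num_hashes: int = NUM_HASHES) -> List[int]:
    # For each hash function, keep the minimum hash value over the set.
    unique = set(items)
    if not unique:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_salted_hash(x, seed) for x in unique) for seed in range(num_hashes)]


def jaccard_minhash_score(a: Iterable[str], b: Iterable[str],
                          num_hashes: int = NUM_HASHES) -> int:
    # The share of signature positions that agree estimates the Jaccard
    # similarity; it is scaled to an integer in {0, 1, ..., 100}.
    sig_a = minhash_signature(a, num_hashes)
    sig_b = minhash_signature(b, num_hashes)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return round(100 * matches / num_hashes)


def has_substring_match(names_a: Iterable[str], names_b: Iterable[str]) -> bool:
    # True if any name variant from one source is contained in a variant from the other.
    names_b = list(names_b)
    return any(x in y or y in x for x in names_a for y in names_b)


# Hypothetical name sets, not taken from the dataset.
hh = {"ооо ромашка", "ooo romashka"}
rus = {"ромашка", "romashka"}
print(jaccard_minhash_score(hh, rus), has_substring_match(hh, rus))
```

Because MinHash only approximates the Jaccard similarity, a score computed this way will generally differ from the stored J_M values unless the same hash functions and the same set elements (for example tokens or character n-grams rather than whole names) are used.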
Provider:
figshare
Created:
2022-10-23



