company_names_data
Source: Figshare. Updated: 2022-10-23. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/company_names_data/21385260
Description:
The data contain a sample of 1,597,336 pairs of firms matched from two data sources: the HeadHunter job board platform and the Ruslana firm-level data aggregator. The columns are as follows:
hh_name - the set of lowercased names (original and transliterated) of a firm from the HeadHunter platform;
rus_name - the set of lowercased names (original and transliterated) of a firm from the Ruslana platform;
J_M - the Jaccard similarity between the two name sets above, approximated with MinHash (100 hash functions) and converted to the integer scale {0, 1, ..., 100};
d_H - the geographic distance between the two firms, in km;
match - a Boolean flag indicating whether the pair is a match (labelled and manually validated);
ind - a Boolean flag indicating that the two firms are in the same industry, based on the similarity of the company and industry descriptions;
entity - a Boolean flag indicating that the two firms have the same legal form, based on company keywords;
subs - a Boolean flag indicating that at least one name variant of the firm in one data source is a substring of a name of the firm in the other data source;
sample - "train" or "test", indicating whether the pair belongs to the training or the test sample.
Created:
2022-10-23
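
To make the J_M column more concrete, here is a minimal Python sketch of a MinHash estimate of Jaccard similarity with 100 hash functions, rescaled to the integers 0..100 as in the description. It is an illustration only, not the dataset authors' code. The function names (shingles, minhash_signature, j_m), the character 3-gram tokenisation, the pooling of tokens over all name variants, and the use of salted BLAKE2b as the hash family are all assumptions; the description does not say how the name sets were tokenised or hashed.

```python
import hashlib

NUM_HASHES = 100  # the description states 100 hash functions


def shingles(name, k=3):
    """Character k-grams of a name (this tokenisation is an assumption)."""
    name = name.lower()
    if len(name) <= k:
        return {name}
    return {name[i:i + k] for i in range(len(name) - k + 1)}


def minhash_signature(tokens, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all tokens."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def j_m(names_a, names_b):
    """Estimated Jaccard similarity of two name sets on the integer scale 0..100."""
    # Pool the shingles of every name variant (original and transliterated).
    tokens_a = set().union(*(shingles(n) for n in names_a))
    tokens_b = set().union(*(shingles(n) for n in names_b))
    if not tokens_a or not tokens_b:
        return 0
    sig_a = minhash_signature(tokens_a)
    sig_b = minhash_signature(tokens_b)
    # The share of positions where the signatures agree estimates the Jaccard similarity.
    agree = sum(x == y for x, y in zip(sig_a, sig_b))
    return round(100 * agree / NUM_HASHES)
```

For example, j_m({"газпром", "gazprom"}, {"gazprom neft", "газпром нефть"}) returns an integer between 0 and 100, and two identical name sets always score 100 because their signatures coincide. With 100 hash functions the estimate has a resolution of one percentage point, which is consistent with the integer scale used for the column.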



