
Datasets: Similarity Indices of Academic and Professional Trajectories Through Data Vectorization

Figshare | Updated 2026-03-03 | Indexed 2026-04-28
Download link:
https://figshare.com/articles/dataset/Datasets_Similarity_Indices_of_Academic_and_Professional_Trajectories_Through_Data_Vectorization/31452007
Description

This repository contains the vectorized datasets used in the similarity analysis described in Section "Key variables and normalization". Each file corresponds to a specific preprocessing configuration applied to the same set of key academic and professional trajectory variables.

Variables included in all files

All datasets contain the following variables used for vector construction:

Academic trajectory variables
- YS(i): Years of individual studies (integer, n > 0)
- AL(i): Individual academic levels (integer, n > 0)
- MAL(i): Maximum individual academic level (categorical: {1,…,6})
- IS(i): Individual informal studies (integer, n > 0)

Professional trajectory variables
- ED(i): Experience duration (float, m > 0)
- JR(i): Relative job rotation (float, m > 0)
- σJR(i): Relative job rotation (standard deviation) (float, m > 0)
- JM(i): Relative job mobility (float, m > 0)
- σJT(i): Job tenure (standard deviation) (float, m > 0)

Preprocessing configurations

To ensure comparability across heterogeneous dimensions and robustness to outliers, a differentiated preprocessing strategy was applied depending on the similarity index.

1. Raw configuration (raw)
- Original variable values without scaling.
- Used as the baseline representation.
- Files: raw_samples_data.xlsx, raw_full_dataset.csv

2. Euclidean configuration (robust_scaler)
- RobustScaler was applied to mitigate the influence of outliers while preserving relative magnitudes across profiles.
- Absolute differences in the scaled space directly affect Euclidean distance.
- This configuration emphasizes discrepancies in levels and dispersion across metrics.
- Files: robust_scaler_samples_data.xlsx, robust_scaler_full_dataset.csv

3. Cosine configuration (norma_l2)
- RobustScaler was first applied, followed by L2 normalization.
- Vectors are rescaled to unit length.
- Cosine similarity captures directional pattern affinity rather than magnitude differences.
- This configuration prioritizes relative structure across variables even when overall scale differs.
- Files: norma_l2_samples_data.xlsx, norma_l2_full_dataset.csv

File structure
- Samples files (.xlsx): Subset used for illustrative and validation analysis.
- Full dataset files (.csv): Complete dataset used for similarity computation and clustering procedures.
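
The robust_scaler and norma_l2 configurations can be illustrated with a minimal sketch. It is not the authors' pipeline: the three-column toy profiles (standing in for YS, ED and JR) are invented for illustration, and the helper names robust_scale, l2_normalize and cosine_similarity are hypothetical. The scaling step assumes the usual robust-scaling convention (center each variable on its median, divide by its interquartile range), which is also the default behavior of scikit-learn's RobustScaler; the standard library is used here only to keep the example dependency-free.

```python
import math
import statistics

def robust_scale(rows):
    """Center each column on its median and divide by its interquartile range."""
    scaled_columns = []
    for col in zip(*rows):
        median = statistics.median(col)
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        iqr = (q3 - q1) or 1.0  # constant column: leave the scale unchanged
        scaled_columns.append([(v - median) / iqr for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def l2_normalize(rows):
    """Rescale each row vector to unit Euclidean length."""
    normalized = []
    for row in rows:
        norm = math.hypot(*row) or 1.0  # all-zero row: leave as is
        normalized.append([v / norm for v in row])
    return normalized

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented toy profiles (NOT taken from the dataset): columns stand in for YS, ED, JR.
profiles = [
    [16, 5.0, 0.4],
    [18, 10.0, 0.2],
    [12, 2.5, 0.9],
    [20, 20.0, 0.1],
]

scaled = robust_scale(profiles)   # analogous to the robust_scaler files
unit = l2_normalize(scaled)       # analogous to the norma_l2 files

# Euclidean distance is computed in the robust-scaled space,
# cosine similarity on the unit-length vectors.
print("Euclidean distance:", math.dist(scaled[0], scaled[1]))
print("Cosine similarity:", cosine_similarity(unit[0], unit[1]))
```

For unit-length vectors the two measures are tied together by d² = 2 − 2·cos, so ranking profiles by Euclidean distance on the norma_l2 vectors gives the same order as ranking by cosine similarity. The Euclidean configuration is the one that retains magnitude information.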
Created: 2026-03-03