Code and Data for "How Large is Large Enough?"

Mendeley Data · Updated 2024-03-27 · Indexed 2024-06-26

Download link:

https://data.mendeley.com/datasets/xsy73v92wv


Description:

Here we share the data used in our study of the minimum representative size of a subset, together with the code for the metric. When two sets of data points are compared, such as the citations received by all papers of two journals within a certain time period, people usually compare the means, for example the journal impact factor (JIF). We may instead compare the two sets directly: draw one sample from each set, compare the two samples, and count the fraction of draws in which the sample from set one is larger than the sample from set two. Even when the mean of set one is larger than that of set two, this fraction can still be very low, especially when the two sets have large variances, i.e., when the sum of the variances is close to or even larger than the difference between the means. In that case, the two data sets overlap substantially. We find a way to reduce the variance, and thus the overlap: take a subset of K1 samples from set one and K2 samples from set two, then calculate and compare the averages of the two subsets. Based on this observation, we find that as long as K1 and K2 are large enough, the fraction of draws in which the K1-average of set one is larger than the K2-average of set two can be quite high, in contrast to the originally low fraction obtained with single samples. We then define the subset size of each set needed for a reliable comparison of the two sets as the minimum representative size of that set, and apply it to a set of journals. Here we provide the data and the code. Examples are provided in the comments in the code.
The Python program $PrMultiSamComp\left(X, Y, K_{X}, K_{Y}, Pr, K2PorP2K\right)$ implements the metric. Given the two sets $X, Y$ and the threshold probability $Pr$, the program calculates $K_{X}, K_{Y}$ when the flag is $K2PorP2K=0$; given the two sets $X, Y$ and the re-sampling subset sizes $K_{X}, K_{Y}$, it calculates $Pr$ when $K2PorP2K=1$.
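
The idea described above can be illustrated with a small Monte Carlo sketch. This is not the authors' `PrMultiSamComp`; the function names `prob_mean_exceeds` and `min_subset_size` are hypothetical. The sketch assumes sampling with replacement, a strict "greater than" comparison (ties count as not larger), and, for the size search, the same subset size for both sets. The dataset's own code may make different choices on all three points.

```python
import random
from statistics import fmean


def prob_mean_exceeds(x, y, k_x, k_y, n_trials=5000, seed=None):
    """Estimate P(mean of k_x draws from x > mean of k_y draws from y).

    Assumption: draws are made with replacement and ties count as "not larger".
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        mean_x = fmean(rng.choices(x, k=k_x))
        mean_y = fmean(rng.choices(y, k=k_y))
        if mean_x > mean_y:
            wins += 1
    return wins / n_trials


def min_subset_size(x, y, pr, k_max=500, n_trials=5000, seed=None):
    """Return the smallest k (used for BOTH sets, an assumption of this sketch)
    whose estimated probability reaches pr, or None if no k <= k_max does."""
    for k in range(1, k_max + 1):
        if prob_mean_exceeds(x, y, k, k, n_trials=n_trials, seed=seed) >= pr:
            return k
    return None


if __name__ == "__main__":
    # Synthetic toy data (not from the dataset): two overlapping distributions.
    gen = random.Random(0)
    set_one = [gen.gauss(1.0, 3.0) for _ in range(2000)]
    set_two = [gen.gauss(0.0, 3.0) for _ in range(2000)]
    print("single-sample probability:", prob_mean_exceeds(set_one, set_two, 1, 1, seed=1))
    print("50-sample-average probability:", prob_mean_exceeds(set_one, set_two, 50, 50, seed=1))
    print("smallest k reaching 0.9:", min_subset_size(set_one, set_two, 0.9, n_trials=1000, seed=1))
```

In this sketch, `prob_mean_exceeds` plays the role the description assigns to the mode with $K2PorP2K=1$ (sizes in, probability out), and `min_subset_size` loosely mirrors the mode with $K2PorP2K=0$ (threshold probability in, size out). A linear search is used only for simplicity.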


Created:

2024-01-23
