
Code and Data for "How Large is Large Enough?"

Source: NIAID Data Ecosystem (indexed 2026-03-10)
Download link:
https://data.mendeley.com/datasets/xsy73v92wv
Description:
Here we share the data used in our study of the minimum representative size of a subset, together with the code for the metric. When two sets of data points are compared, for example the citations received by all papers of two journals within a certain time period, people usually compare the means, as the journal impact factor (JIF) does. We can instead compare the two sets directly: draw one sample from each set, compare the two samples, and count the fraction of draws in which the sample from set one is larger than the sample from set two. Even when the mean of set one is larger than that of set two, this fraction can still be quite low, especially when the two sets have large variances, i.e. when the sum of the variances is close to or even larger than the difference between the means. In that case the two data sets overlap heavily. We find a way to reduce the variance, and with it the overlap: draw K1 samples from set one and K2 samples from set two, then compute and compare the averages of the two subsets. Based on this observation, we find that as long as K1 and K2 are large enough, the fraction of draws in which the K1-average of set one is larger than the K2-average of set two can be quite high, in contrast to the low fraction obtained from single samples. We then define the subset size that each set needs for a reliable comparison of the two sets as the minimum representative size of the set, and apply it to a set of journals. Here we provide the data and the code. Examples are provided in the comments in the code. The metric is implemented in the Python program $PrMultiSamComp\left(X, Y, K_{X}, K_{Y}, Pr, K2PorP2K\right)$.
Given the two sets $X, Y$ and the threshold probability $Pr$, the program calculates $K_{X}, K_{Y}$ when the flag $K2PorP2K=0$; given the two sets $X, Y$ and the sizes of the resampling subsets $K_{X}, K_{Y}$, it calculates $Pr$ when $K2PorP2K=1$.
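The two directions described above (subset sizes to probability, and threshold probability to subset size) can be illustrated with a small Monte Carlo sketch. This is not the dataset's PrMultiSamComp program, whose internals are not shown here. The function names `prob_mean_greater` and `min_representative_size`, the sampling with replacement, the choice $K_{X}=K_{Y}$ in the search, and the parameters `n_trials`, `seed`, and `k_max` are all assumptions made for illustration.

```python
import numpy as np


def prob_mean_greater(x, y, k_x, k_y, n_trials=20000, seed=0):
    """Estimate P(mean of k_x draws from x > mean of k_y draws from y).

    Assumption: subsets are drawn with replacement; ties count as "not greater".
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each row is one resampled subset; average along the row.
    mean_x = rng.choice(x, size=(n_trials, k_x)).mean(axis=1)
    mean_y = rng.choice(y, size=(n_trials, k_y)).mean(axis=1)
    return float(np.mean(mean_x > mean_y))


def min_representative_size(x, y, pr, k_max=200, **kwargs):
    """Smallest k (using k_x == k_y == k, an assumption) whose estimated
    probability reaches the threshold pr; None if no k <= k_max does."""
    for k in range(1, k_max + 1):
        if prob_mean_greater(x, y, k, k, **kwargs) >= pr:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical toy data: two heavily overlapping sets whose means differ by 10.
    set_one = np.arange(100) + 10
    set_two = np.arange(100)
    print("single-sample probability:", prob_mean_greater(set_one, set_two, 1, 1))
    print("50-sample-average probability:", prob_mean_greater(set_one, set_two, 50, 50))
    print("smallest k reaching 0.9:", min_representative_size(set_one, set_two, 0.9))
```

Because the estimate is a simulation, the size returned for a given threshold can shift slightly with `n_trials` and `seed`; a larger `n_trials` reduces that noise at the cost of run time.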
Created:
2018-08-29