Data from: Size distribution of function-based human gene sets and the split-merge model

DataONE | Updated 2016-07-05 | Indexed 2024-06-26

Download link:

https://search.dataone.org/view/null


Description:

The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families in which genes are grouped by similar function or by a shared common property. The size distribution of HUGO Gene Nomenclature Committee (HGNC) gene sets deviates from the power law and is fitted much better by a beta rank function. We propose a simple mechanism that breaks a power-law size distribution through a combination of splitting and merging operations. The largest gene sets are split into two to account for subfunctional categories, and a small proportion of the other gene sets are merged into larger sets as new common themes are recognized. These operations are not uncommon for a curator of gene sets. A simulation shows that iterating these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the example of the distribution of transcription factors and drug target genes among HGNC gene families.
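The split-merge mechanism described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' simulation: the starting sizes come from a synthetic Zipf-like sequence rather than Ensembl paralogues, the split point is chosen uniformly at random, and the names `power_law_sizes`, `split_merge_step`, `simulate`, `beta_rank` and the parameter `merge_fraction` are hypothetical. The abstract does not state the functional form of the beta rank function; the helper below uses the form in which it is commonly written, f(r) = C (N + 1 - r)^b / r^a for rank r among N sets, which should be treated as an assumption here.

```python
import random


def power_law_sizes(n_sets, exponent=1.0, largest=1000):
    """Synthetic ranked sizes: the set at rank r has about largest / r**exponent genes."""
    return [max(1, round(largest / r ** exponent)) for r in range(1, n_sets + 1)]


def beta_rank(r, n, a, b, c):
    """Assumed beta rank form: c * (n + 1 - r)**b / r**a for rank r out of n."""
    return c * (n + 1 - r) ** b / r ** a


def split_merge_step(sizes, merge_fraction=0.01, rng=random):
    """One iteration: split the largest set in two, merge a few of the other sets."""
    sizes = sorted(sizes, reverse=True)
    largest = sizes.pop(0)

    # Merge: a small proportion of the *other* sets are combined pairwise.
    n_merges = int(merge_fraction * len(sizes))
    for _ in range(n_merges):
        if len(sizes) < 2:
            break
        i, j = rng.sample(range(len(sizes)), 2)
        merged = sizes[i] + sizes[j]
        # Remove the higher index first so the lower index stays valid.
        for k in sorted((i, j), reverse=True):
            sizes.pop(k)
        sizes.append(merged)

    # Split: the largest set becomes two non-empty parts (uniform cut, an assumption).
    if largest >= 2:
        cut = rng.randint(1, largest - 1)
        sizes.extend([cut, largest - cut])
    else:
        sizes.append(largest)

    return sorted(sizes, reverse=True)


def simulate(sizes, n_iter=50, merge_fraction=0.01, seed=0):
    """Apply split_merge_step repeatedly with a seeded random generator."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        sizes = split_merge_step(sizes, merge_fraction, rng)
    return sizes


if __name__ == "__main__":
    start = power_law_sizes(500)
    end = simulate(start, n_iter=50)
    # Print the ten largest sets before and after the split-merge iterations.
    print("before:", start[:10])
    print("after: ", end[:10])
```

Splitting and merging only regroup genes, so the total gene count is conserved across iterations; what changes is the shape of the rank-size curve. Comparing the sorted output against `beta_rank` for fitted `a`, `b`, `c` would be the natural next step, but fitting is left out of this sketch.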


Created:

2016-07-05
