Low-Complexity Domains (LCDs) in UniProt Reference Proteomes

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/8155289

Description:

This is a comprehensive dataset of low-complexity domains (LCDs) in UniProt reference proteomes. For the purposes of this dataset, LCDs were identified using the LCD-Composer algorithm with default parameters for each of the 20 canonical amino acids. These searches identify "primary" LCDs, defined as protein regions in which a single type of amino acid comprises at least 40% of the region. In addition, separate searches were performed to identify "secondary" LCDs, defined as regions in which a single type of amino acid comprises at least 40% of the region and a second type of amino acid comprises at least 20% of the same region. Note that secondary LCDs exhibit very strong spatial overlap with primary LCDs and may be considered, approximately speaking, a subset of primary LCDs.

There are seven main components to this dataset:

1. Primary and secondary LCDs for the original reference proteomes from UniProt (downloaded 8/22/2022). These data are found within four zipped archives ending in "_LCDs.zip", one for each domain of life (Archaea, Bacteria, Eukaryota, and Viruses). Within each zipped archive, results are contained in a pair of files for each organism, and each file name starts with the organism's UniProt ID. Primary LCDs are in the file ending in "_LCDcomposer_RESULTS.tsv", whereas secondary LCDs are in the file ending in "_LCDcomposer_SecondaryLCDs_RESULTS.tsv". The reference proteomes analyzed for each organism are also provided in separate zipped archives, one for each domain of life.

2. Primary and secondary LCDs for a scrambled version of each proteome mentioned above. These searches were performed using identical search parameters and are included for statistical comparisons. When scrambling the proteomes, each protein sequence was scrambled individually to maintain its amino acid composition. File formats are identical to those described above, except that all files have "SCRAMBLED" in the name to distinguish them from analyses of the original (i.e. native) proteomes.

3. The "SecondaryLCDs_by_LCDcategory.zip" archive contains all secondary LCDs from the original proteomes, parsed by LCD category rather than by organism. These LCDs are identical to those in #1 above but are provided in this format to aid those interested in specific types of LCDs and the organisms that contain them.

4. The "GOA_files.zip" archive contains the gene ontology files necessary for reproducing the analyses in Cascarina and Ross (2024).

5. The "Pfam_Data.zip" archive contains files with Pfam annotations in LCD-containing proteins and Pfam clan information. These files are necessary for reproducing the analyses in Cascarina and Ross (2024).

6. The "Observed_vs_Scrambled_LCDfrequency_Statistics.zip" archive contains the results of statistical analyses of LCD enrichment or depletion in native ("Observed") proteomes compared to scrambled proteomes. Enrichment is defined as the native proteome containing more LCD-containing proteins for a particular LCD type than a scrambled version of that proteome. Depletion is defined as the native proteome containing fewer LCD-containing proteins for a particular LCD type than a scrambled version of that proteome. In cases where 0 instances of an LCD class occurred in the native proteome, the scrambled proteome, or both, biased estimates for the natural log of the odds ratio ("lnOR") and the p-value were calculated by first adding 1 to all cells in the contingency table.

7. The "RandomlySelectedOrganisms.zip" archive contains LCD-Composer results for 50 randomly selected organisms from each domain of life, with the window size and composition thresholds used during the LCD searches varied systematically.
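The scrambling control and the composition thresholds described above can be illustrated with a minimal Python sketch. This is not the LCD-Composer implementation: the function names `scramble` and `meets_composition` are hypothetical, and the check applies only the 40% and 20% composition thresholds to a region that has already been chosen, ignoring the window scanning and any other default parameters of the actual algorithm.

```python
import random
from collections import Counter


def scramble(seq, rng):
    """Shuffle one protein sequence on its own, so that its amino acid
    composition is preserved exactly (hypothetical helper)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def meets_composition(region, primary, secondary=None,
                      primary_min=0.40, secondary_min=0.20):
    """Return True if `primary` makes up at least `primary_min` of the region
    and, when `secondary` is given, it makes up at least `secondary_min`.
    Simplified illustration only; not the LCD-Composer search itself."""
    if not region:
        return False
    counts = Counter(region)
    n = len(region)
    if counts[primary] / n < primary_min:
        return False
    if secondary is not None and counts[secondary] / n < secondary_min:
        return False
    return True


# Example: a made-up Q-rich region, used purely for illustration.
region = "QQQQNNAS"
shuffled = scramble(region, random.Random(0))
same_composition = Counter(shuffled) == Counter(region)
```

Because each sequence is shuffled separately, a scrambled proteome keeps every protein's length and composition while removing any positional clustering of residues, which is what makes it usable as a baseline for the statistical comparisons.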
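The add-one correction for the log odds ratio can be sketched as follows. The layout of the contingency table is an assumption (proteins with and without a given LCD type, in the native versus the scrambled proteome), as is the orientation in which a positive value indicates enrichment in the native proteome. The function name `ln_odds_ratio` is hypothetical. The p-value is not computed here because the description does not name the statistical test used.

```python
import math


def ln_odds_ratio(native_with, native_without,
                  scrambled_with, scrambled_without):
    """Natural log of the odds ratio for an assumed 2x2 table:

                      with LCD         without LCD
        native        native_with      native_without
        scrambled     scrambled_with   scrambled_without

    If the LCD class has 0 instances in the native proteome, the scrambled
    proteome, or both, 1 is added to all four cells first, which gives a
    biased estimate as described in the text.
    """
    cells = [native_with, native_without, scrambled_with, scrambled_without]
    if native_with == 0 or scrambled_with == 0:
        cells = [c + 1 for c in cells]
    a, b, c, d = cells
    return math.log((a * d) / (b * c))


# Made-up counts for illustration only (not taken from the dataset).
enriched = ln_odds_ratio(10, 90, 5, 95)      # positive: more in native
zero_native = ln_odds_ratio(0, 100, 4, 96)   # correction applied
```

Under this orientation, a positive lnOR corresponds to enrichment and a negative lnOR to depletion in the native proteome relative to its scrambled counterpart.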

Created:

2024-05-23
