Bilateral Perisylvian Polymicrogyria
Source: NIAID Data Ecosystem (indexed 2026-03-07)
Download link:
https://figshare.com/articles/dataset/Bilateral_Perisylvian_Polymicrogyria/89645
Description:
The idea behind posting this data here is that anybody working on this particular disease can download it (from myExperiment: http://www.myexperiment.org/packs/157) for free, and use it to help shorten the time-to-discovery or time-to-therapy. Anyone can download it: your mate down the local pub, people interested in the condition, DIY biologists wanting to move the field forward quickly.
I don't want people to wait 10 years for progress. I want people to know about it now, and then do something about it!
The data contains lists of candidate genes and pathways within the chromosome regions linked to this disease, and the SNPs within those regions. All it requires is someone with 'some' knowledge of the area to go through it and try to find the genes or pathways that may cause the disease. With this being a congenital brain abnormality, research might best be focused on neural tube development, though my expertise in this area is limited at best. You may find something else far more interesting, for example salivary secretion pathways and the fact that people with this condition can have trouble swallowing.
I've also applied some text mining over the pathways (linked to the genes) and the genes themselves, to rank them in order of most likely to be linked to the disease (through occurrence of common words in abstracts).
I was going to publish this data as a paper, but then thought I would stick it to 'the man', namely the scientific infrastructure we now have - making me write papers, slowing down research, and creating a system where nobody shares data quickly.
The reason behind this is to highlight the fact that publishing in this manner is just as valid on a CV as a peer-reviewed publication, and that it may also speed up the discovery of novel genes linked to this disease, thereby increasing the chances of finding some therapeutic treatments. If I am able to explain that people are actively using this data in their research, when the time comes for more interviews, then I'm quite sure that having this on my CV would work just as well.
Please tell people about this research, and the reasons behind it. I should probably note that no University resources were used in the generation of this data.
This data was derived from a series of Taverna workflows, which can be found within the myExperiment pack: http://www.myexperiment.org/packs/157. The SNP data was obtained via an export from BioMart. For this I selected the Homo sapiens variation database, and then chose the filters for the chromosome regions noted by Villard et al. (2002).
The 'Pathways and Gene annotations for Human QTL region' workflow searches for genes which reside in a QTL (Quantitative Trait Locus) region in human (Homo sapiens). The workflow requires three inputs: a chromosome name or number, a QTL start base pair position, and a QTL end base pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers. The KEGG gene identifiers are then used to search for pathways in the KEGG pathway database. All output from this workflow forms a flatfile database, which can be used to cross-reference data between the various output files.
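As a rough illustration of the region lookup at the start of that workflow, here is a minimal Python sketch. It is not the Taverna workflow and does not query BioMart or KEGG; the `Gene` record, the `genes_in_region` helper and the gene entries are all made up for the example, and it assumes that a gene counts as "in" the region if it overlaps it at all.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; the real data comes from BioMart."""
    name: str
    chromosome: str
    start: int  # base pair position
    end: int    # base pair position


def genes_in_region(genes, chromosome, qtl_start, qtl_end):
    """Return the genes on `chromosome` that overlap [qtl_start, qtl_end]."""
    return [
        g for g in genes
        if g.chromosome == chromosome
        and g.start <= qtl_end
        and g.end >= qtl_start
    ]


# Made-up records for illustration only; these are not real annotations.
genes = [
    Gene("GENE_A", "X", 1_000, 5_000),
    Gene("GENE_B", "X", 9_000, 12_000),
    Gene("GENE_C", "7", 2_000, 3_000),
]

hits = genes_in_region(genes, "X", 4_000, 10_000)
print([g.name for g in hits])  # ['GENE_A', 'GENE_B']
```

The three inputs mirror the workflow's inputs (chromosome, start position, end position); everything after that step (annotation, KEGG identifier mapping, pathway lookup) is left out.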
The workflow 'Pathway and Gene to PubMed' takes in a list of gene names and KEGG pathway descriptions, and searches the PubMed database for corresponding articles. Any matches to these are then retrieved as textual abstracts. These abstracts are then used to calculate a cosine similarity score between two corpora (gene and phenotype, or pathway and phenotype), each represented as a vector of terms.
The workflow counts the number of articles in the PubMed database in which each term occurs, and identifies the total number of articles in the entire PubMed database, so that a term enrichment score may be calculated. The workflow also takes in a document containing abstracts that are related to a particular phenotype. Scientific terms are then extracted from this text and given a weighting according to the number of terms that appear in the document. This represents the phenotype concept profile. The higher the value, the better the score. The weighting is given as:

X = log((a / b) / (c / d))

where:
a = number of occurrences of individual terms in the phenotype corpus
b = number of abstracts in the entire phenotype corpus
c = number of occurrences of individual terms in the entire PubMed
d = number of articles in the entire PubMed

Once this has been created, the pathways obtained from the QTL workflow are analysed.
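To make the weighting X concrete, here is a small Python sketch of the formula as written. The function name `enrichment` and the counts in the example are invented for illustration, and the natural logarithm is an assumption, since the base of the log is not stated here.

```python
import math


def enrichment(term_count, corpus_size, pubmed_term_count, pubmed_size):
    """Compute log((a / b) / (c / d)) for one term.

    a = term_count         occurrences of the term in the corpus
    b = corpus_size        abstracts in the corpus
    c = pubmed_term_count  occurrences of the term in all of PubMed
    d = pubmed_size        articles in all of PubMed
    Natural log is assumed; the description does not give the base.
    """
    return math.log((term_count / corpus_size) / (pubmed_term_count / pubmed_size))


# Invented counts: a term seen 50 times in 200 phenotype abstracts,
# and 10,000 times across 20,000,000 PubMed articles.
x = enrichment(50, 200, 10_000, 20_000_000)
print(round(x, 3))
```

A term that is no more common in the phenotype corpus than in PubMed as a whole scores 0; terms that are over-represented in the phenotype corpus score above 0.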
The (unweighted) phenotype terms are searched in the gene and pathway corpus. This will determine if the phenotype shares any common terms with the given gene or pathway, which can then be used to correlate the gene/pathway with the phenotype.
A cosine vector score is used to calculate the correlation between each gene/pathway and the phenotype. The higher the score (towards 1), the better the correlation.
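For anyone unfamiliar with cosine scores, the sketch below shows the standard cosine similarity between two bags of terms in Python. It is a generic illustration rather than the workflow's own code: the `cosine_similarity` helper, the use of raw term counts as vector entries, and the two toy word lists are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares nothing with anything
    return dot / (norm_a * norm_b)


# Toy term lists standing in for terms extracted from abstracts.
phenotype = Counter("cortex malformation seizure cortex".split())
pathway = Counter("cortex development migration".split())

print(cosine_similarity(phenotype, pathway))
```

With non-negative counts the score lies between 0 (no shared terms) and 1 (identical term proportions), which matches the "towards 1 is better" reading above.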
Each phenotype term is also assigned a weight:

Y = log((e / f) / (c / d))

where:
e = number of occurrences of individual terms in the pathway corpus
f = number of abstracts in the pathway corpus (per pathway)
c = number of occurrences of individual terms in the entire PubMed
d = number of articles in the entire PubMed
The weighted terms are then given an enrichment score, which is the total of X + Y. This gives the link between the pathway and the phenotype a score / significance value. The higher the score, the more "appropriate/interesting" the link between the pathway and the phenotype. The terms are also ranked according to the number of pathways which have been given a weight. This is calculated as: W = X + Y. The higher the value, the better the score.
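A minimal Python sketch of the combined score might look like the following. It assumes the X and Y weights have already been computed per term, and it only scores terms that appear in both sets; the `combined_scores` helper and the weights are invented for the example.

```python
def combined_scores(x_weights, y_weights):
    """Return (term, W) pairs with W = X + Y, highest W first.

    Only terms with both a phenotype weight (X) and a pathway
    weight (Y) are scored; that restriction is an assumption.
    """
    shared = x_weights.keys() & y_weights.keys()
    scores = {term: x_weights[term] + y_weights[term] for term in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented weights for illustration only.
x_weights = {"cortex": 2.0, "seizure": 1.5, "swallowing": 0.5}
y_weights = {"cortex": 1.0, "seizure": 3.0, "kinase": 4.0}

for term, w in combined_scores(x_weights, y_weights):
    print(term, w)
```

Here "seizure" (W = 4.5) ranks above "cortex" (W = 3.0), while "swallowing" and "kinase" are skipped because each has only one of the two weights.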
Further details of this text mining methodology will be submitted for publication in a peer-reviewed journal.
Created:
2012-01-11



