Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA


Figshare | Updated 2025-03-28 | Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Natural_language_processing_of_gene_descriptions_for_overrepresentation_analysis_with_GeneTEA/28635317


Description:

Data and models supporting "Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA" (Boyle et al. 2025).

File descriptions:

GeneTEA.pkl, GeneTEA-yeast.pkl, PharmaTEA.pkl: Pickled GeneTEA models.

Fig2b/d.csv: Top terms in Figure 2b/d.

gProfiler_hsapiens_7-16-2025_4-35-07 PM__intersections.csv: g:GOSt results for Fig2b, downloaded from the g:Profiler site.

gProfiler_hsapiens_7-16-2025_4-41-03 PM__intersections.csv: g:GOSt results for Fig2d, downloaded from the g:Profiler site.

enrichr_sets_03_01_2025.csv: Enrichr database downloaded 3/1/2025, used for Figures 3 and S1.

gene sets for connexin.gmt: Enrichr gene sets containing the term "connexin", downloaded from the Enrichr site.

false_discoveries.csv: Benchmarking results for false discovery control in Figures 4 and S1.

EF_hand_example.csv: Top terms and MedCPT scores for the EF-hand example in Figure 4.

[Hallmark or Experimentally Derived Queries]_scores.csv: Benchmarking results for [Hallmark or Experimentally Derived Queries] across joined top terms in Figures 4 and S2. The "joined_ranking" column corresponds to the MedCPT Relevance score across the top terms, and "num_high_redundancy" contains the number of redundant term pairs.

[Hallmark or Experimentally Derived Queries]_indiv.csv: Benchmarking results for [Hallmark or Experimentally Derived Queries] for each top term in Figures 4 and S2. The "indiv_ranking" column corresponds to the MedCPT Relevance score for a single term.

Fig4_left/right: Examples of top terms shown in what is now Figure 5.

gProfiler_hsapiens_3-13-2025_9-14-59 AM__intersections.csv: g:GOSt results for Fig4_left, downloaded from the g:Profiler site.

gProfiler_hsapiens_2-11-2025_10-18-19 AM__intersections.csv: g:GOSt results for Fig4_right, downloaded from the g:Profiler site.
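For readers who want to inspect the connexin gene sets, the following is a minimal sketch of a reader for the GMT layout (one set per line, tab-separated: set name, description, then gene symbols). It assumes the file follows that common layout; the function name `parse_gmt` and the two example rows are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Iterable, List


def parse_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse GMT-formatted lines into a {set name: [gene symbols]} mapping."""
    gene_sets: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        # fields[1] is the description column, which is often empty
        genes = [g for g in fields[2:] if g]
        gene_sets[name] = genes
    return gene_sets


# Hypothetical rows for illustration only (not the dataset's contents).
example = [
    "Example set A\t\tGJA1\tGJB2\tGJB6\n",
    "Example set B\tdemo description\tGJA1\tGJC1\n",
]
sets = parse_gmt(example)
```

To use it on the downloaded file, pass an open file handle instead of the example list, for instance `parse_gmt(open("gene sets for connexin.gmt"))`. The .pkl files are a different matter: unpickling runs code from the file and needs the classes that were pickled to be importable, so they should only be loaded from a trusted download.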


Created:

2025-03-28
