Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA

DataCite Commons | Updated: 2025-10-10 | Indexed: 2026-04-25

Download link:

https://figshare.com/articles/dataset/Natural_language_processing_of_gene_descriptions_for_overrepresentation_analysis_with_GeneTEA/28635317/2

Description:

Data and models supporting "Natural language processing of gene descriptions for overrepresentation analysis with GeneTEA" (Boyle et al. 2025).

File descriptions:

- GeneTEA.pkl, GeneTEA-yeast.pkl, PharmaTEA.pkl: pickled GeneTEA models.
- Fig2b/d.csv: Top terms in Figure 2 b/d.
- gProfiler_hsapiens_7-16-2025_4-35-07 PM__intersections.csv: g:GOSt results for Fig2b, downloaded from the g:Profiler site.
- gProfiler_hsapiens_7-16-2025_4-41-03 PM__intersections.csv: g:GOSt results for Fig2d, downloaded from the g:Profiler site.
- enrichr_sets_03_01_2025.csv: Enrichr database downloaded 3/1/2025, used for Figure 3 and S1.
- gene sets for connexin.gmt: Enrichr gene sets containing the term "connexin", downloaded from the Enrichr site.
- false_discoveries.csv: Benchmarking results for false discovery control in Figures 4 and S1.
- EF_hand_example.csv: Top terms and MedCPT scores for the EF-hand example in Figure 4.
- [Hallmark or Experimentally Derived Queries]_scores.csv: Benchmarking results for [Hallmark or Experimentally Derived Queries] across joined top terms in Figures 4 and S2. The "joined_ranking" column corresponds to the MedCPT Relevance score across the top terms and "num_high_redundancy" contains the number of redundant term pairs.
- [Hallmark or Experimentally Derived Queries]_indiv.csv: Benchmarking results for [Hallmark or Experimentally Derived Queries] for each top term in Figures 4 and S2. The "indiv_ranking" column corresponds to the MedCPT Relevance score for a single term.
- Fig4_left/right: Examples of top terms shown in what is now Figure 5.
- gProfiler_hsapiens_3-13-2025_9-14-59 AM__intersections.csv: g:GOSt results for Fig4_left, downloaded from the g:Profiler site.
- gProfiler_hsapiens_2-11-2025_10-18-19 AM__intersections.csv: g:GOSt results for Fig4_right, downloaded from the g:Profiler site.
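The .gmt file listed above uses the GMT gene-set layout: one set per line, tab-separated, with the set name first, a description field second, and the gene symbols after that. The following is a minimal sketch of reading such a file with the Python standard library. The function name `read_gmt` and the sample rows (SET_A, GENE1, and so on) are placeholders invented for illustration; they are not taken from the dataset.

```python
import io
from typing import Dict, List, TextIO


def read_gmt(handle: TextIO) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into a {set_name: [genes]} mapping.

    Hypothetical helper for illustration; not part of GeneTEA or Enrichr.
    """
    gene_sets: Dict[str, List[str]] = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines (need at least name + description).
        if len(fields) < 2 or not fields[0]:
            continue
        name = fields[0]
        # fields[1] is the description column, which may be empty.
        # Drop empty strings left behind by trailing tabs.
        gene_sets[name] = [gene for gene in fields[2:] if gene]
    return gene_sets


# Placeholder rows, NOT real content from "gene sets for connexin.gmt".
example = "SET_A\tdesc\tGENE1\tGENE2\nSET_B\t\tGENE2\tGENE3\tGENE4\n"
sets = read_gmt(io.StringIO(example))
print({name: len(genes) for name, genes in sets.items()})
```

To apply the same function to the downloaded file, pass an open file handle instead of the in-memory string, for example `open("gene sets for connexin.gmt", encoding="utf-8")`. The .pkl models are a separate matter: unpickling generally requires the defining classes to be importable and should only be done with files from a trusted source.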

Provider:

figshare

Created:

2025-10-10
