ielabgroup/stella_trec24_biogen_embedding

Name: ielabgroup/stella_trec24_biogen_embedding
Creator: ielabgroup
Published: 2024-11-27 03:53:02
License: not specified

Source: Hugging Face. Updated: 2024-11-27. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/ielabgroup/stella_trec24_biogen_embedding

Description:

This dataset contains stella_en_1.5B_v5 embeddings of the TREC24 BioGen PubMed corpus and test queries. The corpus originally comprised 20723868 samples with unique PMIDs from TREC BioGen; after samples with empty abstracts were removed, 17801589 samples remain. The corpus input text passed to the Stella encoder is the title followed by the abstract, separated by a space. The query input text is the prompt `Instruct: Given a medical query, retrieve documents that answer the query. Query: {query}`. The dataset has two parts, corpus and test_query, with 17801589 and 65 samples respectively. Each sample has two features, id and embedding, where embedding is a float16 sequence of length 1024.
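The input formats and sizes described above can be sketched in Python as follows. This is a minimal illustration, not the creators' code: the function names (`corpus_input`, `query_input`, `corpus_embedding_bytes`, `load_test_queries`) are hypothetical, and passing `test_query` as the `split` argument assumes the part names in the description map directly to Hugging Face split names. The size figure counts raw embedding values only and ignores ids and storage overhead.

```python
# Prompt template quoted in the dataset description.
QUERY_PROMPT = (
    "Instruct: Given a medical query, retrieve documents that answer the query. "
    "Query: {query}"
)

# Figures quoted in the dataset description.
CORPUS_ROWS = 17_801_589
EMBED_DIM = 1024
BYTES_PER_FLOAT16 = 2


def corpus_input(title: str, abstract: str) -> str:
    """Build the corpus-side encoder input: title and abstract joined by a space."""
    return f"{title} {abstract}"


def query_input(query: str) -> str:
    """Build the query-side encoder input by filling the prompt template."""
    return QUERY_PROMPT.format(query=query)


def corpus_embedding_bytes() -> int:
    """Raw size of the corpus embeddings (values only, no ids or file overhead)."""
    return CORPUS_ROWS * EMBED_DIM * BYTES_PER_FLOAT16


def load_test_queries():
    """Load the 65 query embeddings.

    Assumption: the `datasets` library is installed and the part named
    "test_query" is exposed as a split of that name.
    """
    from datasets import load_dataset  # imported lazily; requires network access

    return load_dataset(
        "ielabgroup/stella_trec24_biogen_embedding", split="test_query"
    )


if __name__ == "__main__":
    print(query_input("What are the symptoms of anemia?"))
    print(f"{corpus_embedding_bytes() / 1e9:.1f} GB of raw float16 embeddings")
```

Since the corpus part is roughly 36 GB of raw embedding values, streaming it (for example with `streaming=True` in `load_dataset`) may be preferable to downloading it in full before a search.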

Provided by:

ielabgroup
