RPKB
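R-Package Knowledge Base (RPKB) Dataset Overview
Basic Information
- License: Apache-2.0
- Task categories: text retrieval, question answering
- Language: English
- Tags: R, ChromaDB, tool retrieval, data science, LLM agents
- Size category: n<10K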
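Dataset Summary
This dataset is the official precomputed ChromaDB vector database for the paper "DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval". It contains 8,191 high-quality R functions curated from CRAN, each accompanied by extracted statistical metadata (data profiles) and a precomputed embedding generated by the **DARE model**.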
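Database Overview
- Database engine: ChromaDB
- Total documents: 8,191 R functions
- Embedding model: Stephen-SMJ/DARE-R-Retriever
- Primary use case: tool retrieval for LLM agents that run data science and statistical workflows in R.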
Usage
1. Install dependencies

    pip install huggingface_hub chromadb sentence-transformers
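Before moving on, it can help to confirm that the three packages are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the constant `REQUIRED_MODULES` are illustrative and not part of RPKB or DARE.

```python
import importlib.util

# Import names for the three pip packages. The pip package
# "sentence-transformers" is imported as "sentence_transformers".
REQUIRED_MODULES = ["huggingface_hub", "chromadb", "sentence_transformers"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All RPKB dependencies are importable.")
```

The check only looks up each module without importing it, so it stays fast even though the real libraries are heavy to load.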
2. Download RPKB and connect

    from huggingface_hub import snapshot_download
    import chromadb

    # 1. Download the database folder from Hugging Face
    db_path = snapshot_download(
        repo_id="Stephen-SMJ/RPKB",
        repo_type="dataset",
        allow_patterns="RPKB/*",
    )

    # 2. Connect to the local ChromaDB instance
    client = chromadb.PersistentClient(path=f"{db_path}/RPKB")

    # 3. Access the specific collection
    collection = client.get_collection(name="inference")

    print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
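If the download was incomplete or the path is wrong, the connection step can fail in a confusing way: to the best of my knowledge, `chromadb.PersistentClient` creates a fresh, empty database when pointed at a path that does not exist yet, so the error only surfaces later at `get_collection`. A small guard can catch this earlier. This is a sketch only; the helper `resolve_db_dir` is hypothetical, and the `RPKB` subfolder name is taken from the download pattern above.

```python
from pathlib import Path


def resolve_db_dir(snapshot_path, subfolder="RPKB"):
    """Return the database folder inside a downloaded snapshot, or raise.

    `snapshot_path` is the value returned by `snapshot_download`.
    """
    db_dir = Path(snapshot_path) / subfolder
    if not db_dir.is_dir():
        raise FileNotFoundError(f"Expected database folder not found: {db_dir}")
    if not any(db_dir.iterdir()):
        raise FileNotFoundError(f"Database folder is empty: {db_dir}")
    return db_dir


# Intended use with the objects from the step above (not executed here):
#   client = chromadb.PersistentClient(path=str(resolve_db_dir(db_path)))
```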
3. Run R package retrieval

    from sentence_transformers import SentenceTransformer

    # Load the DARE embedding model
    model = SentenceTransformer("Stephen-SMJ/DARE-R-Retriever")

    # Build a query that includes the data constraints
    user_query = "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. I need to identify driver elements by estimating regulatory scores based on the counts provided in the data. Please set the random seed to 123 at the start. I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. For my evaluation, please print the first value of the estimated scores (est_a) for the very first region identified."

    # Generate the embedding
    query_embedding = model.encode(user_query).tolist()

    # Search the database (described as a hard-filter search; note that this call passes no `where` filter)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
        include=["metadatas", "distances", "documents"],
    )

    # Show the top-1 result
    print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
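The double index `[0][0]` in the last line reflects how Chroma shapes query output: each requested field holds one inner list per query embedding, ordered from nearest to farthest. The sketch below flattens that structure into ranked rows so that all returned candidates can be printed, not only the first. The helper `flatten_query_results` is illustrative, and `sample_results` is a hypothetical stand-in with made-up package names and distances, not real RPKB output; only the `package_name` and `function_name` keys come from the example above.

```python
def flatten_query_results(results, query_index=0):
    """Turn Chroma's nested query output into a list of ranked rows."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    rows = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        rows.append({
            "rank": rank,
            "function": f'{meta["package_name"]}::{meta["function_name"]}',
            "distance": dist,
        })
    return rows


# Hypothetical data shaped like a `collection.query` result for one query.
sample_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "score_regions"},
    ]],
    "distances": [[0.21, 0.34]],
}

for row in flatten_query_results(sample_results):
    print(f'{row["rank"]}. {row["function"]} (distance={row["distance"]:.2f})')
```

With a real result, passing `results` from the query above in place of `sample_results` should list all three retrieved functions.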

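The retrieval step is described as a hard-filter search, although the call shown passes no filter. In ChromaDB, metadata filters are supplied through the `where` argument of `collection.query`. The following sketch builds such a clause for the `package_name` metadata key used above; the helper `package_filter` and the package names are hypothetical, and which other metadata keys RPKB exposes is not stated here.

```python
def package_filter(package_names):
    """Build a Chroma `where` clause restricting results to given packages.

    Returns None when no restriction is requested.
    """
    names = list(package_names)
    if not names:
        return None
    if len(names) == 1:
        return {"package_name": {"$eq": names[0]}}
    return {"package_name": {"$in": names}}


# Intended use with the objects from the retrieval step (not executed here):
#   results = collection.query(
#       query_embeddings=[query_embedding],
#       n_results=3,
#       where=package_filter(["pkgA", "pkgB"]),
#   )
print(package_filter(["pkgA", "pkgB"]))
```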


