valiantlynxz/tripletex-tool-embeddings

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/valiantlynxz/tripletex-tool-embeddings
---
license: cc-by-nc-4.0
configs:
- config_name: embeddings
  data_files: "embeddings.parquet"
  default: true
- config_name: tools
  data_files: "tools.parquet"
task_categories:
- feature-extraction
- text-classification
language:
- "en"
- "no"
tags:
- embeddings
- openapi
- tripletex
- accounting
- api-tools
- rag
- lancedb
- pydantic
size_categories:
- n<1K
---

# Tripletex API Tool Embeddings

Pre-computed embeddings for 800 Tripletex accounting API tools, extracted from the OpenAPI 3.0.1 spec and embedded with Google `gemini-embedding-001` (3072 dimensions).

Built for RAG-based tool filtering in the [AI Accounting Agent](https://github.com/kuben-labs/nmai) competition project.

## Quick Start

```python
from datasets import load_dataset

# Full embeddings (800 tools, 3072-dim vectors), ready for RAG
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")

# Lightweight: tool metadata only, no embedding vectors
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings", name="tools")
```

## Configurations

| Config | Default | Columns | Size | Use case |
|--------|---------|---------|------|----------|
| `embeddings` | Yes | name, description, parameters, embedding | ~9 MB | RAG search, vector index |
| `tools` | No | name, description, parameters | ~50 KB | Browsing, filtering, analysis |

## Schema

### `embeddings` config

```python
from datasets import load_dataset

ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
example = ds["train"][0]

example["name"]         # "AccountantDashboardNews_get"
example["description"]  # "Get public news articles"
example["parameters"]   # '{"from": "integer", "count": "integer", ...}' (JSON string)
example["embedding"]    # [3072 floats], a gemini-embedding-001 vector
```

### `tools` config

```python
from datasets import load_dataset

ds = load_dataset("valiantlynxz/tripletex-tool-embeddings", name="tools")
example = ds["train"][0]

example["name"]         # "AccountantDashboardNews_get"
example["description"]  # "Get public news articles"
example["parameters"]   # '{"from": "integer", "count": "integer", ...}' (JSON string)
```

## Data Summary

- **800 API tools** from the Tripletex accounting API (OpenAPI 3.0.1)
- **3072-dimensional embeddings** via Google `gemini-embedding-001`
- **Parameters** stored as JSON strings mapping param names to types
- **Source:** `openapi.json` included in this repo (3.5 MB, 546 paths, 2167 schemas)

## Using with LanceDB

```python
from datasets import load_dataset
import lancedb

ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")

# Convert to LanceDB
db = lancedb.connect(".tool_embeddings")
records = [
    {
        "name": row["name"],
        "description": row["description"],
        "parameters": row["parameters"],
        "embedding": row["embedding"],
    }
    for row in ds["train"]
]
table = db.create_table("tools", data=records, mode="overwrite")

# Search
# query_embedding is not defined here: supply a 3072-dim vector for your
# query text, produced with the same model (gemini-embedding-001).
results = table.search(query_embedding).limit(100).to_list()
```

## Using with FAISS

```python
from datasets import load_dataset
import numpy as np
import faiss

ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
embeddings = np.array(ds["train"]["embedding"], dtype=np.float32)

# Inner product on L2-normalized vectors is cosine similarity
index = faiss.IndexFlatIP(3072)
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Search
# query_embedding is not defined here: supply a 3072-dim vector for your
# query text, produced with the same model (gemini-embedding-001).
query = np.array([query_embedding], dtype=np.float32)
faiss.normalize_L2(query)
distances, indices = index.search(query, k=100)
tool_names = [ds["train"][int(i)]["name"] for i in indices[0]]
```

## Regenerating Embeddings

The `scripts/` directory contains the original embedding pipeline:

- `scripts/embeddings.py`: Google Gemini embedding provider
- `scripts/rag_tool_filter.py`: OpenAPI-to-embedding pipeline + LanceDB vector store

```python
# Requires: google-genai, lancedb
# Requires: GCP_API_KEY environment variable
import asyncio
import json

from scripts.embeddings import get_embedding_provider
from scripts.rag_tool_filter import ToolEmbedder, ToolVectorStore, index_openapi_tools

with open("openapi.json") as f:
    spec = json.load(f)

provider = get_embedding_provider()
embedder = ToolEmbedder(provider)
store = ToolVectorStore(".tool_embeddings")
asyncio.run(index_openapi_tools(spec, store, embedder))
```

## Repo Structure

```
tripletex-tool-embeddings/
├── README.md
├── embeddings.parquet       # 800 tools with 3072-dim embeddings
├── tools.parquet            # 800 tools metadata only (lightweight)
├── openapi.json             # Source Tripletex OpenAPI 3.0.1 spec (3.5 MB)
└── scripts/
    ├── embeddings.py        # Google Gemini embedding provider
    └── rag_tool_filter.py   # OpenAPI extraction + LanceDB indexing
```

## Source Project

Part of the [nmai](https://github.com/kuben-labs/nmai) project (`ai-accounting-agent/`).
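
## Decoding the `parameters` Column

Because `parameters` is stored as a JSON string rather than a nested structure, it has to be decoded before tools can be filtered by parameter name. The following is a minimal sketch under that assumption. The helpers `parse_parameters` and `tools_with_param` are hypothetical names introduced for illustration, and the sample rows are made up to mirror the documented schema; they are not taken from the dataset.

```python
import json


def parse_parameters(row):
    """Decode the JSON-string `parameters` field into a dict of name -> type."""
    raw = row.get("parameters") or "{}"  # treat a missing or empty field as no parameters
    return json.loads(raw)


def tools_with_param(rows, param_name):
    """Return the names of tools whose decoded parameters include `param_name`."""
    return [row["name"] for row in rows if param_name in parse_parameters(row)]


# Hypothetical rows shaped like the documented schema (not real dataset entries).
sample_rows = [
    {
        "name": "ExampleNews_get",
        "description": "Illustrative tool with paging parameters",
        "parameters": '{"from": "integer", "count": "integer"}',
    },
    {
        "name": "ExampleInvoice_post",
        "description": "Illustrative tool with a single parameter",
        "parameters": '{"id": "integer"}',
    },
]

matching = tools_with_param(sample_rows, "count")
print(matching)
```

With the real data, the same helpers should work on rows from the lightweight `tools` config, which avoids downloading the embedding vectors when only metadata filtering is needed.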

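
## Exact Search Without an Index

With only 800 vectors, an exact cosine-similarity ranking is feasible without LanceDB or FAISS. The sketch below uses only the Python standard library and ranks tools by the same measure the FAISS example computes (inner product of normalized vectors). The function names `cosine_similarity` and `top_k_tools` are hypothetical, and the 4-dimensional toy vectors stand in for the real 3072-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_tools(names, embeddings, query, k=5):
    """Return up to k (name, score) pairs, highest cosine similarity first."""
    scored = [
        (cosine_similarity(vec, query), name)
        for name, vec in zip(names, embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy stand-ins for dataset rows (hypothetical names and vectors).
toy_names = ["tool_a", "tool_b", "tool_c"]
toy_embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
]
toy_query = [1.0, 0.1, 0.0, 0.0]

results = top_k_tools(toy_names, toy_embeddings, toy_query, k=2)
for name, score in results:
    print(name, round(score, 3))
```

For the real dataset, `names` and `embeddings` would come from the `name` and `embedding` columns, and the query vector would need to come from the same embedding model. A vectorized NumPy version would be considerably faster; this form is only meant to show the ranking logic.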
Provider:
valiantlynxz