valiantlynxz/tripletex-tool-embeddings
Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/valiantlynxz/tripletex-tool-embeddings
---
license: cc-by-nc-4.0
configs:
- config_name: embeddings
  data_files: "embeddings.parquet"
  default: true
- config_name: tools
  data_files: "tools.parquet"
task_categories:
- feature-extraction
- text-classification
language:
- "en"
- "no"
tags:
- embeddings
- openapi
- tripletex
- accounting
- api-tools
- rag
- lancedb
- pydantic
size_categories:
- n<1K
---
# Tripletex API Tool Embeddings
Pre-computed embeddings for 800 Tripletex accounting API tools, extracted from the OpenAPI 3.0.1 spec and embedded with Google `gemini-embedding-001` (3072 dimensions).
Built for RAG-based tool filtering in the [AI Accounting Agent](https://github.com/kuben-labs/nmai) competition project.
## Quick Start
```python
from datasets import load_dataset
# Full embeddings (800 tools, 3072-dim vectors) — ready for RAG
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
# Lightweight: tool metadata only, no embedding vectors
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings", name="tools")
```
## Configurations
| Config | Default | Columns | Size | Use case |
|--------|---------|---------|------|----------|
| `embeddings` | Yes | name, description, parameters, embedding | ~9 MB | RAG search, vector index |
| `tools` | | name, description, parameters | ~50 KB | Browsing, filtering, analysis |
## Schema
### `embeddings` config
```python
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
example = ds["train"][0]
example["name"] # "AccountantDashboardNews_get"
example["description"] # "Get public news articles"
example["parameters"] # '{"from": "integer", "count": "integer", ...}' (JSON string)
example["embedding"] # [3072 floats] — gemini-embedding-001 vector
```
### `tools` config
```python
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings", name="tools")
example = ds["train"][0]
example["name"] # "AccountantDashboardNews_get"
example["description"] # "Get public news articles"
example["parameters"] # '{"from": "integer", "count": "integer", ...}' (JSON string)
```
## Data Summary
- **800 API tools** from the Tripletex accounting API (OpenAPI 3.0.1)
- **3072-dimensional embeddings** via Google `gemini-embedding-001`
- **Parameters** stored as JSON strings mapping param names to types
- **Source:** `openapi.json` included in this repo (3.5 MB, 546 paths, 2167 schemas)
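Because `parameters` is stored as a JSON string, it has to be decoded before the names and types can be inspected. The following is a minimal sketch, assuming the flat name-to-type shape shown in the schema examples; the helper name `parse_parameters` and the sample string are illustrative and are not part of this repo.

```python
import json


def parse_parameters(raw: str) -> dict[str, str]:
    """Decode a `parameters` JSON string into a name -> type mapping."""
    if not raw:
        return {}
    decoded = json.loads(raw)
    if not isinstance(decoded, dict):
        raise ValueError("expected a JSON object of parameter names to types")
    return {str(name): str(type_name) for name, type_name in decoded.items()}


# Illustrative value shaped like the schema example, not an actual dataset row.
sample = '{"from": "integer", "count": "integer"}'
params = parse_parameters(sample)

# Example use: list the parameters declared as integers.
integer_params = [name for name, type_name in params.items() if type_name == "integer"]
```

The same helper could be applied to every row, for example with `ds["train"].map(...)`, if the decoded mapping is needed as a column.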
## Using with LanceDB
```python
from datasets import load_dataset
import lancedb
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
# Convert to LanceDB
db = lancedb.connect(".tool_embeddings")
records = [
    {
        "name": row["name"],
        "description": row["description"],
        "parameters": row["parameters"],
        "embedding": row["embedding"],
    }
    for row in ds["train"]
]
table = db.create_table("tools", data=records, mode="overwrite")
# Search: query_embedding is not defined in this snippet; it must be a
# 3072-dim vector produced by the same model (gemini-embedding-001)
results = table.search(query_embedding).limit(100).to_list()
```
## Using with FAISS
```python
from datasets import load_dataset
import numpy as np
import faiss
ds = load_dataset("valiantlynxz/tripletex-tool-embeddings")
embeddings = np.array(ds["train"]["embedding"], dtype=np.float32)
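# Inner product over L2-normalized vectors equals cosine similarity
index = faiss.IndexFlatIP(3072)
faiss.normalize_L2(embeddings)
index.add(embeddings)
# Search: query_embedding is not defined in this snippet; it must be a
# 3072-dim vector produced by the same model (gemini-embedding-001)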
query = np.array([query_embedding], dtype=np.float32)
faiss.normalize_L2(query)
distances, indices = index.search(query, k=100)
tool_names = [ds["train"][int(i)]["name"] for i in indices[0]]
```
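The FAISS snippet normalizes both the stored vectors and the query before an inner-product search, which amounts to ranking by cosine similarity. As a rough illustration of that computation without FAISS, here is a small NumPy sketch. The function name `top_k_cosine` and the three-dimensional toy vectors are made up for the example; real rows are 3072-dimensional.

```python
import numpy as np


def top_k_cosine(matrix: np.ndarray, query: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (scores, indices) of the k rows most similar to `query`.

    Assumes no row (and not the query) has zero norm.
    """
    # Scale rows and query to unit length so the dot product is the cosine.
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = rows @ q
    k = min(k, len(scores))
    # Sort by descending similarity and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return scores[order], order


# Toy data standing in for the embedding matrix and a query embedding.
toy_vectors = np.array(
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]], dtype=np.float32
)
toy_query = np.array([1.0, 0.1, 0.0], dtype=np.float32)
scores, indices = top_k_cosine(toy_vectors, toy_query, k=2)
```

For 800 rows a brute-force scan like this is typically fast enough; an index such as FAISS or LanceDB matters more as the collection grows.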
## Regenerating Embeddings
The `scripts/` directory contains the original embedding pipeline:
- `scripts/embeddings.py` — Google Gemini embedding provider
- `scripts/rag_tool_filter.py` — OpenAPI-to-embedding pipeline + LanceDB vector store
```python
# Requires: google-genai, lancedb
# Requires: GCP_API_KEY environment variable
from scripts.embeddings import get_embedding_provider
from scripts.rag_tool_filter import ToolEmbedder, ToolVectorStore, index_openapi_tools
import asyncio
import json

with open("openapi.json") as f:
    spec = json.load(f)
provider = get_embedding_provider()
embedder = ToolEmbedder(provider)
store = ToolVectorStore(".tool_embeddings")
asyncio.run(index_openapi_tools(spec, store, embedder))
```
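To give a feel for the extraction step, the sketch below flattens an OpenAPI document into `name` / `description` / `parameters` records shaped like the rows in this dataset. It is an assumption-laden illustration, not the logic in `scripts/rag_tool_filter.py`: the function `extract_tools`, the use of `operationId` as the tool name, the fallback naming, and the toy spec are all hypothetical.

```python
import json

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def extract_tools(spec: dict) -> list[dict]:
    """Flatten an OpenAPI spec into name/description/parameters records."""
    tools = []
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            # Assumption: fall back to "<method> <path>" when operationId is absent.
            name = operation.get("operationId", f"{method} {path}")
            description = operation.get("summary") or operation.get("description", "")
            params = {
                p["name"]: p.get("schema", {}).get("type", "unknown")
                for p in operation.get("parameters", [])
                if "name" in p  # $ref parameters are not resolved in this sketch
            }
            tools.append(
                {
                    "name": name,
                    "description": description,
                    "parameters": json.dumps(params),  # stored as a JSON string
                }
            )
    return tools


# Tiny made-up spec, not taken from the Tripletex openapi.json.
toy_spec = {
    "paths": {
        "/news": {
            "get": {
                "operationId": "News_get",
                "summary": "Get news",
                "parameters": [
                    {"name": "count", "in": "query", "schema": {"type": "integer"}}
                ],
            }
        }
    }
}
tools = extract_tools(toy_spec)
```

Each record's text (for example the name plus the description) could then be passed to an embedding provider; which fields the original pipeline actually embeds is not stated here.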
## Repo Structure
```
tripletex-tool-embeddings/
├── README.md
├── embeddings.parquet # 800 tools with 3072-dim embeddings
├── tools.parquet # 800 tools metadata only (lightweight)
├── openapi.json # Source Tripletex OpenAPI 3.0.1 spec (3.5 MB)
└── scripts/
    ├── embeddings.py # Google Gemini embedding provider
    └── rag_tool_filter.py # OpenAPI extraction + LanceDB indexing
```
## Source Project
Part of the [nmai](https://github.com/kuben-labs/nmai) project — `ai-accounting-agent/`.
Provider: valiantlynxz



