biztiger/nyayabench-v2
---
license: cc-by-4.0
language:
- en
- hi
- es
- ar
- zh
- ja
- fr
- de
- pt
- ru
- ko
- tr
- multilingual
pretty_name: NyayaBench v2
size_categories:
- 1K<n<10K
task_categories:
- text-classification
tags:
- intent-classification
- agent-caching
- w5h2
- few-shot
- multilingual
- agentic
- setfit
- contrastive-learning
- llm-cost-reduction
configs:
- config_name: default
  data_files:
  - split: full
    path: data/nyayabench_v2.jsonl
  - split: train
    path: data/train.jsonl
  - split: test_en
    path: data/test_en.jsonl
  - split: test_multilingual
    path: data/test_multilingual.jsonl
dataset_info:
  features:
  - name: query
    dtype: string
  - name: intent
    dtype: string
  - name: w5h2_class
    dtype: string
  - name: language
    dtype: string
  - name: source
    dtype: string
  - name: popularity
    dtype: int64
  - name: is_compound
    dtype: bool
  - name: discovered_intent
    dtype: bool
  - name: region
    dtype: string
  - name: dialect
    dtype: string
  - name: register
    dtype: string
  - name: has_params
    dtype: bool
  splits:
  - name: full
    num_examples: 8514
  - name: train
    num_examples: 160
  - name: test_en
    num_examples: 1193
  - name: test_multilingual
    num_examples: 7161
---
# NyayaBench v2
A real-world agentic intent classification benchmark sourced from production personal AI agent interactions. Unlike synthetic benchmarks, NyayaBench v2 captures the messy reality of how people actually talk to AI agents — compound queries, regional phrasing, and the long tail of 528 intents that existing caching methods can't handle.
**Paper:** [Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning](https://arxiv.org/abs/2602.18922) (arXiv:2602.18922)
**Code:** [nabaos/w5h2-intent-cache](https://github.com/nabaos/w5h2-intent-cache) — W5H2 taxonomy, SetFit evaluation, ablation studies, ONNX export
## Why This Dataset
Most intent classification benchmarks (MASSIVE, CLINC150, BANKING77) are collected in controlled settings with clean, single-intent utterances. Real agent traffic looks different:
- **Compound queries** — "check my email and set a reminder for the meeting at 3"
- **Extreme class imbalance** — `check_info` has 1,528 entries, `add_list` has 50
- **63 languages** — not just translated templates, but region-specific intents (Japanese transit queries, Latin American fintech, South Asian agriculture)
- **528 fine-grained intents** that must be compressed to 20 actionable cache keys
This makes NyayaBench v2 substantially harder than established benchmarks and a better proxy for production agent systems.
## The Problem It Tests
GPTCache and similar embedding-similarity caches achieve only a 3–38% hit rate on personal agent tasks. The paper argues that this happens because they optimize for the wrong property: semantic similarity instead of *key consistency*. W5H2 decomposes intents into **action × target** pairs (e.g., `check` × `weather` → `check_info`) and uses few-shot contrastive learning to classify them in ~2ms, reaching 91.1% accuracy on MASSIVE with just 8 examples per class. Accuracy on NyayaBench v2 itself is lower (see Benchmark Results).
This approach is used in production by [NabaOS](https://github.com/nabaos/nabaos), an open-source autonomous agent runtime, where the W5H2 cache tier handles 85% of interactions locally at near-zero cost as part of a five-tier cascade architecture.
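The key-consistency idea described above can be illustrated with a minimal sketch: the cache is keyed by the predicted W5H2 class rather than by embedding distance, so two differently phrased queries share an entry whenever the classifier maps them to the same class. Everything here is an assumption made for illustration: `IntentKeyCache`, the opaque "plan" string stored as the payload, and `toy_classify`, a keyword stub standing in for the few-shot classifier. None of it is taken from the paper's code.

```python
from typing import Callable, Dict, Optional


class IntentKeyCache:
    """Cache keyed by a canonical intent class (hypothetical helper)."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify         # maps a query to a W5H2 class name
        self._store: Dict[str, str] = {}  # W5H2 class -> cached plan (assumed payload)
        self.hits = 0
        self.misses = 0

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached plan for the query's class, or None on a miss."""
        plan = self._store.get(self._classify(query))
        if plan is None:
            self.misses += 1
        else:
            self.hits += 1
        return plan

    def store(self, query: str, plan: str) -> None:
        """Cache a plan under the class predicted for this query."""
        self._store[self._classify(query)] = plan


def toy_classify(query: str) -> str:
    """Keyword stub standing in for a trained classifier (assumption)."""
    text = query.lower()
    if "weather" in text or "news" in text:
        return "check_info"
    if "remind" in text or "alarm" in text:
        return "set_schedule"
    return "search_web"


cache = IntentKeyCache(toy_classify)
cache.store("what's the weather in Pune?", "plan:check_info")
print(cache.lookup("weather tomorrow please"))
```

Because the key is a class rather than a full query, any parameters (city, time, recipient) would have to be extracted separately before a cached plan can be reused; this sketch does not attempt that.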
## Overview
| Property | Value |
|----------|-------|
| Total entries | 8,514 |
| Fine-grained intents | 528 |
| W5H2 super-classes | 20 |
| Languages | 63 |
| Sources | Voice assistants, smart home, IoT, productivity agents, regional deployments |
| Compound queries | Yes (flagged) |
## Quick Start
```python
from datasets import load_dataset

ds = load_dataset("biztiger/nyayabench-v2", data_files={
    "full": "data/nyayabench_v2.jsonl",
    "train": "data/train.jsonl",
    "test_en": "data/test_en.jsonl",
    "test_multilingual": "data/test_multilingual.jsonl",
})

# 8-shot train split (160 examples, 8 per W5H2 class)
print(ds["train"][0])
# {'query': '...', 'intent': '...', 'w5h2_class': 'check_info', 'language': 'en', ...}
```
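Once a split is loaded, a few small helpers make it easier to slice by the metadata fields. The sketch below works on any iterable of row dictionaries, which should include a loaded split, but it is demonstrated on hand-written toy rows so that it runs without network access. The helper names and the toy rows are illustrative assumptions, not part of the dataset or its tooling.

```python
from collections import Counter
from typing import Dict, Iterable, List

Row = Dict[str, object]


def compound_rows(rows: Iterable[Row]) -> List[Row]:
    """Keep only queries flagged as containing multiple intents."""
    return [row for row in rows if row["is_compound"]]


def rows_for_language(rows: Iterable[Row], code: str) -> List[Row]:
    """Keep only rows whose `language` field equals the given code."""
    return [row for row in rows if row["language"] == code]


def class_counts(rows: Iterable[Row]) -> Counter:
    """Count rows per W5H2 super-class."""
    return Counter(row["w5h2_class"] for row in rows)


# Toy rows that mimic the schema (made up for illustration).
toy_rows: List[Row] = [
    {"query": "check my email and set a reminder for the meeting at 3",
     "w5h2_class": "check_comms", "language": "en", "is_compound": True},
    {"query": "turn off the kitchen lights",
     "w5h2_class": "control_home", "language": "en", "is_compound": False},
    {"query": "enciende las luces",
     "w5h2_class": "control_home", "language": "es", "is_compound": False},
]

print(len(compound_rows(toy_rows)))
print(class_counts(rows_for_language(toy_rows, "en")))
```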
## W5H2 Taxonomy (20 Classes)
The W5H2 framework decomposes intents into **action** × **target** pairs:
| Class | Description | Count | Examples |
|-------|-------------|-------|----------|
| `check_info` | Query information | 1,528 | Weather, news, prices, sports, traffic |
| `order_commerce` | Commerce & booking | 1,084 | Food, rides, hotels, tickets |
| `analyze_data` | Analyze data | 992 | Financial, sentiment, data analysis |
| `search_web` | Search & discover | 841 | Web search, directions, recipes |
| `create_content` | Generate content | 623 | Write code, documents, presentations |
| `control_home` | Control smart home | 603 | Lights, thermostat, locks, vacuum |
| `manage_tasks` | Manage tasks/files | 505 | Files, notes, expenses, fitness |
| `send_comms` | Send communications | 309 | Message, email, call |
| `play_media` | Play media | 304 | Music, radio, ambient sounds |
| `convert_data` | Convert/calculate | 262 | Translate, units, currency |
| `research_info` | Research & analysis | 245 | Market research, competitor analysis |
| `check_home` | Check home status | 202 | Camera, door, battery, temperature |
| `control_media` | Control media | 199 | Volume, TV, podcast controls |
| `set_schedule` | Set time-based | 199 | Alarm, reminder, timer |
| `check_comms` | Check communications | 139 | Read email, email queries |
| `run_routine` | Execute routines | 137 | Morning routine, smart home scenes |
| `check_schedule` | Check calendar | 128 | Calendar events, reminders |
| `manage_comms` | Manage communications | 103 | Email triage, classification |
| `schedule_meeting` | Schedule meetings | 61 | Meeting scheduling |
| `add_list` | Add to lists | 50 | Shopping list, todo list |
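The class imbalance in the table above can be quantified directly from the listed counts. The short sketch below copies those counts into a dictionary, checks that they add up to the dataset total, and derives the largest-to-smallest ratio and the share of the largest class. The variable names are illustrative, and the share is computed over the full dataset, not over any test split.

```python
# Per-class counts copied from the taxonomy table above.
W5H2_COUNTS = {
    "check_info": 1528, "order_commerce": 1084, "analyze_data": 992,
    "search_web": 841, "create_content": 623, "control_home": 603,
    "manage_tasks": 505, "send_comms": 309, "play_media": 304,
    "convert_data": 262, "research_info": 245, "check_home": 202,
    "control_media": 199, "set_schedule": 199, "check_comms": 139,
    "run_routine": 137, "check_schedule": 128, "manage_comms": 103,
    "schedule_meeting": 61, "add_list": 50,
}

total = sum(W5H2_COUNTS.values())
largest = max(W5H2_COUNTS, key=W5H2_COUNTS.get)
smallest = min(W5H2_COUNTS, key=W5H2_COUNTS.get)

# Ratio between the most and least populated classes.
imbalance_ratio = W5H2_COUNTS[largest] / W5H2_COUNTS[smallest]
# Fraction of the full dataset that belongs to the largest class.
largest_share = W5H2_COUNTS[largest] / total

print(f"total={total}, classes={len(W5H2_COUNTS)}")
print(f"{largest} / {smallest} ratio: {imbalance_ratio:.2f}")
print(f"{largest} share of full dataset: {largest_share:.1%}")
```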
## Splits
| Split | Size | Description |
|-------|------|-------------|
| `full` | 8,514 | Complete dataset, all languages |
| `train` | 160 | 8-shot stratified (8 per class, seed=42) — matches paper setup |
| `test_en` | 1,193 | English test set |
| `test_multilingual` | 7,161 | 30 translated + 32 source languages |
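For readers who want to draw their own k-shot subsets, stratified sampling of the kind described for the `train` split (8 per class across 20 classes gives 160 rows) can be sketched as follows. This is an assumption about the general procedure only: `k_shot_sample` is a hypothetical helper, and running it with `seed=42` is not expected to reproduce the published `train.jsonl`, since the authors' exact sampling code and row ordering are not specified here.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, object]


def k_shot_sample(rows: Iterable[Row], k: int = 8, seed: int = 42,
                  label_key: str = "w5h2_class") -> List[Row]:
    """Draw exactly k rows per class, deterministically for a given seed."""
    by_class: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    rng = random.Random(seed)
    sample: List[Row] = []
    # Iterate classes in sorted order so the result does not depend on input order of classes.
    for label in sorted(by_class, key=str):
        pool = by_class[label]
        if len(pool) < k:
            raise ValueError(f"class {label!r} has only {len(pool)} rows, need {k}")
        sample.extend(rng.sample(pool, k))
    return sample


# Toy pool: 3 made-up classes with 10 rows each.
toy_pool = [{"w5h2_class": name, "query": f"{name}-{i}"}
            for name in ("check_info", "add_list", "play_media")
            for i in range(10)]
subset = k_shot_sample(toy_pool, k=8, seed=42)
print(len(subset))
```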
## Schema
```json
{
  "query": "research the top 5 competitor products and summarize their pricing",
  "intent": "research_competitors",
  "w5h2_class": "research_info",
  "language": "en",
  "source": "auto_gpt_examples",
  "popularity": 8,
  "is_compound": true,
  "discovered_intent": true,
  "region": "north_america",
  "dialect": null,
  "register": "informal",
  "has_params": true
}
```
| Field | Description |
|-------|-------------|
| `query` | Natural language user utterance |
| `intent` | Fine-grained intent label (528 unique) |
| `w5h2_class` | W5H2 super-class (20 unique) |
| `language` | ISO 639-1 language code |
| `source` | Collection source |
| `popularity` | Usage frequency score (1–8) |
| `is_compound` | Whether query contains multiple intents |
| `discovered_intent` | Whether intent was discovered during collection (vs. predefined) |
| `region` | Geographic region (e.g., `east_asia`, `south_asia`, `latin_america`) — multilingual entries only |
| `dialect` | Language dialect or variant where applicable |
| `register` | Formality register (`formal`, `informal`, `colloquial`) |
| `has_params` | Whether query contains extractable parameters (times, names, quantities) |
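A lightweight check against the field table above can catch malformed rows before training or evaluation. The sketch below is a minimal example, not an official validator: `validate_record` is a hypothetical helper, and treating `region` and `dialect` as nullable is an assumption based on the example record (where `dialect` is `null`) and the note that `region` applies to some entries only.

```python
from typing import Dict, List, Tuple

# Expected Python types per field, following the feature list in the card metadata.
FIELD_TYPES: Dict[str, Tuple[type, ...]] = {
    "query": (str,), "intent": (str,), "w5h2_class": (str,),
    "language": (str,), "source": (str,), "popularity": (int,),
    "is_compound": (bool,), "discovered_intent": (bool,),
    "region": (str,), "dialect": (str,), "register": (str,),
    "has_params": (bool,),
}
# Assumption: only these fields may be null.
NULLABLE = {"region", "dialect"}


def validate_record(record: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems: List[str] = []
    for field, types in FIELD_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if field not in NULLABLE:
                problems.append(f"{field} must not be null")
            continue
        # bool is a subclass of int in Python, so reject True/False for integer fields.
        if isinstance(value, bool) and bool not in types:
            problems.append(f"{field} has wrong type: bool")
        elif not isinstance(value, types):
            problems.append(f"{field} has wrong type: {type(value).__name__}")
    return problems


example = {
    "query": "research the top 5 competitor products and summarize their pricing",
    "intent": "research_competitors", "w5h2_class": "research_info",
    "language": "en", "source": "auto_gpt_examples", "popularity": 8,
    "is_compound": True, "discovered_intent": True, "region": "north_america",
    "dialect": None, "register": "informal", "has_params": True,
}
print(validate_record(example))
```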
## Benchmark Results
From the paper, evaluated on the English test split (1,193 examples, 20 classes):
| Method | Accuracy | V-measure | Latency |
|--------|----------|-----------|---------|
| GPTCache (cosine threshold) | 37.9% | — | ~50ms |
| GPTCache (KMeans k=20) | 49.1% | 0.397 | ~50ms |
| LLM baseline (20B) | 68.8% | — | 3,447ms |
| **SetFit 8-shot (22M params)** | **55.3% ± 1.0%** | **0.504** | **2.4ms** |
| SetFit 16-shot | 62.6% | 0.558 | 2.4ms |
| BERT fine-tuned (full data) | 97.3% | 0.926 | ~5ms |
NyayaBench v2 is intentionally harder than MASSIVE (where SetFit gets 91.1%): the 528→20 compression means each class absorbs about 26 fine-grained intents on average, and with them dozens of semantically diverse phrasings. This reflects the real difficulty of agent caching.
Cross-lingual zero-shot transfer (trained on English only): mean 37.7% across 30 languages, with 5 languages above 50%.
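The V-measure column in the results table is a clustering-style score. Assuming the paper uses the standard definition (the harmonic mean of homogeneity and completeness, as in scikit-learn's `v_measure_score` with its default weighting), it can be computed from gold and predicted labels with the pure-Python sketch below. The function names are illustrative, and the toy labels are made up.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def _entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Conditional entropy H(a | b) in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def v_measure(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = _entropy(true), _entropy(pred)
    # Homogeneity: each predicted group contains members of a single gold class.
    homogeneity = 1.0 if h_true == 0 else 1.0 - _conditional_entropy(true, pred) / h_true
    # Completeness: all members of a gold class land in the same predicted group.
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred, true) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


gold = ["check_info", "check_info", "add_list", "add_list"]
print(v_measure(gold, ["a", "a", "b", "b"]))  # labels renamed, grouping identical
print(v_measure(gold, ["a", "a", "a", "a"]))  # everything collapsed into one group
```

Unlike accuracy, the score depends only on how items are grouped, not on the label names, which is why it also applies to the KMeans variant of the GPTCache baseline.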
## Citation
```bibtex
@article{basu2026w5h2,
title={Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning},
author={Basu, Abhinaba},
journal={arXiv preprint arXiv:2602.18922},
year={2026}
}
```
## License
CC BY 4.0
Provider: biztiger



