biztiger/nyayabench-v2

Name: biztiger/nyayabench-v2
Creator: biztiger
Published: 2026-03-22 01:47:47
License: CC BY 4.0 (per the dataset card below)

Source: Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/biztiger/nyayabench-v2


Description:

---
license: cc-by-4.0
language:
- en
- hi
- es
- ar
- zh
- ja
- fr
- de
- pt
- ru
- ko
- tr
- multilingual
pretty_name: NyayaBench v2
size_categories:
- 1K<n<10K
task_categories:
- text-classification
tags:
- intent-classification
- agent-caching
- w5h2
- few-shot
- multilingual
- agentic
- setfit
- contrastive-learning
- llm-cost-reduction
configs:
- config_name: default
  data_files:
  - split: full
    path: data/nyayabench_v2.jsonl
  - split: train
    path: data/train.jsonl
  - split: test_en
    path: data/test_en.jsonl
  - split: test_multilingual
    path: data/test_multilingual.jsonl
dataset_info:
  features:
  - name: query
    dtype: string
  - name: intent
    dtype: string
  - name: w5h2_class
    dtype: string
  - name: language
    dtype: string
  - name: source
    dtype: string
  - name: popularity
    dtype: int64
  - name: is_compound
    dtype: bool
  - name: discovered_intent
    dtype: bool
  - name: region
    dtype: string
  - name: dialect
    dtype: string
  - name: register
    dtype: string
  - name: has_params
    dtype: bool
  splits:
  - name: full
    num_examples: 8514
  - name: train
    num_examples: 160
  - name: test_en
    num_examples: 1193
  - name: test_multilingual
    num_examples: 7161
---

# NyayaBench v2

A real-world agentic intent classification benchmark sourced from production personal AI agent interactions. Unlike synthetic benchmarks, NyayaBench v2 captures the messy reality of how people actually talk to AI agents: compound queries, regional phrasing, and a long tail of 528 intents that existing caching methods can't handle.

**Paper:** [Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning](https://arxiv.org/abs/2602.18922) (arXiv:2602.18922)

**Code:** [nabaos/w5h2-intent-cache](https://github.com/nabaos/w5h2-intent-cache) (W5H2 taxonomy, SetFit evaluation, ablation studies, ONNX export)

## Why This Dataset

Most intent classification benchmarks (MASSIVE, CLINC150, BANKING77) are collected in controlled settings with clean, single-intent utterances.
Real agent traffic looks different:

- **Compound queries**: "check my email and set a reminder for the meeting at 3"
- **Extreme class imbalance**: `check_info` has 1,528 entries, `add_list` has 50
- **63 languages**: not just translated templates, but region-specific intents (Japanese transit queries, Latin American fintech, South Asian agriculture)
- **528 fine-grained intents** that must be compressed to 20 actionable cache keys

This makes NyayaBench v2 substantially harder than established benchmarks and a better proxy for production agent systems.

## The Problem It Tests

GPTCache and similar embedding-similarity caches achieve only a 3–38% hit rate on personal agent tasks. The paper shows this happens because they optimize for the wrong property: semantic similarity instead of *key consistency*, meaning that differently phrased requests for the same action should always map to the same cache key. W5H2 decomposes intents into **action × target** pairs (e.g., `check` × `weather` → `check_info`) and uses few-shot contrastive learning to classify them in about 2 ms, reaching 91.1% accuracy on MASSIVE with just 8 examples per class. Accuracy on NyayaBench v2 itself is lower (see Benchmark Results).

This approach is used in production by [NabaOS](https://github.com/nabaos/nabaos), an open-source autonomous agent runtime, where the W5H2 cache tier handles 85% of interactions locally at near-zero cost as part of a five-tier cascade architecture.

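To make the idea of key consistency concrete, here is a minimal sketch of a cache keyed on the W5H2 class rather than on embedding similarity. It is not the paper's or NabaOS's implementation: `toy_classify` is a keyword stand-in for the few-shot classifier, and `IntentKeyCache`, `expensive_plan`, and the keyword tables are hypothetical names introduced only for illustration. Only the `check` × `weather` → `check_info` pair is taken from the text above.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical keyword -> (action, target) table; a real system would use the
# trained few-shot classifier instead of keyword matching.
TARGET_KEYWORDS: Dict[str, Tuple[str, str]] = {
    "weather": ("check", "weather"),
    "forecast": ("check", "weather"),
    "alarm": ("set", "alarm"),
}

# (action, target) -> W5H2 class. Only the first pair appears in the dataset
# card; the second is an illustrative assumption.
PAIR_TO_CLASS: Dict[Tuple[str, str], str] = {
    ("check", "weather"): "check_info",
    ("set", "alarm"): "set_schedule",
}


def toy_classify(query: str) -> Optional[str]:
    """Keyword stand-in for the classifier; returns a W5H2 class or None."""
    text = query.lower()
    for keyword, pair in TARGET_KEYWORDS.items():
        if keyword in text:
            return PAIR_TO_CLASS[pair]
    return None


class IntentKeyCache:
    """Cache whose key is the predicted class, not the raw query text."""

    def __init__(self, classify: Callable[[str], Optional[str]]) -> None:
        self._classify = classify
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._classify(query)
        if key is not None and key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: fall back to the expensive path and remember the result.
        self.misses += 1
        result = compute(query)
        if key is not None:
            self._store[key] = result
        return result


calls = []


def expensive_plan(query: str) -> str:
    """Stand-in for an LLM call that produces an execution plan."""
    calls.append(query)
    return "call_weather_api"


cache = IntentKeyCache(toy_classify)
first = cache.get_or_compute("What's the weather in Pune?", expensive_plan)
second = cache.get_or_compute("Show me tomorrow's forecast", expensive_plan)
```

The two queries share few words, yet both resolve to the same key, so the second one is served from the cache without a second call to `expensive_plan`. In this sketch the cached value is a plan string rather than a final answer, since two queries of the same class generally need different parameters.
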
## Overview

| Property | Value |
|----------|-------|
| Total entries | 8,514 |
| Fine-grained intents | 528 |
| W5H2 super-classes | 20 |
| Languages | 63 |
| Sources | Voice assistants, smart home, IoT, productivity agents, regional deployments |
| Compound queries | Yes (flagged via `is_compound`) |

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("biztiger/nyayabench-v2", data_files={
    "full": "data/nyayabench_v2.jsonl",
    "train": "data/train.jsonl",
    "test_en": "data/test_en.jsonl",
    "test_multilingual": "data/test_multilingual.jsonl",
})

# 8-shot train split (160 examples, 8 per W5H2 class)
print(ds["train"][0])
# {'query': '...', 'intent': '...', 'w5h2_class': 'check_info', 'language': 'en', ...}
```

## W5H2 Taxonomy (20 Classes)

The W5H2 framework decomposes intents into **action** × **target** pairs:

| Class | Description | Count | Examples |
|-------|-------------|-------|----------|
| `check_info` | Query information | 1,528 | Weather, news, prices, sports, traffic |
| `order_commerce` | Commerce & booking | 1,084 | Food, rides, hotels, tickets |
| `analyze_data` | Analyze data | 992 | Financial, sentiment, data analysis |
| `search_web` | Search & discover | 841 | Web search, directions, recipes |
| `create_content` | Generate content | 623 | Write code, documents, presentations |
| `control_home` | Control smart home | 603 | Lights, thermostat, locks, vacuum |
| `manage_tasks` | Manage tasks/files | 505 | Files, notes, expenses, fitness |
| `send_comms` | Send communications | 309 | Message, email, call |
| `play_media` | Play media | 304 | Music, radio, ambient sounds |
| `convert_data` | Convert/calculate | 262 | Translate, units, currency |
| `research_info` | Research & analysis | 245 | Market research, competitor analysis |
| `check_home` | Check home status | 202 | Camera, door, battery, temperature |
| `control_media` | Control media | 199 | Volume, TV, podcast controls |
| `set_schedule` | Set time-based | 199 | Alarm, reminder, timer |
| `check_comms` | Check communications | 139 | Read email, email queries |
| `run_routine` | Execute routines | 137 | Morning routine, smart home scenes |
| `check_schedule` | Check calendar | 128 | Calendar events, reminders |
| `manage_comms` | Manage communications | 103 | Email triage, classification |
| `schedule_meeting` | Schedule meetings | 61 | Meeting scheduling |
| `add_list` | Add to lists | 50 | Shopping list, todo list |

## Splits

| Split | Size | Description |
|-------|------|-------------|
| `full` | 8,514 | Complete dataset, all languages |
| `train` | 160 | 8-shot stratified (8 per class, seed=42); matches the paper setup |
| `test_en` | 1,193 | English test set |
| `test_multilingual` | 7,161 | 30 translated + 32 source languages |

## Schema

```json
{
  "query": "research the top 5 competitor products and summarize their pricing",
  "intent": "research_competitors",
  "w5h2_class": "research_info",
  "language": "en",
  "source": "auto_gpt_examples",
  "popularity": 8,
  "is_compound": true,
  "discovered_intent": true,
  "region": "north_america",
  "dialect": null,
  "register": "informal",
  "has_params": true
}
```

| Field | Description |
|-------|-------------|
| `query` | Natural language user utterance |
| `intent` | Fine-grained intent label (528 unique) |
| `w5h2_class` | W5H2 super-class (20 unique) |
| `language` | ISO 639-1 language code |
| `source` | Collection source |
| `popularity` | Usage frequency score (1–8) |
| `is_compound` | Whether query contains multiple intents |
| `discovered_intent` | Whether intent was discovered during collection (vs. predefined) |
| `region` | Geographic region (e.g., `east_asia`, `south_asia`, `latin_america`); multilingual entries only |
| `dialect` | Language dialect or variant where applicable |
| `register` | Formality register (`formal`, `informal`, `colloquial`) |
| `has_params` | Whether query contains extractable parameters (times, names, quantities) |

## Benchmark Results

From the paper, evaluated on the English test split (1,193 examples, 20 classes):

| Method | Accuracy | V-measure | Latency |
|--------|----------|-----------|---------|
| GPTCache (cosine threshold) | 37.9% | n/a | ~50 ms |
| GPTCache (KMeans k=20) | 49.1% | 0.397 | ~50 ms |
| LLM baseline (20B) | 68.8% | n/a | 3,447 ms |
| **SetFit 8-shot (22M params)** | **55.3% ± 1.0%** | **0.504** | **2.4 ms** |
| SetFit 16-shot | 62.6% | 0.558 | 2.4 ms |
| BERT fine-tuned (full data) | 97.3% | 0.926 | ~5 ms |

NyayaBench v2 is intentionally harder than MASSIVE (where SetFit gets 91.1%): the 528→20 compression means each class absorbs about 26 fine-grained intents on average, with dozens of semantically diverse phrasings. This reflects the real difficulty of agent caching.

Cross-lingual zero-shot transfer (trained on English only): mean 37.7% across 30 languages, with 5 languages above 50%.

## Citation

```bibtex
@article{basu2026w5h2,
  title={Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning},
  author={Basu, Abhinaba},
  journal={arXiv preprint arXiv:2602.18922},
  year={2026}
}
```

## License

CC BY 4.0
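
As a supplementary note to the Splits table above: the `train` split is described as 8-shot stratified with seed 42, which for 20 classes gives 8 × 20 = 160 examples. The card does not spell out the exact sampling procedure, so the following is only a sketch of one plausible way to draw such a split; `sample_k_shot` and `toy_rows` are hypothetical names, and the toy data stands in for the real `full` split.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_k_shot(rows: List[dict], label_key: str = "w5h2_class",
                  k: int = 8, seed: int = 42) -> List[dict]:
    """Draw k examples per class with a fixed seed (assumed procedure)."""
    rng = random.Random(seed)
    by_class: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sampled: List[dict] = []
    # Sorting the labels keeps the draw order, and so the result, reproducible.
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < k:
            raise ValueError(f"class {label!r} has only {len(pool)} examples")
        sampled.extend(rng.sample(pool, k))
    return sampled


# Toy stand-in for the full split: 3 classes with 20 rows each.
toy_rows = [
    {"query": f"{label} query {i}", "w5h2_class": label}
    for label in ("check_info", "add_list", "set_schedule")
    for i in range(20)
]
shots = sample_k_shot(toy_rows)
```

Applied to the real data, the smallest class (`add_list`, 50 entries) still has enough examples for an 8-shot or 16-shot draw. A split drawn this way will not necessarily reproduce the published `train.jsonl`, so use the released file when comparing against the paper's numbers.
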

