birdsql/tapilot-crossing

Name: birdsql/tapilot-crossing
Creator: birdsql
Published: 2026-04-05 17:20:21
License: 暂无描述

Hugging Face2026-04-05 更新2026-05-10 收录

下载链接：

https://hf-mirror.com/datasets/birdsql/tapilot-crossing

下载链接

链接失效反馈

官方服务：

资源简介：

[🌐 Website](https://tapilot-crossing.github.io/) • [📄 Paper](https://dl.acm.org/doi/10.5555/3780338.3781706) • [💻 Tapilot-Crossing](https://github.com/bird-bench/tapilot-crossing) ## 🌐 Project Overview Tapilot-Crossing is an innovative benchmark designed to evaluate Language Model (LLM) agent performance on interactive data analysis tasks. This project introduces a cost-effective way to simulate realistic user-agent interactions via DECISION COMPANY. Tapilot-Crossing includes **1094 user intents** across **946 entries** in 8 categories, spanning 5 data domains (credit card risk, ATP tennis, fast food nutrition, laptop pricing, Melbourne housing). It also features the **A**daptive **I**nteraction **R**eflection (**AIR**) strategy, aimed at improving LLM agent ability to learn from their interaction histories, leading to significant performance enhancements. This is the [data intelligence index](https://livesqlbench.ai/data-intelligence-index/) version of the benchmark. --- ## 📦 Tapilot-Crossing Data The benchmark comprises **1094 user intents** across **952 entries** in 8 JSONL files, split into two evaluation tasks. ### Multi-Choice Tasks (426 entries) | File | Entries | Category | Description | |------|---------|----------|-------------| | `action_analysis.jsonl` | 218 | Insight Mining | The agent interprets analysis results and extracts insights (e.g., trends, correlations) to support decision-making, beyond just generating code. | | `action_una.jsonl` | 106 | Fast Fail | The agent detects that a question cannot be answered due to missing data or invalid assumptions and explicitly reports it. | | `action_bg.jsonl` | 33 | Best Guess | For under-specified queries, the agent makes reasonable assumptions based on data or commonsense instead of asking for clarification. | | `action_plotqa.jsonl` | 69 | Plot QA | The agent answers questions based on visualizations, requiring understanding of plots and relationships between variables. | ### Code Generation Tasks (520 entries) | File | Entries | Category | Description | |------|---------|----------|-------------| | `normal.jsonl` | 283 | Normal | Fully specified queries where no interaction or clarification is needed; the agent directly produces code or answers. | | `private.jsonl` | 206 | Private | Involves user-defined/private libraries. Tests the agent's ability to understand and use unseen APIs rather than relying on standard libraries. The first line contains the private library definition (`private_lib`, `private_lib_json` fields); data entries start from line 2. | | `action_correction.jsonl` | 16 | Update Code | The agent fixes bugs or refines previously generated code based on user feedback or errors. | | `private_action_correction.jsonl` | 15 | Private + Update Code | Combination of private and action_correction. The agent must both handle private libraries and iteratively fix/update code based on feedback. | ### Data Categories Explained 1. **action_analysis**: Corresponds to *Insight_Mining*. The agent interprets analysis results and extracts insights (e.g., trends, correlations) to support decision-making, beyond just generating code. 2. **action_bg (best guess)**: Corresponds to *Best_Guess*. For under-specified queries, the agent makes reasonable assumptions based on data or commonsense instead of asking for clarification. 3. **action_correction**: Corresponds to *Update_Code*. The agent fixes bugs or refines previously generated code based on user feedback or errors. 4. **action_plotqa**: Corresponds to *Plot_QA*. The agent answers questions based on visualizations, requiring understanding of plots and relationships between variables. 5. **action_una (unanswerable)**: Corresponds to *Fast_Fail*. The agent detects that a question cannot be answered due to missing data or invalid assumptions and explicitly reports it. 6. **normal**: Fully specified queries where no interaction or clarification is needed; the agent directly produces code or answers. 7. **private**: Involves user-defined/private libraries. Tests the agent's ability to understand and use unseen APIs rather than relying on standard libraries. 8. **private_action_correction**: Combination of *private* and *action_correction*. The agent must both handle private libraries and iteratively fix/update code based on feedback. ### Entries vs. Intents The benchmark has **946 entries** but evaluation reports **1094 total intents**. This is because some code generation entries contain **multiple user intents** in a single entry (indicated by `result_type` being a list). For example, a single entry might ask the agent to both filter a dataframe and plot a chart, this counts as 2 intents. Each intent is evaluated independently: the entry passes only if all intents pass, and each intent contributes separately to the total score. - Multi-choice: 426 entries = 430 intents - Code generation: 520 entries = 664 intents - **Total: 946 entries = 1094 intents** ### JSONL Schema Each entry contains the following fields: | Field | Description | |-------|-------------| | `data_id` | Unique identifier for the entry | | `domain_name` | One of: `credit_card_risk`, `ATP_tennis`, `fast_food`, `laptop_price`, `melb_housing` | | `result_type` | Expected output type: `dataframe`, `plot`, `value`, `list`, `multi_choice`, `unanswerable`, etc. | | `current_query` | The user's current-turn query | | `prompt_with_hist_txt` | Full prompt including system context and dialogue history (used as LLM input) | | `prompt_with_hist_json` | Same prompt in OpenAI Chat message format | | `reference_answer` | Ground truth code (code gen) or correct answer JSON (multi-choice) | | `ref_code_hist` | Accumulated code from all previous dialogue turns | | `ref_code_all` | `ref_code_hist` + current turn reference code (full executable reference) | | `eval_metrics` | Python evaluation code that compares predicted vs reference outputs | ## 🏆 Baseline Performance We evaluate 6 LLMs on Tapilot-Crossing using the **base** (direct prompting) strategy. Results are reported as accuracy (%). ### Overall Results | Model | Multi-Choice | Code Generation | Overall | |-------|:---:|:---:|:---:| | **Claude Opus 4.6** | **65.12** | **33.43** | **45.89** | | Kimi 2.5 | 56.98 | 8.73 | 27.70 | | Claude Sonnet 4.5 | 55.12 | 29.22 | 39.40 | | Qwen3-Coder | 49.77 | 27.41 | 36.20 | | GLM 4.7 | 49.30 | 5.12 | 22.49 | | MiniMax M2.1 | 46.05 | 19.88 | 30.16 |

提供机构：

birdsql

5,000+

优质数据集

54 个

任务类型

进入经典数据集