
johanneskirmayr/car-bench-dataset

Source: Hugging Face. Updated 2026-02-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/johanneskirmayr/car-bench-dataset
Resource description:
---
configs:
  # ── Tasks ──────────────────────────────────────────────────────────────
  - config_name: tasks_base
    default: true
    data_files:
      - split: train
        path: "tasks/base_train.jsonl"
      - split: test
        path: "tasks/base_test.jsonl"
  - config_name: tasks_disambiguation
    data_files:
      - split: train
        path: "tasks/disambiguation_train.jsonl"
      - split: test
        path: "tasks/disambiguation_test.jsonl"
  - config_name: tasks_hallucination
    data_files:
      - split: train
        path: "tasks/hallucination_train.jsonl"
      - split: test
        path: "tasks/hallucination_test.jsonl"
  # ── Mock Data: Navigation ──────────────────────────────────────────────
  - config_name: mock_locations
    data_files: "mock_data/navigation/locations.jsonl"
  - config_name: mock_pois
    data_files: "mock_data/navigation/pois.jsonl"
  - config_name: mock_weather
    data_files: "mock_data/navigation/weather.jsonl"
  - config_name: mock_routes_location_location
    data_files: "mock_data/navigation/routes_location_location.jsonl"
  - config_name: mock_routes_location_poi
    data_files: "mock_data/navigation/routes_location_poi.jsonl"
  - config_name: mock_routes_poi_location
    data_files: "mock_data/navigation/routes_poi_location.jsonl"
  - config_name: mock_routes_index
    data_files: "mock_data/navigation/routes_index.jsonl"
  - config_name: mock_routes_metadata
    data_files: "mock_data/navigation/routes_metadata.jsonl"
  # ── Mock Data: Productivity & Communication ────────────────────────────
  - config_name: mock_calendars
    data_files: "mock_data/productivity_and_communication/calendars.jsonl"
  - config_name: mock_contacts
    data_files: "mock_data/productivity_and_communication/contacts.jsonl"
license: mit
task_categories:
  - text-generation
  - question-answering
language:
  - en
tags:
  - benchmark
  - car
  - voice-assistant
  - agentic
  - tool-use
  - function-calling
size_categories:
  - 1K<n<10K
---

# CAR-Bench Dataset

**CAR-Bench** is a benchmark for evaluating AI voice assistants in a realistic automotive (car) environment. It tests an agent's ability to correctly use vehicle control tools, handle disambiguation, and avoid hallucinations.

## Dataset Structure

The dataset is organized into **task configs** and **mock data configs**:

### Tasks

Each task defines a user persona, an instruction, the initial vehicle/environment context, and the ground-truth sequence of tool-call actions the assistant should perform.

| Config | Description | Train | Test |
|--------|-------------|-------|------|
| `tasks_base` | Standard tasks covering vehicle controls, navigation, calendar, etc. | 50 | 50 |
| `tasks_disambiguation` | Tasks requiring the agent to disambiguate parameters (internally via preferences or by asking the user) | 30 | 26 |
| `tasks_hallucination` | Tasks where certain tools/parameters are intentionally removed to test if the agent hallucinates | 48 | 50 |

**Task schema:**

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique task identifier |
| `persona` | string | Description of the simulated user's personality and communication style |
| `calendar_id` | string | Reference to a calendar in the mock data |
| `instruction` | string | The instruction given to the simulated user |
| `context_init_config` | string (JSON) | Initial vehicle and environment state (battery, seats, location, weather, preferences, etc.) |
| `actions` | string (JSON) | Ground-truth sequence of tool calls `[{name, kwargs, index, dependent_on_action_index}]` |
| `task_type` | string | One of: `base`, `disambiguation_internal`, `disambiguation_user`, `hallucination_missing_tool`, `hallucination_missing_tool_parameter`, `hallucination_missing_tool_response` |
| `disambiguation_element_internal` | string or null | What needs to be disambiguated internally (only set in disambiguation tasks) |
| `disambiguation_element_user` | string or null | What needs to be disambiguated with the user (only set in disambiguation tasks) |
| `disambiguation_element_note` | string or null | Note explaining the disambiguation (only set in disambiguation tasks) |
| `removed_part` | string (JSON) or null | Which tools/parameters were removed (only set in hallucination tasks) |

### Mock Data

The mock data simulates a realistic car environment database used by the tools during benchmark execution.

| Config | Rows | Description |
|--------|------|-------------|
| `mock_locations` | 48 | European cities with GPS coordinates |
| `mock_pois` | 130,693 | Points of interest (airports, bakeries, restaurants, etc.) |
| `mock_weather` | 48 | Weather data per location (8 time-slots/day) |
| `mock_routes_location_location` | 6,768 | Routes between locations (3 alternatives each) |
| `mock_routes_location_poi` | 1,378 | Routes from locations to POIs |
| `mock_routes_poi_location` | 1,378 | Routes from POIs to locations |
| `mock_routes_index` | 1,763,870 | Route lookup index |
| `mock_routes_metadata` | 1,754,346 | Metadata for POI-to-POI route generation |
| `mock_calendars` | 100 | Calendar entries with meetings |
| `mock_contacts` | 100 | Contact information |

## Usage

### With the CAR-Bench benchmark

The [CAR-Bench codebase](https://github.com/CAR-bench/car-bench) loads tasks and mock data from this dataset automatically:

```bash
pip install -e .
python run.py --model gpt-4.1-mini --task-type base --task-split test --num-tasks 3
```

### Standalone

```python
import json

from datasets import load_dataset

# Load tasks
tasks = load_dataset("johanneskirmayr/car-bench-dataset", "tasks_base")
print(tasks["test"][0])

# Load mock data
locations = load_dataset("johanneskirmayr/car-bench-dataset", "mock_locations", split="train")
contacts = load_dataset("johanneskirmayr/car-bench-dataset", "mock_contacts", split="train")

# Parse nested JSON fields
task = tasks["test"][0]
context = json.loads(task["context_init_config"])
actions = json.loads(task["actions"])
```

## Citation

If you use this dataset, please cite the CAR-Bench paper:

```bibtex
@misc{kirmayr2026carbenchevaluatingconsistencylimitawareness,
  title={CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty},
  author={Johannes Kirmayr and Lukas Stappen and Elisabeth André},
  year={2026},
  eprint={2601.22027},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.22027},
}
```
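
Returning to the task schema: `context_init_config`, `actions`, and `removed_part` are stored as JSON strings, and `removed_part` is null outside hallucination tasks. The following is a minimal sketch of a helper that decodes those fields on a single record without touching the network. The helper name `decode_task`, the sample record, and its tool names are hypothetical and not part of the dataset or the CAR-Bench codebase; sorting by `index` assumes that field gives the intended position of each action, and the integer-or-null value shown for `dependent_on_action_index` is likewise an assumption.

```python
import json

# Fields the task schema lists as "string (JSON)".
JSON_FIELDS = ("context_init_config", "actions", "removed_part")


def decode_task(record):
    """Return a copy of a task record with its JSON-string fields parsed."""
    decoded = dict(record)
    for field in JSON_FIELDS:
        raw = decoded.get(field)
        # `removed_part` is null outside hallucination tasks, so only parse strings.
        if isinstance(raw, str):
            decoded[field] = json.loads(raw)
    # Assumption: `index` gives the intended position of each action.
    if isinstance(decoded.get("actions"), list):
        decoded["actions"] = sorted(decoded["actions"], key=lambda a: a["index"])
    return decoded


# Hypothetical record for illustration only; not taken from the dataset.
sample = {
    "task_id": "example_0",
    "task_type": "base",
    "context_init_config": json.dumps({"battery_level": 80}),
    "actions": json.dumps([
        {"name": "second_tool", "kwargs": {}, "index": 1,
         "dependent_on_action_index": 0},
        {"name": "first_tool", "kwargs": {"value": 1}, "index": 0,
         "dependent_on_action_index": None},
    ]),
    "removed_part": None,
}

task = decode_task(sample)
print([a["name"] for a in task["actions"]])  # ['first_tool', 'second_tool']
```

The same helper could be applied to rows returned by `load_dataset`, for example `decode_task(tasks["test"][0])`, since each row is a plain dictionary with the fields listed in the schema.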
Provided by:
johanneskirmayr