Salesforce/vibepass

Name: Salesforce/vibepass
Creator: Salesforce
Published: 2026-03-18 03:47:30
License: cc-by-4.0

Source: Hugging Face · Updated 2026-03-18 · Indexed 2026-05-10

Download link:

https://hf-mirror.com/datasets/Salesforce/vibepass

Resource description:

---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: question_title
    dtype: string
  - name: question_content
    dtype: string
  - name: platform
    dtype: string
  - name: starter_code
    dtype: string
  - name: difficulty
    dtype: string
  - name: silver_solution
    dtype: string
  - name: buggy_solution
    dtype: string
  - name: test_checker
    dtype: string
  splits:
  - name: test
    num_bytes: 2182989
    num_examples: 173
  download_size: 2182989
  dataset_size: 2182989
configs:
- config_name: default
  data_files:
  - split: test
    path: test.jsonl
task_categories:
- text-generation
language:
- en
tags:
- code
- debugging
- bug-detection
size_categories:
- 100<n<1K
license: cc-by-4.0
---

# VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

**Authors:** Srijan Bansal, Jiao Fangkai, Yilun Zhou, Austin Xu, Shafiq Joty, Semih Yavuz

**TL;DR:** As LLMs shift programming toward human-guided "vibe coding", agentic tools increasingly rely on models to self-diagnose and repair their own subtle faults, a capability central to autonomous software engineering yet never systematically evaluated. VIBEPASS presents the first empirical benchmark that decomposes fault-targeted reasoning into two coupled tasks: **Fault-Triggering Test Generation (FT-Test)**, constructing discriminative witnesses that expose latent bugs, and **Fault-targeted Program Repair (FPR)**, repairing code under varying diagnostic conditions. Our 173 curated examples pair competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs reveals that fault-targeted reasoning does not scale with general coding ability: the binding bottleneck is not code synthesis or test validity but **fault hypothesis generation**, a capability that remains deficient across all frontier models.

- 💻 **GitHub:** https://github.com/SalesforceAIResearch/vibepass
- 📜 **Paper:** https://arxiv.org/abs/2603.15921

## Dataset Structure

Each sample in the dataset has the following structure:

```json
{
  "question_id": "Unique identifier for the problem (same as livecodebench)",
  "question_title": "Title of the coding problem",
  "question_content": "Full description of the problem including constraints and examples",
  "platform": "Source platform (e.g., leetcode)",
  "starter_code": "Initial code template provided to solve the problem",
  "difficulty": "Problem difficulty level (easy, medium, hard)",
  "silver_solution": "Correct human solution to the problem",
  "buggy_solution": "Model-generated solution containing bugs",
  "test_checker": "Validation code to check test input correctness"
}
```

Please refer to https://huggingface.co/datasets/livecodebench/code_generation_lite for official test cases.

## Use Cases

This dataset is designed for evaluating LLM coding capabilities:

- LLM-as-Judge evaluation for code correctness (bug detection)
- Fault-triggering test generation (bug localization)
- Program repair and debugging (bug fix)

## Citation

```
@misc{bansal2026vibepassvibecodersreally,
  title={VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?},
  author={Srijan Bansal and Jiao Fangkai and Yilun Zhou and Austin Xu and Shafiq Joty and Semih Yavuz},
  year={2026},
  eprint={2603.15921},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.15921},
}
```
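## Example: Checking Records Against the Schema

The record layout under Dataset Structure can be checked mechanically before running any evaluation. The following is a minimal sketch, not code from the VIBEPASS repository: the helper names `parse_records`, `difficulty_counts`, and `_sample` are hypothetical, and the sample records are placeholders invented for illustration rather than real dataset rows. Only the nine field names come from the card.

```python
import json
from collections import Counter

# Field names taken from the `dataset_info` block of the dataset card.
EXPECTED_FIELDS = (
    "question_id", "question_title", "question_content", "platform",
    "starter_code", "difficulty", "silver_solution", "buggy_solution",
    "test_checker",
)


def parse_records(lines):
    """Parse JSONL lines into dicts; raise if a record lacks a card field."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in EXPECTED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        records.append(record)
    return records


def difficulty_counts(records):
    """Count records per value of the `difficulty` field."""
    return Counter(record["difficulty"] for record in records)


def _sample(question_id, difficulty):
    """Build a placeholder JSONL line (hypothetical data, not a real row)."""
    record = {name: "" for name in EXPECTED_FIELDS}
    record.update(question_id=question_id, difficulty=difficulty)
    return json.dumps(record)


sample_lines = [
    _sample("demo-1", "easy"),
    "",
    _sample("demo-2", "hard"),
    _sample("demo-3", "hard"),
]
records = parse_records(sample_lines)
counts = difficulty_counts(records)
print(len(records), dict(counts))
```

Assuming `test.jsonl` (the path named in the card's `configs` block) has been downloaded locally, the same function accepts an open file handle, for example `parse_records(open("test.jsonl", encoding="utf-8"))`. According to the card, that file should yield 173 records.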

Provided by:

Salesforce
