
Salesforce/vibepass

Hugging Face · Updated 2026-03-18 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/Salesforce/vibepass
Description:
---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: question_title
    dtype: string
  - name: question_content
    dtype: string
  - name: platform
    dtype: string
  - name: starter_code
    dtype: string
  - name: difficulty
    dtype: string
  - name: silver_solution
    dtype: string
  - name: buggy_solution
    dtype: string
  - name: test_checker
    dtype: string
  splits:
  - name: test
    num_bytes: 2182989
    num_examples: 173
  download_size: 2182989
  dataset_size: 2182989
configs:
- config_name: default
  data_files:
  - split: test
    path: test.jsonl
task_categories:
- text-generation
language:
- en
tags:
- code
- debugging
- bug-detection
size_categories:
- 100<n<1K
license: cc-by-4.0
---

# VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

**Authors:** Srijan Bansal, Jiao Fangkai, Yilun Zhou, Austin Xu, Shafiq Joty, Semih Yavuz

**TL;DR:** As LLMs shift programming toward human-guided "vibe coding", agentic tools increasingly rely on models to self-diagnose and repair their own subtle faults, a capability central to autonomous software engineering yet never systematically evaluated. VIBEPASS presents the first empirical benchmark that decomposes fault-targeted reasoning into two coupled tasks: **Fault-Triggering Test Generation (FT-Test)**, constructing discriminative witnesses that expose latent bugs, and **Fault-targeted Program Repair (FPR)**, repairing code under varying diagnostic conditions. Our 173 curated examples pair competitive programming problems with LLM-generated solutions that pass partial test suites but fail on semantic edge cases, enabling controlled identification of where the diagnostic chain breaks down. Evaluating 12 frontier LLMs reveals that fault-targeted reasoning does not scale with general coding ability: the binding bottleneck is not code synthesis or test validity but **fault hypothesis generation**, a capability that remains deficient across all frontier models.
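
To make the FT-Test idea concrete, here is a minimal sketch of what a "discriminative witness" could look like in code: a valid input on which a reference solution and a buggy solution disagree. This is an illustration only, not the benchmark's evaluation harness. The toy problem, the functions `silver_max_gap`, `buggy_max_gap`, and `is_valid_input`, and the helper `find_fault_triggering_input` are all hypothetical names invented for this example.

```python
from typing import Callable, Iterable, List, Optional


def silver_max_gap(nums: List[int]) -> int:
    """Hypothetical reference solution: largest gap between neighbours after sorting."""
    ordered = sorted(nums)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


def buggy_max_gap(nums: List[int]) -> int:
    """Hypothetical buggy solution: forgets to sort, so it is only right on sorted input."""
    return max((b - a for a, b in zip(nums, nums[1:])), default=0)


def is_valid_input(nums: List[int]) -> bool:
    """Hypothetical stand-in for a test checker: enforce made-up input constraints."""
    return 1 <= len(nums) <= 100 and all(0 <= x <= 10**9 for x in nums)


def find_fault_triggering_input(
    silver: Callable[[List[int]], int],
    buggy: Callable[[List[int]], int],
    checker: Callable[[List[int]], bool],
    candidates: Iterable[List[int]],
) -> Optional[List[int]]:
    """Return the first valid candidate on which the two solutions disagree."""
    for candidate in candidates:
        if not checker(candidate):
            continue  # an invalid input cannot count as a witness
        if silver(candidate) != buggy(candidate):
            return candidate
    return None


# Sorted inputs hide the bug; the unsorted one exposes it.
candidates = [[], [1, 2, 5], [7], [5, 1, 2]]
witness = find_fault_triggering_input(
    silver_max_gap, buggy_max_gap, is_valid_input, candidates
)
print(witness)  # [5, 1, 2]
```

The sketch separates two conditions that a witness must satisfy: the input has to be valid, and it has to discriminate between the two programs. Producing the candidate list is the hard part, and it is the step the TL;DR identifies as fault hypothesis generation.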
- 💻 **GitHub:** https://github.com/SalesforceAIResearch/vibepass
- 📜 **Paper:** https://arxiv.org/abs/2603.15921

## Dataset Structure

Each sample in the dataset has the following structure:

```json
{
  "question_id": "Unique identifier for the problem (same as livecodebench)",
  "question_title": "Title of the coding problem",
  "question_content": "Full description of the problem including constraints and examples",
  "platform": "Source platform (e.g., leetcode)",
  "starter_code": "Initial code template provided to solve the problem",
  "difficulty": "Problem difficulty level (easy, medium, hard)",
  "silver_solution": "Correct human solution to the problem",
  "buggy_solution": "Model-generated solution containing bugs",
  "test_checker": "Validation code to check test input correctness"
}
```

Please refer to https://huggingface.co/datasets/livecodebench/code_generation_lite for official test cases.

## Use Cases

This dataset is designed for evaluating LLM coding capabilities:

- LLM-as-judge evaluation of code correctness (bug detection)
- Fault-triggering test generation (bug localization)
- Program repair and debugging (bug fix)

## Citation

```
@misc{bansal2026vibepassvibecodersreally,
  title={VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?},
  author={Srijan Bansal and Jiao Fangkai and Yilun Zhou and Austin Xu and Shafiq Joty and Semih Yavuz},
  year={2026},
  eprint={2603.15921},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.15921},
}
```
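
## Example: Reading Records

The nine string fields listed under Dataset Structure map naturally onto a small typed record. The sketch below parses JSON Lines text into such records and tallies them by difficulty. It is not code from the VIBEPASS repository: the class `VibepassRecord`, the helpers `parse_records` and `count_by_difficulty`, and the two sample rows are made up for illustration, with placeholder values rather than real dataset content.

```python
import io
import json
from collections import Counter
from dataclasses import dataclass, fields
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class VibepassRecord:
    """Hypothetical container mirroring the nine string fields of a sample."""
    question_id: str
    question_title: str
    question_content: str
    platform: str
    starter_code: str
    difficulty: str
    silver_solution: str
    buggy_solution: str
    test_checker: str


FIELD_NAMES = [f.name for f in fields(VibepassRecord)]


def parse_records(lines: Iterable[str]) -> List[VibepassRecord]:
    """Parse JSON Lines text, failing loudly if a row lacks an expected field."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        raw = json.loads(line)
        missing = [name for name in FIELD_NAMES if name not in raw]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        records.append(VibepassRecord(**{name: raw[name] for name in FIELD_NAMES}))
    return records


def count_by_difficulty(records: Iterable[VibepassRecord]) -> Dict[str, int]:
    """Count records per difficulty label."""
    return dict(Counter(record.difficulty for record in records))


# Placeholder rows standing in for test.jsonl; every value here is invented.
placeholder = {name: f"<{name}>" for name in FIELD_NAMES}
sample_text = "\n".join(
    json.dumps({**placeholder, "question_id": qid, "difficulty": level})
    for qid, level in [("demo-1", "easy"), ("demo-2", "hard")]
)

records = parse_records(io.StringIO(sample_text))
print(count_by_difficulty(records))  # {'easy': 1, 'hard': 1}
```

To run this against the real data, pass an open handle on a local copy of `test.jsonl` to `parse_records` instead of the `io.StringIO` sample. If the Hugging Face `datasets` library is installed and the repository is reachable, `load_dataset("Salesforce/vibepass", split="test")` should expose the same fields.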
Provider:
Salesforce