
bigcodereward

Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/bigcode/bigcodereward
Description:
# BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

<p align="center">
  <img src="https://raw.githubusercontent.com/bigcode-project/bigcodearena/refs/heads/main/assets/bigcodearena_banner.svg" alt="BigCodeArena" width="800">
</p>

[Paper](https://huggingface.co/papers/2510.08697) | [Code](https://github.com/bigcode-project/bigcodearena) | [Project Page (Hugging Face Space)](https://huggingface.co/spaces/bigcode/arena)

## About BigCodeArena

**BigCodeArena** is an open human evaluation platform for code generation, built on top of Chatbot Arena with a comprehensive, on-the-fly execution environment. It executes LLM-generated code and lets humans interact with the execution process and its outcomes, which addresses the difficulty of judging code quality by manual inspection alone.

This dataset contains over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. From these, more than 4,700 multi-turn samples with pairwise human preferences were identified. The data is used to systematically examine the code understanding and generation capabilities of frontier LLMs, and it forms the basis for two curated benchmarks: BigCodeReward and AutoCodeArena.

## Sample Usage

This dataset is primarily used with the `BigCodeReward` framework, which evaluates how consistent reward models are with human preferences on code generation tasks. The following steps, taken from the [BigCodeArena GitHub repository](https://github.com/bigcode-project/bigcodearena), provide a quick start for evaluating judge models.

First, clone the repository and install the dependencies for `BigCodeReward`:

```bash
git clone https://github.com/bigcode-project/bigcodearena.git
cd bigcodearena

# Install dependencies for BigCodeReward
cd bigcodereward
pip install -r requirements.txt
```

Next, set your API keys (e.g., for OpenAI judge models):

```bash
export OPENAI_API_KEY="sk-..."
```

Then evaluate judge models and analyze their consistency with human preferences:

```bash
# Ensure you are in the bigcodereward directory
# cd bigcodereward

# Evaluate with execution results (recommended)
python eval_hf_data.py --judge-model gpt-4o --workers 8

# Evaluate code-only (without execution)
python eval_hf_data.py --judge-model gpt-4o --no-output --workers 8

# Analyze consistency with human preferences
python analyze_model_judge_results.py

# Compute ELO ratings and correlations
python analyze_elo.py
```

## Citation

If you find our dataset or project useful for your research, please cite the following paper:

```bibtex
@article{zhuo2025bigcodearena,
  title={BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution},
  author={Terry Yue Zhuo and Xiaolong Jin and Hange Liu and Juyong Jiang and Tianyang Liu and Chen Gong and Bhupesh Bishnoi and Vaisakhi Mishra and Marek Suppa and Noah Ziems and Saiteja Utpala and Ming Xu and Guangyu Song and Kaixin Li and Yuhan Cao and Bo Liu and Zheng Liu and Sabina Abdurakhmanova and Wenhao Yu and Mengzhao Jia and Jihan Yao and Kenneth Hamilton and Kumar Shridhar and Minh Chien Vu and Dingmin Wang and Jiawei Liu and Zijian Wang and Qian Liu and Binyuan Hui and Meg Risdal and Ahsen Khaliq and Atin Sood and Zhenchang Xing and Wasi Uddin Ahmad and John Grundy and David Lo and Banghua Zhu and Xiaoning Du and Torsten Scholak and Leandro von Werra},
  year={2025}
}
```
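As a rough illustration of what "consistency with human preferences" in the Sample Usage section can mean, the sketch below computes a simple agreement rate between human votes and judge-model verdicts. It is not the repository's implementation: the label set (`"A"`, `"B"`, `"tie"`), the function names, and the toy data are all assumptions made for this example, and the actual scripts may use a different schema and different metrics.

```python
from collections import Counter

# Hypothetical label set for pairwise preferences; the real dataset
# schema may use different names or additional categories.
LABELS = {"A", "B", "tie"}


def agreement_rate(human, judge):
    """Return the fraction of samples where the judge matches the human vote."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have equal length")
    if not human:
        raise ValueError("need at least one sample")
    unknown = (set(human) | set(judge)) - LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def confusion(human, judge):
    """Count (human_label, judge_label) pairs to show where the two disagree."""
    return Counter(zip(human, judge))


# Toy data, invented for illustration only.
human_votes = ["A", "B", "tie", "A", "B"]
judge_votes = ["A", "B", "A", "A", "A"]

rate = agreement_rate(human_votes, judge_votes)
print(f"agreement: {rate:.2f}")  # 3 of 5 verdicts match -> agreement: 0.60
print(confusion(human_votes, judge_votes))
```

The confusion counts are included because a single agreement number hides whether a judge errs mostly on ties or on clear wins, which is often the more useful thing to inspect.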

Provider: maas
Created: 2025-10-11