autocodearena-v0

Name: autocodearena-v0
Creator: maas
Published: 2026-01-07 17:32:37
License: No description available

ModelScope community. Updated: 2026-01-07. Indexed: 2025-11-03.

Download link:

https://modelscope.cn/datasets/bigcode/autocodearena-v0


Resource description:

# BigCodeArena Dataset

**BigCodeArena** is an open human evaluation platform for code generation, built on top of Chatbot Arena with a comprehensive, on-the-fly execution environment. Unlike traditional evaluation platforms, BigCodeArena executes LLM-generated code and lets humans interact with the execution process and its outcomes, addressing the difficulty of judging code quality by manual inspection.

<p align="center"> <img src="https://raw.githubusercontent.com/bigcode-project/bigcodearena/refs/heads/main/assets/bigcodearena_banner.svg" alt="BigCodeArena" width="800"> </p>

This dataset repository, `bigcode/bigcodereward`, contains a subset of the data collected through the BigCodeArena platform. It includes over 14,000 raw code-centric conversation sessions, from which more than 4,700 multi-turn samples with pairwise human preferences were identified. These human preferences are used to measure how consistently reward models agree with human judgments in code generation scenarios, and they form the basis of the **BigCodeReward** benchmark.

* **Paper**: [BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution](https://huggingface.co/papers/2510.08697)
* **Project Page (Hugging Face Space)**: [BigCodeArena Space](https://huggingface.co/spaces/bigcode/arena)
* **Code (GitHub Repository)**: [bigcode-project/bigcodearena](https://github.com/bigcode-project/bigcodearena)

## Sample Usage

This dataset is used by the **BigCodeReward** component of the BigCodeArena project to evaluate how well reward models align with human preferences on code generation tasks. The following steps, adapted from the [GitHub repository](https://github.com/bigcode-project/bigcodearena#bigcodereward), demonstrate how to use the data for evaluation.

First, clone the `bigcodearena` repository and install the dependencies for `BigCodeReward`:

```bash
git clone https://github.com/bigcode-project/bigcodearena.git
cd bigcodearena/bigcodereward
pip install -r requirements.txt
```

Next, set your API key for the judge models (e.g., OpenAI) and run the evaluation scripts:

```bash
# Set your API key
export OPENAI_API_KEY="sk-..."

# Evaluate with execution results (recommended)
# This will use the BigCodeArena dataset to evaluate the specified judge model.
python eval_hf_data.py --judge-model gpt-4o --workers 8

# Evaluate code-only (without execution)
python eval_hf_data.py --judge-model gpt-4o --no-output --workers 8

# Analyze consistency with human preferences
python analyze_model_judge_results.py

# Compute Elo ratings and correlations
python analyze_elo.py
```

For more details on `BigCodeReward`, `AutoCodeArena`, and configuration, please refer to the [GitHub README](https://github.com/bigcode-project/bigcodearena).

## Citation

If you find this dataset or the BigCodeArena project useful for your research, please cite the following paper:

```bibtex
@article{zhuo2025bigcodearena,
  title={BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution},
  author={Terry Yue Zhuo and Xiaolong Jin and Hange Liu and Juyong Jiang and Tianyang Liu and Chen Gong and Bhupesh Bishnoi and Vaisakhi Mishra and Marek Suppa and Noah Ziems and Saiteja Utpala and Ming Xu and Guangyu Song and Kaixin Li and Yuhan Cao and Bo Liu and Zheng Liu and Sabina Abdurakhmanova and Wenhao Yu and Mengzhao Jia and Jihan Yao and Kenneth Hamilton and Kumar Shridhar and Minh Chien Vu and Dingmin Wang and Jiawei Liu and Zijian Wang and Qian Liu and Binyuan Hui and Meg Risdal and Ahsen Khaliq and Atin Sood and Zhenchang Xing and Wasi Uddin Ahmad and John Grundy and David Lo and Banghua Zhu and Xiaoning Du and Torsten Scholak and Leandro von Werra},
  year={2025}
}
```

## License

This dataset is licensed under the BigCode OpenRAIL-M License.
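
## Illustrative Sketch: Judge Agreement and Elo

The Sample Usage steps mention analyzing consistency with human preferences and computing Elo ratings. The sketch below shows, under stated assumptions, what those two quantities can look like on pairwise preference data. It is not the implementation in `analyze_model_judge_results.py` or `analyze_elo.py`. The label encoding (`"a"`, `"b"`, `"tie"`), the function names, the model names, and the constants `k=32` and `base=1000` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def agreement_rate(human_labels, judge_labels):
    """Return the fraction of samples where the judge picks the same side as the human.

    Labels are assumed (hypothetically) to be "a", "b", or "tie".
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        return 0.0
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def elo_ratings(battles, k=32.0, base=1000.0):
    """Apply sequential Elo updates over (model_a, model_b, winner) tuples.

    winner is "a", "b", or "tie"; every model starts at `base`.
    """
    ratings = defaultdict(lambda: base)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = score_for_a[winner]
        delta = k * (actual_a - expected_a)
        # The update is zero-sum: what model_a gains, model_b loses.
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical labels for four pairwise comparisons.
    human = ["a", "b", "tie", "a"]
    judge = ["a", "b", "a", "a"]
    print("agreement:", agreement_rate(human, judge))

    # Hypothetical battles between two placeholder model names.
    battles = [("model_x", "model_y", "a"), ("model_y", "model_x", "tie")]
    print("ratings:", elo_ratings(battles))
```

Sequential Elo depends on the order of the battles, so arena-style leaderboards often use order-independent fits or bootstrapping instead. Which approach the project uses is documented in the repository, not here.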


Provider:

maas

Created:

2025-10-11
