AweAI-Team/BeyondSWE-harbor

Name: AweAI-Team/BeyondSWE-harbor
Creator: AweAI-Team
Published: 2026-03-23 08:14:51
License: CC BY 4.0

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/AweAI-Team/BeyondSWE-harbor

Description:

---
language:
- en
license: cc-by-4.0
size_categories:
- n<1K
task_categories:
- text-generation
pretty_name: BeyondSWE-harbor
homepage: https://github.com/AweAI-Team/BeyondSWE
tags:
- text
- datasets
- code
---

<div align="center">
<h1 style="font-size: 30px; font-weight: 700; line-height: 1.2; margin: 0;">
BeyondSWE-harbor
</h1>

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b.svg?logo=arxiv&logoColor=white)](http://arxiv.org/abs/2603.03194)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github&logoColor=white)](https://github.com/AweAI-Team/BeyondSWE)
[![Hugging Face Datasets](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-blue)](https://github.com/harbor-framework/harbor)
[![Evaluation Framework](https://img.shields.io/badge/%F0%9F%8F%97%EF%B8%8F%20Evaluation%20Framework-harbor-yellow.svg)](https://github.com/AweAI-Team/AweAgent)
[![Website](https://img.shields.io/badge/%F0%9F%8C%90_Project-Website-blue.svg)](https://aweai-team.github.io/BeyondSWE/)
[![License](https://img.shields.io/badge/License-CC%20BY%204.0-green.svg)](LICENSE)

</div>

<p align="center">
  <img src="figures/beyondswe.png" width="100%" />
</p>

This repository provides the **harbor version** of the BeyondSWE benchmark, containing the full task instances in a **directory-based (harbor) format**, where each instance is stored as an independent folder.

📌 For benchmark definition, data format and detailed evaluation results, please refer to:

🤗 [Main Dataset on HuggingFace](https://huggingface.co/datasets/AweAI-Team/BeyondSWE)

---

## 🗂️ Data Structure

```text
beyondswe/
├── {instance_id}/
│   ├── environment/
│   ├── solution/
│   ├── tests/
│   ├── instruction.md
│   └── task.toml
```

Each directory corresponds to a single task instance.
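
As a quick sanity check on a downloaded copy, the layout above can be verified with a short script. This is a minimal sketch, not part of the official tooling: the helper names `missing_entries` and `check_dataset` are hypothetical, the `data/beyondswe` path assumes the download location used in the Quick Start, and it assumes every instance ships all five entries shown in the tree.

```python
from pathlib import Path

# Entries taken from the Data Structure tree; that every instance
# contains all of them is an assumption of this sketch.
EXPECTED_DIRS = ("environment", "solution", "tests")
EXPECTED_FILES = ("instruction.md", "task.toml")


def missing_entries(instance_dir: Path) -> list[str]:
    """Return the expected entries that are absent from one instance folder."""
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (instance_dir / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (instance_dir / name).is_file()]
    return missing


def check_dataset(root: Path) -> dict[str, list[str]]:
    """Map each incomplete instance_id under `root` to its missing entries."""
    report = {}
    for instance_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_entries(instance_dir)
        if missing:
            report[instance_dir.name] = missing
    return report


if __name__ == "__main__":
    root = Path("data/beyondswe")  # assumed download location
    if root.is_dir():
        for instance_id, missing in check_dataset(root).items():
            print(f"{instance_id}: missing {', '.join(missing)}")
```

An empty report means every instance folder matches the tree; any printed line names the instance and the entries it lacks.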
For more details about the format, please refer to the official repository:

💻 [Harbor Dataset Format (Official)](https://github.com/laude-institute/harbor-datasets)

---

## 🚀 Quick Start

We use the [Harbor framework](https://github.com/harbor-framework/harbor) to evaluate coding agents such as **Claude Code** on BeyondSWE-harbor:

**1. Install the Dataset**

You can download the benchmark data using the `huggingface_hub` library:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AweAI-Team/BeyondSWE-harbor",
    repo_type="dataset",
    local_dir="data",
)
```

**We recommend** using `git clone` instead to avoid HuggingFace API rate limits:

```bash
# Make sure git-lfs is installed: https://git-lfs.com
git lfs install
git clone https://huggingface.co/datasets/AweAI-Team/BeyondSWE-harbor data
```

This will download two directories into `data/`:

- `beyondswe/`: 500 Harbor task directories (each containing `task.toml`, `instruction.md`, `environment/`, `tests/`, `solution/`)
- `doc2repo_test_suite/`: test suite ZIP files for Doc2Repo evaluation (already bundled inside each task's `tests/test_suite.zip`, included here for reference)

**2. Install Harbor**

```bash
uv tool install harbor
# or
pip install harbor
```

See the [Harbor repository](https://github.com/harbor-framework/harbor) for more details.

**3. Configure API credentials**

To evaluate Claude Code, you will need an Anthropic API key or OAuth token:

```bash
export ANTHROPIC_API_KEY=<YOUR-KEY>
# or, if using OAuth:
export CLAUDE_CODE_OAUTH_TOKEN=<YOUR-TOKEN>
```

**4. Run evaluation**

```bash
harbor run --path data/beyondswe \
    --agent claude-code \
    --model anthropic/claude-opus-4-6 \
    --n-concurrent 1 \
    --ak max_turns=200 \
    --ak reasoning_effort=high \
    --ak "disallowed_tools='Bash(git log * --all*) Bash(git verify-pack *) Bash(git fsck *) Bash(git cat-file *) Bash(git fetch *) Bash(git pull *)'"
```

Key parameters:

- `--agent claude-code`: use Claude Code as the coding agent
- `--model`: LLM model to use (e.g., `anthropic/claude-opus-4-6`)
- `--n-concurrent`: concurrency limit
- `--ak max_turns=200`: allow up to 200 agent iterations
- `--ak reasoning_effort=high`: enable extended thinking
- `--ak disallowed_tools=...`: restrict git history commands to prevent data leakage
- `-t <task_name>`: run a specific instance (e.g., `-t pylons_plaster_pastedeploy_pr14`)

To see all supported agents and other options, run:

```bash
harbor run --help
```

Results will be saved in the `jobs/` directory. Each trial contains:

- `result.json`: score, timing, token usage, and exception info
- `agent/trajectory.json`: full agent trajectory (steps, tool calls, reasoning)
- `verifier/reward.txt`: evaluation reward (1.0 = resolved, 0.0 = failed; Doc2Repo uses fractional scores)

---

## 📝 Citation

If you find BeyondSWE useful in your research, please cite our paper:

```bibtex
@misc{beyondswe2026,
  title={BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing},
  author={Guoxin Chen and Fanzhe Meng and Jiale Zhao and Minghao Li and Daixuan Cheng and Huatong Song and Jie Chen and Yuzhi Lin and Hui Chen and Xin Zhao and Ruihua Song and Chang Liu and Cheng Chen and Kai Jia and Ji-Rong Wen},
  year={2026},
  eprint={2603.03194},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03194},
}
```

---

## 📄 License

This project is licensed under the CC BY 4.0 License; see the [LICENSE](LICENSE) file for details.
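
As a supplement to the results layout described in the Quick Start, the per-trial rewards under `jobs/` can be aggregated with a few lines of Python. This is a rough sketch under stated assumptions rather than an official Harbor utility: the helper names `collect_rewards` and `summarize` are hypothetical, each `verifier/reward.txt` is assumed to hold a single number as plain text, and the nesting depth of trial folders inside `jobs/` is not assumed (the script searches recursively).

```python
from pathlib import Path


def collect_rewards(jobs_dir: Path) -> dict[str, float]:
    """Map each trial directory (relative to jobs_dir) to its reward."""
    rewards = {}
    for reward_file in sorted(jobs_dir.rglob("reward.txt")):
        if reward_file.parent.name != "verifier":
            continue  # only count verifier/reward.txt files
        trial_dir = reward_file.parent.parent
        try:
            value = float(reward_file.read_text().strip())
        except ValueError:
            continue  # skip empty or malformed reward files
        rewards[trial_dir.relative_to(jobs_dir).as_posix()] = value
    return rewards


def summarize(rewards: dict[str, float]) -> dict[str, float]:
    """Return trial count, number of fully resolved trials, and mean reward."""
    n = len(rewards)
    resolved = sum(1 for r in rewards.values() if r == 1.0)
    mean = sum(rewards.values()) / n if n else 0.0
    return {"trials": n, "resolved": resolved, "mean_reward": mean}


if __name__ == "__main__":
    jobs = Path("jobs")  # output directory named in the Quick Start
    if jobs.is_dir():
        print(summarize(collect_rewards(jobs)))
```

Because Doc2Repo trials use fractional scores, the mean reward and the count of fully resolved trials are reported separately instead of being collapsed into one pass rate.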

Provided by:

AweAI-Team
