AweAI-Team/BeyondSWE-harbor
Source: Hugging Face (updated 2026-03-23, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/AweAI-Team/BeyondSWE-harbor
---
language:
- en
license: cc-by-4.0
size_categories:
- n<1K
task_categories:
- text-generation
pretty_name: BeyondSWE-harbor
homepage: https://github.com/AweAI-Team/BeyondSWE
tags:
- text
- datasets
- code
---
<div align="center">
<h1 style="font-size: 30px; font-weight: 700; line-height: 1.2; margin: 0;">
BeyondSWE-harbor
</h1>
[Paper](http://arxiv.org/abs/2603.03194)
[GitHub](https://github.com/AweAI-Team/BeyondSWE)
[Harbor](https://github.com/harbor-framework/harbor)
[AweAgent](https://github.com/AweAI-Team/AweAgent)
[Project Page](https://aweai-team.github.io/BeyondSWE/)
[License](LICENSE)
</div>
<p align="center">
<img src="figures/beyondswe.png" width="100%" />
</p>
This repository provides the **harbor version** of the BeyondSWE benchmark, containing the full task instances in a **directory-based (harbor) format**, where each instance is stored as an independent folder.
📌 For benchmark definition, data format and detailed evaluation results, please refer to: 🤗 [Main Dataset on HuggingFace](https://huggingface.co/datasets/AweAI-Team/BeyondSWE)
---
## 🗂️ Data Structure
```text
beyondswe/
├── {instance_id}/
│ ├── environment/
│ ├── solution/
│ ├── tests/
│ ├── instruction.md
│ └── task.toml
```
Each directory corresponds to a single task instance. For more details about the format, please refer to the official repository: 💻 [Harbor Dataset Format (Official)](https://github.com/laude-institute/harbor-datasets)
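As a rough illustration of the layout above, the following sketch checks each instance folder for the five entries shown in the tree. Treating all five as mandatory is an assumption made here for illustration, not something the Harbor format is confirmed to require, and the `data/beyondswe` path and function names are hypothetical.

```python
from pathlib import Path

# Entries shown in the layout above. Treating all of them as required
# is an assumption of this sketch.
EXPECTED_DIRS = ("environment", "solution", "tests")
EXPECTED_FILES = ("instruction.md", "task.toml")


def missing_entries(instance_dir: Path) -> list[str]:
    """Return the expected entries that are absent from one instance directory."""
    missing = [name for name in EXPECTED_DIRS if not (instance_dir / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (instance_dir / name).is_file()]
    return missing


def check_dataset(root: Path) -> dict[str, list[str]]:
    """Map each incomplete instance_id to the entries it is missing."""
    report: dict[str, list[str]] = {}
    for instance_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_entries(instance_dir)
        if missing:
            report[instance_dir.name] = missing
    return report


if __name__ == "__main__":
    root = Path("data/beyondswe")  # hypothetical local path
    if root.is_dir():
        for instance_id, missing in check_dataset(root).items():
            print(f"{instance_id}: missing {', '.join(missing)}")
    else:
        print(f"{root} not found")
```

An empty report means every instance folder contains the entries listed in the tree.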
---
## 🚀 Quick Start
We use the [Harbor framework](https://github.com/harbor-framework/harbor) to evaluate coding agents such as **Claude Code** on BeyondSWE-harbor:
**1. Download the dataset**
You can download the benchmark data using the `huggingface_hub` library:
```python
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="AweAI-Team/BeyondSWE-harbor",
repo_type="dataset",
local_dir="data",
)
```
**We recommend** using `git clone` to avoid HuggingFace API rate limits:
```bash
# Make sure git-lfs is installed: https://git-lfs.com
git lfs install
git clone https://huggingface.co/datasets/AweAI-Team/BeyondSWE-harbor data
```
This will download two directories into `data/`:
- `beyondswe/` — 500 Harbor task directories (each containing `task.toml`, `instruction.md`, `environment/`, `tests/`, `solution/`)
- `doc2repo_test_suite/` — test suite ZIP files for Doc2Repo evaluation (already bundled inside each task's `tests/test_suite.zip`, included here for reference)
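To confirm that a download completed, a short sketch like the one below can count what landed in `data/`. It assumes each task directory sits directly under `beyondswe/` and contains a `task.toml`, as described above. Where the ZIP files sit inside `doc2repo_test_suite/` is not specified here, so the sketch searches that folder recursively. The function name is hypothetical.

```python
from pathlib import Path

EXPECTED_TASKS = 500  # task count stated in the list above


def summarize_download(data_dir: Path) -> dict[str, int]:
    """Count task directories and test-suite archives under the download root."""
    tasks_root = data_dir / "beyondswe"
    suites_root = data_dir / "doc2repo_test_suite"
    # A task is counted when its directory holds a task.toml file.
    n_tasks = sum(1 for _ in tasks_root.glob("*/task.toml")) if tasks_root.is_dir() else 0
    # ZIP location inside the folder is unknown, so search recursively.
    n_suites = sum(1 for _ in suites_root.rglob("*.zip")) if suites_root.is_dir() else 0
    return {"tasks": n_tasks, "test_suite_zips": n_suites}


if __name__ == "__main__":
    summary = summarize_download(Path("data"))
    print(summary)
    if summary["tasks"] != EXPECTED_TASKS:
        print(f"expected {EXPECTED_TASKS} tasks; the download may be incomplete")
```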
**2. Install Harbor**
```bash
uv tool install harbor
# or
pip install harbor
```
See the [Harbor repository](https://github.com/harbor-framework/harbor) for more details.
**3. Configure API credentials**
To evaluate Claude Code, you will need an Anthropic API key or an OAuth token:
```bash
export ANTHROPIC_API_KEY=<YOUR-KEY>
# or, if using OAuth:
export CLAUDE_CODE_OAUTH_TOKEN=<YOUR-TOKEN>
```
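Before launching a long run, it can help to confirm that one of the two variables is actually set. The sketch below is a minimal pre-flight check; the helper name is hypothetical, and it only tests that a variable is present and non-empty, not that the credential is valid.

```python
import os

# Variable names taken from the export commands above.
CREDENTIAL_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN")


def available_credentials(env=None) -> list[str]:
    """Return the names of credential variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in CREDENTIAL_VARS if env.get(name)]


if __name__ == "__main__":
    found = available_credentials()
    if found:
        print("credentials found:", ", ".join(found))
    else:
        print("set one of:", ", ".join(CREDENTIAL_VARS))
```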
**4. Run evaluation**
```bash
harbor run --path data/beyondswe \
--agent claude-code \
--model anthropic/claude-opus-4-6 \
--n-concurrent 1 \
--ak max_turns=200 \
--ak reasoning_effort=high \
--ak "disallowed_tools='Bash(git log * --all*) Bash(git verify-pack *) Bash(git fsck *) Bash(git cat-file *) Bash(git fetch *) Bash(git pull *)'"
```
Key parameters:
- `--agent claude-code` — use Claude Code as the coding agent
- `--model` — LLM model to use (e.g., `anthropic/claude-opus-4-6`)
- `--n-concurrent` - maximum number of trials to run concurrently
- `--ak max_turns=200` — allow up to 200 agent iterations
- `--ak reasoning_effort=high` — enable extended thinking
- `--ak disallowed_tools=...` — restrict git history commands to prevent data leakage
- `-t <task_name>` — run a specific instance (e.g., `-t pylons_plaster_pastedeploy_pr14`)
To see all supported agents and other options, run:
```bash
harbor run --help
```
Results will be saved in the `jobs/` directory. Each trial contains:
- `result.json` — score, timing, token usage, and exception info
- `agent/trajectory.json` — full agent trajectory (steps, tool calls, reasoning)
- `verifier/reward.txt` — evaluation reward (1.0 = resolved, 0.0 = failed; Doc2Repo uses fractional scores)
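The per-trial rewards can be aggregated with a few lines of Python. The sketch below is one possible approach: it assumes only that each trial directory somewhere under `jobs/` contains `verifier/reward.txt` holding a single number, and it searches recursively because the exact nesting depth is not stated here. Function names are hypothetical.

```python
from pathlib import Path


def collect_rewards(jobs_dir: Path) -> dict[str, float]:
    """Map each trial directory (relative to jobs_dir) to its reward."""
    rewards: dict[str, float] = {}
    for reward_file in sorted(jobs_dir.rglob("reward.txt")):
        if reward_file.parent.name != "verifier":
            continue  # only accept <trial>/verifier/reward.txt
        trial_dir = reward_file.parent.parent
        try:
            value = float(reward_file.read_text().strip())
        except ValueError:
            continue  # skip files that do not hold a number
        rewards[str(trial_dir.relative_to(jobs_dir))] = value
    return rewards


def summarize(rewards: dict[str, float]) -> dict[str, float]:
    """Compute the trial count, mean reward, and number of fully resolved trials."""
    values = list(rewards.values())
    if not values:
        return {"trials": 0, "mean_reward": 0.0, "fully_resolved": 0}
    return {
        "trials": len(values),
        "mean_reward": sum(values) / len(values),
        "fully_resolved": sum(1 for v in values if v == 1.0),
    }


if __name__ == "__main__":
    jobs = Path("jobs")
    if jobs.is_dir():
        print(summarize(collect_rewards(jobs)))
    else:
        print("no jobs/ directory found")
```

Because Doc2Repo trials report fractional scores, the mean reward and the fully resolved count are kept as separate figures.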
---
## 📝 Citation
If you find BeyondSWE useful in your research, please cite our paper:
```bibtex
@misc{beyondswe2026,
title={BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing},
author={Guoxin Chen and Fanzhe Meng and Jiale Zhao and Minghao Li and Daixuan Cheng and Huatong Song and Jie Chen and Yuzhi Lin and Hui Chen and Xin Zhao and Ruihua Song and Chang Liu and Cheng Chen and Kai Jia and Ji-Rong Wen},
year={2026},
eprint={2603.03194},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.03194},
}
```
---
## 📄 License
This project is licensed under the CC BY 4.0 License — see the [LICENSE](LICENSE) file for details.
Provider: AweAI-Team



