ronantakizawa/codereview-bench
Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ronantakizawa/codereview-bench
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
- code
tags:
- code-review
- benchmark
- code-generation
- software-engineering
size_categories:
- 10K<n<100K
configs:
- config_name: code-editing
  data_files:
  - split: train
    path: data/code-editing/train/*.parquet
  - split: test
    path: data/code-editing/test/*.parquet
  - split: validation
    path: data/code-editing/validation/*.parquet
- config_name: comment-generation
  data_files:
  - split: train
    path: data/comment-generation/train/*.parquet
  - split: test
    path: data/comment-generation/test/*.parquet
  - split: validation
    path: data/comment-generation/validation/*.parquet
- config_name: default
  data_files:
  - split: train
    path: data/code-editing/train/*.parquet
  - split: test
    path: data/code-editing/test/*.parquet
  - split: validation
    path: data/code-editing/validation/*.parquet
---
# CodeReview-Bench
A benchmark for evaluating models on two code review tasks, curated from [ronantakizawa/github-codereview](https://huggingface.co/datasets/ronantakizawa/github-codereview).
## Tasks
### 1. Code Editing
Given code and a reviewer comment, apply the requested change.
- **Input**: `before_code`, `reviewer_comment`, `language`, `diff_context`
- **Target**: `after_code`
```python
from datasets import load_dataset

# Load the code-editing config and take the first held-out example.
ds = load_dataset("ronantakizawa/codereview-bench", "code-editing")
example = ds["test"][0]

# Build a prompt from the reviewer comment and the original code.
prompt = f"""Apply the following review comment to the code.
Review: {example['reviewer_comment']}
Code:
{example['before_code']}
Updated code:"""
```
### 2. Comment Generation
Given a code diff, generate the review comment a human reviewer would write.
- **Input**: `before_code`, `after_code`, `diff_context`, `language`
- **Target**: `reviewer_comment`
```python
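from datasets import load_dataset

# Load the comment-generation config and take the first held-out example.
ds = load_dataset("ronantakizawa/codereview-bench", "comment-generation")
example = ds["test"][0]

# Build a prompt that shows the code before and after the change.
prompt = f"""Review the following code change and provide feedback.
Before:
{example['before_code']}
After:
{example['after_code']}
Review comment:"""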
```
## Filtering Criteria
This benchmark is a quality-filtered subset of the full `ronantakizawa/github-codereview` dataset:
| Filter | Threshold |
|--------|-----------|
| Positive examples only | `is_negative = False` |
| Quality score | >= 0.5 |
| Comment length | >= 50 characters |
| Code context | >= 10 lines (before and after) |
| Comment types | bug, security, performance, refactor, suggestion |
Excluded: nitpick, style, question, and negative examples.
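The thresholds above can be read as a single row-level predicate. The following is a minimal sketch, not the pipeline actually used to build the benchmark. It assumes the source rows expose `is_negative`, `quality_score`, `comment_type`, `reviewer_comment`, `before_code`, and `after_code`, and it interprets "before and after" as requiring at least 10 lines in each of the two code fields. The helper name `passes_filters` is hypothetical.

```python
# Comment types kept by the benchmark, per the filtering table.
KEPT_TYPES = {"bug", "security", "performance", "refactor", "suggestion"}


def passes_filters(row: dict) -> bool:
    """Return True if a source row meets every threshold in the table.

    Assumption: "code context >= 10 lines (before and after)" means that
    before_code and after_code must each have at least 10 lines.
    """
    def line_count(code: str) -> int:
        return len(code.splitlines())

    return (
        not row["is_negative"]
        and row["quality_score"] >= 0.5
        and len(row["reviewer_comment"]) >= 50
        and line_count(row["before_code"]) >= 10
        and line_count(row["after_code"]) >= 10
        and row["comment_type"] in KEPT_TYPES
    )
```

With the `datasets` library, a predicate like this could be passed to `Dataset.filter` on the source dataset, provided the column names match.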
## Schema
### Code Editing
| Column | Type | Role |
|--------|------|------|
| `before_code` | string | Input |
| `reviewer_comment` | string | Input |
| `language` | string | Input |
| `diff_context` | string | Input |
| `after_code` | string | Target |
| `repo_name` | string | Metadata |
| `file_path` | string | Metadata |
| `comment_type` | string | Metadata |
| `quality_score` | float | Metadata |
### Comment Generation
| Column | Type | Role |
|--------|------|------|
| `before_code` | string | Input |
| `after_code` | string | Input |
| `diff_context` | string | Input |
| `language` | string | Input |
| `reviewer_comment` | string | Target |
| `repo_name` | string | Metadata |
| `file_path` | string | Metadata |
| `comment_type` | string | Metadata |
| `quality_score` | float | Metadata |
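The role columns in the two schema tables can be encoded directly, which keeps metadata such as `repo_name` out of model inputs by construction. This is an illustrative sketch: the `ROLES` mapping and the `split_row` helper are hypothetical names, and only the column names and config names come from this card.

```python
# Column roles per config, copied from the schema tables.
ROLES = {
    "code-editing": {
        "inputs": ["before_code", "reviewer_comment", "language", "diff_context"],
        "target": "after_code",
    },
    "comment-generation": {
        "inputs": ["before_code", "after_code", "diff_context", "language"],
        "target": "reviewer_comment",
    },
}
# Metadata columns are the same for both configs.
METADATA = ["repo_name", "file_path", "comment_type", "quality_score"]


def split_row(row: dict, task: str) -> tuple[dict, str, dict]:
    """Split one row into (inputs, target, metadata) for the given config name."""
    roles = ROLES[task]
    inputs = {name: row[name] for name in roles["inputs"]}
    metadata = {name: row[name] for name in METADATA}
    return inputs, row[roles["target"]], metadata
```

For example, `split_row(ds["test"][0], "code-editing")` would return the four input fields, the `after_code` target, and the metadata as separate objects.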
## Evaluation
### Code Editing
- **CodeBLEU**: Measures structural and syntactic similarity of generated code
- **Exact match**: Percentage of outputs matching the target exactly
- **Edit similarity**: Similarity score based on the normalized edit distance between generated and target code
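Exact match and edit similarity are simple enough to sketch in pure Python. This card does not specify the exact formulas, so the version below rests on two assumptions: exact match ignores leading and trailing whitespace, and edit similarity is one minus the character-level Levenshtein distance divided by the longer string's length. CodeBLEU is omitted because it needs a language-aware parser.

```python
def exact_match(prediction: str, target: str) -> bool:
    """True when prediction equals target after trimming outer whitespace (assumed)."""
    return prediction.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, target: str) -> float:
    """1 - distance / longer length (assumed formula); 1.0 means identical."""
    longest = max(len(prediction), len(target))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, target) / longest
```

The distance computation is quadratic in string length, so for long files a dedicated library is likely a better choice than this sketch.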
### Comment Generation
- **BERTScore**: Semantic similarity between generated and reference comments
- **ROUGE-L**: Longest common subsequence overlap
- **Human evaluation**: Recommended for final assessment, because automated metrics correlate poorly with review quality
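To make the ROUGE-L bullet concrete, here is a simplified sketch that scores a generated comment against the reference using the longest common subsequence of tokens. It assumes lowercased, whitespace-split tokens and reports the F1 variant. Published ROUGE implementations differ in tokenization and stemming, so scores from this sketch are not directly comparable to theirs.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens (simplified)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0  # also covers an empty candidate or reference
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards shared word order rather than shared meaning, a correct review phrased differently from the reference can still score low.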
## Splits
| Split | Description |
|-------|-------------|
| train | Training data (90%) |
| test | Held-out evaluation (5%) |
| validation | Development/tuning (5%) |
Splits are assigned deterministically by repository, so no repo appears in more than one split.
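One common way to get a split with this property is to hash the repository name and bucket the result. The sketch below is an assumption about how such a split could be produced, not a description of how this benchmark was actually built. The function name `assign_split` and the choice of SHA-256 are hypothetical.

```python
import hashlib


def assign_split(repo_name: str) -> str:
    """Map a repository name to a split in a fixed, roughly 90/5/5 ratio."""
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "test"
    return "validation"
```

Every row from the same repository lands in the same split because the assignment depends only on `repo_name`. Note that under this scheme the 90/5/5 ratio holds over repositories, so the row-level proportions can drift when repositories differ in size.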
## Citation
```bibtex
@dataset{takizawa2026codereviewbench,
title={CodeReview-Bench: A Benchmark for Review-Driven Code Changes},
author={Takizawa, Ronan},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/ronantakizawa/codereview-bench}
}
```
Provided by:
ronantakizawa



