ronantakizawa/codereview-bench

Name: ronantakizawa/codereview-bench
Creator: ronantakizawa
Published: 2026-03-05 09:16:30
License: MIT

Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ronantakizawa/codereview-bench


Description:

---
license: mit
task_categories:
- text-generation
language:
- en
- code
tags:
- code-review
- benchmark
- code-generation
- software-engineering
size_categories:
- 10K<n<100K
configs:
- config_name: code-editing
  data_files:
  - split: train
    path: data/code-editing/train/*.parquet
  - split: test
    path: data/code-editing/test/*.parquet
  - split: validation
    path: data/code-editing/validation/*.parquet
- config_name: comment-generation
  data_files:
  - split: train
    path: data/comment-generation/train/*.parquet
  - split: test
    path: data/comment-generation/test/*.parquet
  - split: validation
    path: data/comment-generation/validation/*.parquet
- config_name: default
  data_files:
  - split: train
    path: data/code-editing/train/*.parquet
  - split: test
    path: data/code-editing/test/*.parquet
  - split: validation
    path: data/code-editing/validation/*.parquet
---

# CodeReview-Bench

A benchmark for evaluating models on two code review tasks, curated from [ronantakizawa/github-codereview](https://huggingface.co/datasets/ronantakizawa/github-codereview).

## Tasks

### 1. Code Editing

Given code and a reviewer comment, apply the requested change.

- **Input**: `before_code`, `reviewer_comment`, `language`, `diff_context`
- **Target**: `after_code`

```python
from datasets import load_dataset

# Load the code-editing configuration and take one held-out example.
ds = load_dataset("ronantakizawa/codereview-bench", "code-editing")
example = ds["test"][0]

# Build a prompt asking the model to apply the reviewer's comment.
prompt = f"""Apply the following review comment to the code.

Review: {example['reviewer_comment']}

Code:
{example['before_code']}

Updated code:"""
```

### 2. Comment Generation

Given a code diff, generate the review comment a human reviewer would write.

- **Input**: `before_code`, `after_code`, `diff_context`, `language`
- **Target**: `reviewer_comment`

```python
from datasets import load_dataset

# Load the comment-generation configuration and take one held-out example.
ds = load_dataset("ronantakizawa/codereview-bench", "comment-generation")
example = ds["test"][0]

# Build a prompt asking the model to write the review comment.
prompt = f"""Review the following code change and provide feedback.

Before:
{example['before_code']}

After:
{example['after_code']}

Review comment:"""
```

## Filtering Criteria

This benchmark is a quality-filtered subset of the full dataset:

| Filter | Threshold |
|--------|-----------|
| Positive examples only | `is_negative = False` |
| Quality score | >= 0.5 |
| Comment length | >= 50 characters |
| Code context | >= 10 lines (before and after) |
| Comment types | bug, security, performance, refactor, suggestion |

Excluded: nitpick, style, question, and negative examples.

## Schema

### Code Editing

| Column | Type | Role |
|--------|------|------|
| `before_code` | string | Input |
| `reviewer_comment` | string | Input |
| `language` | string | Input |
| `diff_context` | string | Input |
| `after_code` | string | Target |
| `repo_name` | string | Metadata |
| `file_path` | string | Metadata |
| `comment_type` | string | Metadata |
| `quality_score` | float | Metadata |

### Comment Generation

| Column | Type | Role |
|--------|------|------|
| `before_code` | string | Input |
| `after_code` | string | Input |
| `diff_context` | string | Input |
| `language` | string | Input |
| `reviewer_comment` | string | Target |
| `repo_name` | string | Metadata |
| `file_path` | string | Metadata |
| `comment_type` | string | Metadata |
| `quality_score` | float | Metadata |

## Evaluation

### Code Editing

- **CodeBLEU**: Measures structural and syntactic similarity between generated and target code
- **Exact match**: Percentage of outputs matching the target exactly
- **Edit similarity**: Normalized edit distance between generated and target code

### Comment Generation

- **BERTScore**: Semantic similarity between generated and reference comments
- **ROUGE-L**: Longest common subsequence overlap
- **Human evaluation**: Recommended for final assessment, because automated metrics correlate poorly with review quality

## Splits

| Split | Description |
|-------|-------------|
| train | Training data (90%) |
| test | Held-out evaluation (5%) |
| validation | Development/tuning (5%) |

Splits are assigned deterministically by repository, so no repository appears in more than one split.

## Citation

```bibtex
@dataset{takizawa2026codereviewbench,
  title={CodeReview-Bench: A Benchmark for Review-Driven Code Changes},
  author={Takizawa, Ronan},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/ronantakizawa/codereview-bench}
}
```
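
Two of the code-editing metrics listed under Evaluation, exact match and edit similarity, can be sketched in a few lines. This is a minimal illustration, not the benchmark's official scorer. The card does not say how edit distance is normalized or whether whitespace is ignored for exact match, so dividing the Levenshtein distance by the longer string's length and stripping surrounding whitespace are both assumptions, as are the function names.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), via DP."""
    # Keep the shorter string in the inner loop to minimize row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, target: str) -> float:
    """1 - distance / max length (assumed normalization); 1.0 means identical."""
    longest = max(len(prediction), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, target) / longest


def exact_match_rate(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions equal to their target after stripping
    surrounding whitespace (the stripping is an assumption)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return matches / len(predictions)


if __name__ == "__main__":
    # Toy strings standing in for model outputs and `after_code` targets.
    preds = ["return a + b\n", "return a - b"]
    golds = ["return a + b", "return a * b"]
    print("exact match:", exact_match_rate(preds, golds))
    print("edit similarity:", [edit_similarity(p, g) for p, g in zip(preds, golds)])
```

In practice `predictions` would be model outputs and `targets` the `after_code` column of the `test` split. The character-level distance here is quadratic in string length, so long files may call for a token-level variant or an optimized library.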

Provided by:

ronantakizawa
